VOLUME 4 2006
SPAAN FELLOW Working Papers in Second or Foreign Language Assessment
University of Michigan, English Language Institute
Spaan Fellow Working Papers in Second or Foreign Language Assessment Volume 4 2006
Edited by Jeff S. Johnson
Published by
English Language Institute
University of Michigan
401 E. Liberty, Suite 350
Ann Arbor, MI 48104-2298
[email protected] www.lsa.umich.edu/eli
First Printing, August 2006 © 2006 by the English Language Institute, University of Michigan. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher. The Regents of the University of Michigan: David A. Brandon, Laurence B. Deitch, Olivia P. Maynard, Rebecca McGowan, Andrea Fischer Newman, Andrew C. Richner, S. Martin Taylor, Katherine E. White, Mary Sue Coleman (ex officio)
Table of Contents

Spaan Fellowship Information . . . . . . . . . . ii
Previous Volume Article Index . . . . . . . . . . iii
Lingyun Gao
Toward a Cognitive Processing Model of MELAB Reading Test Item Performance . . . 1

Shudong Wang
Validation and Invariance of Factor Structure of the ECPE and MELAB across Gender . . . 41

Christopher Weaver
Evaluating the Use of Rating Scales in a High-Stakes Japanese University Entrance Examination . . . 57

Taejoon Park
Detecting DIF across Different Language and Gender Groups in the MELAB using the Logistic Regression Method . . . 81

Liz Hamp-Lyons and Alan Davies
Bias Revisited . . . 97
The University of Michigan
SPAAN FELLOWSHIP FOR STUDIES IN SECOND OR FOREIGN LANGUAGE ASSESSMENT
In recognition of Mary Spaan’s contributions to the field of language assessment for more than three decades at the University of Michigan, the English Language Institute has initiated the Spaan Fellowship Fund to provide financial support for those wishing to carry out research projects related to second or foreign language assessment and evaluation. The Spaan Fellowship has been created to fund up to six annual awards, ranging from $3,000 to $4,000 each. These fellowships are offered to cover the cost of data collection and analyses, or to defray living and/or travel expenses for those who would like to make use of the English Language Institute’s resources to carry out a research project in second or foreign language assessment or evaluation. These resources include the ELI-UM Testing and Certification Division’s extensive archival test data (ECCE, ECPE, and MELAB) and the Michigan Corpus of Academic Spoken English (MICASE). Research projects can be completed either in Ann Arbor or off-site. Applications are welcome from anyone with a research and development background related to second or foreign language assessment and evaluation, especially those interested in analyzing some aspect of the English Language Institute’s suite of tests (MELAB, ECPE, ECCE, or other ELI-UM test publications). Spaan Fellows are likely to be international second or foreign language assessment specialists or teachers who carry out test development or prepare testing-related research articles and dissertations; doctoral graduate students from one of Michigan’s universities who are studying linguistics, education, psychology, or related fields; and doctoral graduate students in foreign language assessment or psychometrics from elsewhere in the United States or abroad. For more information about the Spaan Fellowship, please visit our Web site: www.lsa.umich.edu/eli/research/spaan
Previous Volume Article Index

A Construct Validation Study of Emphasis Type Questions in the Michigan English Language Assessment Battery
Shin, Sang-Keun — wait
Toward a Cognitive Processing Model of MELAB Reading Test Item Performance Lingyun Gao University of Alberta
This study develops and tests a model of the cognitive processes hypothesized to underlie MELAB reading item performance. The analyses were performed on the reading sections of two MELAB forms using a three-pronged procedure: (1) reviewing theoretical models of L2 reading processes and constructs of L2 reading ability, (2) analyzing the cognitive demands of the MELAB reading items and collecting verbal reports of the cognitive processes actually used by examinees when correctly answering the items, and (3) examining the relationship between the proposed cognitive processes and empirical indicators of item difficulty using a cognitively based measurement model, tree-based regression (TBR). Although the results were inconsistent across forms, the processes of drawing inferences and evaluating alternative options accounted for a significant amount of the variance in MELAB reading item difficulty on both forms. The results inform the construct validation of the MELAB reading section and its item construction, and lay a foundation for using the MELAB reading section as a diagnostic measure.
Large-scale assessments are widely used for a variety of purposes such as admissions, matching students to appropriate instructional programs, and enhancing learning (National Research Council [NRC], 2001). Assessment results typically provide a percentile rank to reveal where an examinee stands relative to others, or a numeric score to indicate how the examinee has performed. A major challenge with most large-scale assessments, however, is their limited capacity to interpret more complex forms of evidence derived from examinees’ performance (Embretson & Gorin, 2001; NRC, 2001). Consequently, assessments provide very limited information to test developers and users, the validity of the inferences drawn from the assessment results is frequently questioned, and the usefulness of the assessments as a learning tool is compromised (Alderson, 2005a; Gorin, 2002; Strong-Krause, 2001).

In the last decade, within the language testing and measurement communities, there has been a growing emphasis on the union of cognitive psychology and assessments to yield meaningful information regarding examinees’ knowledge structure, skills, and strategies used during task solving (Cohen & Upton, 2005; Douglas & Hegelheimer, 2005; Embretson, 1999; Leighton, 2004; Mislevy, 1996; Mislevy, Steinberg, & Almond, 2002; Nichols, 1994; NRC, 2001). One approach to achieving this goal has been to model item statistical properties, in particular item difficulty, in terms of the cognitive processes involved in item solving (Embretson, 1998; Huff, 2003). To date, a number of models have been developed linking item statistics to item features for a variety of foreign/second language assessments (e.g., Carr, 2003; Kostin, 2004). However, only a few models have linked item statistics to the cognitive
structure of test items (e.g., Sheehan & Ginther, 2001), and many of these models are limited by the concepts and methods employed. Conceptually, due to the gaps among cognitive psychology, measurement, and subject areas, many of the existing models fail to incorporate the most current cognitive theories in a particular domain, which are critical for defining item features and interpreting the models. Methodologically, some of the most useful methods that cognitive psychologists use to understand human thought, such as task analysis, protocol analysis, and the study of reaction times, have not been widely used to explain test item performance. In addition, due to technical complexity, advanced measurement models incorporating cognitive elements, such as the rule-space model (Tatsuoka, 1995), tree-based regression (Sheehan, 1997), and Bayes inference networks (Mislevy, Almond, Yan, & Steinberg, 1999), have not been widely applied to assessment practice. Much work is required to link critical features of cognitive models specific to a substantive testing context to new measurement models and to observations that reveal meaningful cognitive processes in a particular domain (NRC, 2001).

The Michigan English Language Assessment Battery (MELAB) is developed by the English Language Institute at the University of Michigan (ELI-UM) to assess the advanced-level English language competence of adult nonnative speakers of English who will use English for academic study in an American university setting. The MELAB is used primarily for higher education admission, and the assessment results are widely accepted as evidence of English competence by educational institutions in the countries where English is the language of instruction. The MELAB consists of Part 1, a composition; Part 2, a listening test containing 50 multiple-choice items; Part 3, a grammar/cloze/vocabulary/reading test containing a total of 100 multiple-choice items; and an optional speaking test. Compositions and speaking tests are scored by trained raters using rating scales. Answer sheets for Part 2 and Part 3 are computer scanned and raw scores are converted to scale scores. The MELAB reports a score for each part and a final score, which is the average of the scores on the three parts. Current MELAB score reporting provides some information on examinees’ English competence and describes examinees’ competence in writing and speaking to some extent. However, a numeric score for Part 2 and Part 3 provides limited information to examinees, admissions officers, and other stakeholders regarding examinees’ strengths and weaknesses in listening and especially in reading, where a subscore is lacking.

Reading is a major part of language acquisition and language use activity in everyday life (Grabe & Stoller, 2002). In the context of using English as a second or foreign language for academic purposes, reading tends to be the single most important language skill and language use activity that nonnative English speakers need for academic activities (Carr, 2003; Cheng, 2003). Hence, the nature of reading in a second or foreign language and how to assess it on large-scale high-stakes tests have been a primary concern for language researchers and testers (Alderson, 2000; 2005a; 2005b; Bernhardt, 2003; Cohen & Upton, 2005).

The purpose of this study is to model the cognitive processes underlying performance on the reading items included in the MELAB using a cognitive-psychometric approach. The specific research questions (RQs) are:
1. What cognitive processes are required to correctly answer the MELAB reading items?
2. What cognitive processes are actually used by examinees when they correctly answer the MELAB reading items? How are they related to the findings in response to RQ 1?
3. To what extent do the cognitive processes used to correctly answer the MELAB reading items explain the empirical indicators of item difficulty?
Conceptualizing the Cognitive Processes Involved in L2 Academic Reading

Information-Processing Perspectives on Reading

Over the last couple of decades, the shift in psychology from a behavioral to a cognitive orientation has had an enormous impact on the understanding of reading. Bottom-up processing is an immediate left-to-right processing of the input data through a series of discrete stages (Ruddell, Ruddell, & Singer, 1994). Early theories viewed reading as bottom-up processing in which a reader passively and sequentially decoded meanings from letters, words, and sentences (e.g., Anderson, 1972; LaBerge & Samuels, 1974). Reading processes were considered to be completely under the control of the text and had little to do with the information possessed by a reader or the context of discourse (Perfetti, 1995).

Top-down processing is information processing in which readers approach the text with existing knowledge, and work down to the text (Hudson, 1998). The top-down view of reading emphasizes readers’ contributions over the incoming textual information. Two representative examples of top-down processing are psycholinguistic models (e.g., Goodman, 1967; Smith, 1971) and schema-theoretic models (e.g., Carrell, 1983a; 1983b). Psycholinguistic models stress the interaction between language and thought, especially readers’ inferential abilities, and describe reading processes as active, purposeful, and selective (Smith, 2004). Schema-theoretic models describe the reading process through the activation of schemata (i.e., networks of information organized in memory) and stress the centrality of readers’ language and content knowledge. While reading, readers apply their schemata to the text, confirm and disconfirm, and map the incoming information from the printed text onto their previously formed knowledge structures to create meaning (Hudson, 1998). Schema theory is valued for attempting to explain the integration of new information with old, but fails to explain how completely new information is processed (Alderson, 2000). Critics of schema theory point out that it does not lead to an explicit account of reading processes due to a vague definition of schema, elision of readers’ intentionality, and oversimplification of the memory retrieval and storage processes (Phillips & Norris, 2002). Carver (1992) argues that schema theory applies only when reading texts are relatively hard, such as the situation where college-level students study academic texts.

More recent theories of reading stress the simultaneous interaction between bottom-up and top-down processing (e.g., Johnston, 1984; Rumelhart, 1977; 1980; Stanovich, 1980; 2000). According to the interactive theories, readers’ multiple sources of knowledge (e.g., linguistic knowledge and world knowledge) interact continuously and simultaneously with text. Current reading theories acknowledge the interactive nature of processing, and emphasize the importance of purpose and context to fluent reading (e.g., Alderson, 2000; Enright, Grabe, Koda, Mosenthal, Mulcahy-Ernt, & Schedl, 2000; Hudson, 1998). As Butcher and Kintsch (2003) note, reading is the interaction among a variety of top-down and bottom-up processes, where readers’ knowledge, cognitive skills, strategy use, and purpose of reading are crucial during the process of reading and must be taken into account when modeling text processing.

Models of the L2 Reading Processes

As the conceptualization of reading has evolved over the years, so have models of the L2 reading processes.
Current models of the L2 reading processes generally include language knowledge, background/topical knowledge, cognitive skills, and cognitive and
metacognitive strategies. Language knowledge consists of a number of relatively independent components, such as the knowledge of phonology, vocabulary, syntax, and text structure. Major components in current models of the L2 reading processes are discussed below.

Word Recognition

Word recognition has been considered central to fluent reading in current models of reading processes of skilled adult L2 readers (e.g., Alderson, 2000; Grabe, 2002; Hudson, 1996; Koda, 2005; Urquhart & Weir, 1998). It is the process of recognizing strings of letters in print and being able to rapidly identify meanings from visual input (Rayner & Pollatsek, 1989). Unlike skilled adult L1 readers, who are generally assumed to have phonological access to the lexicon and are familiar with the script, L2 readers often encounter words that they have not heard pronounced and scripts with which they are not familiar (Urquhart & Weir, 1998). Hence, L2 readers are expected to experience greater difficulty in processing letters in a word and identifying word meanings (Alderson, 2000). In addition, unlike skilled adult L1 readers, for whom the words encountered are normally in their lexicon, L2 readers have to handle unfamiliar vocabulary (Urquhart & Weir, 1998). In the context of academic reading, where large amounts of academic text need to be processed, recognizing words and word meaning is extremely important. Inefficient word recognition and insufficient vocabulary would likely result in inefficient academic reading (Hudson, 1996).

Knowledge of Syntax

Readers must process syntax to impose meaning on the recognized words (Urquhart & Weir, 1998). Syntax is the component of a grammar that determines the way in which words are combined to form phrases and sentences (Radford, 2004). In L2 reading, syntactic knowledge is crucial for successful text processing and has been included in many models of the L2 reading processes (e.g., Alderson, 2000; Grabe, 1991; Hudson, 1996; Koda, 2005).

Knowledge of Textual Features

Readers’ knowledge of textual features, such as cohesion and text structure, has long been considered important in text processing (Alderson, 2000; Koda, 2005) and critical to successful L2 academic reading (Hudson, 1996). Cohesion refers to “the connections between sentences,” which are furnished by pronouns that have antecedents in previous sentences, adverbial connections, known information, and knowledge shared by the reader (Kolln, 1999, p. 271). Frequently used cohesive devices include reference, substitution, ellipsis, and conjunction. According to Thompson (2004), reference is the set of grammatical resources used to repeat something mentioned in the previous text (e.g., the pronoun “it”) or signal something not yet mentioned in the text (e.g., the nondefinite article “A” in the sentences “They came again into their bedroom. A large bed had been left in it”). Substitution refers to the use of a linguistic token to replace the repetitive wording (e.g., “I think so”). Ellipsis is the set of grammatical resources used to avoid the repetition of a previous clause (e.g., “How old is he? Two years old”). Conjunction refers to the combination of any two textual elements into a coherent unit signaled by conjunctives (e.g., however, by the way, thus).
Research has shown that coherent texts contribute to understanding, while ambiguous references, indistinct relationships between elements of the text, and the inclusion of irrelevant ideas or events hinder comprehension (Hudson, 1996; McKeown, Beck, Sinatra, & Loxterman, 1992).
Coherence of a text depends not only on cohesive devices but also on text structure and organization patterns; that is, how the sentences and paragraphs relate to each other and “how the relationships between ideas are signaled or not signaled” (Alderson, 2000, p. 67). Example text structures include cause/effect, general/specific, problem/solution, comparison/contrast, and the use of definitions, illustrations, classifications, and topic sentences. Research has shown that the internal logic of text structures (strong or weak), organization patterns (tight or loose), and location of information within the text structures (earlier or later) affect processing and understanding (e.g., Carrell, 1984, 1985; Roller, 1990; Hudson, 1996).

Background Knowledge and Subject Matter/Topic Knowledge

In addition to knowledge of language, readers’ background knowledge (i.e., knowledge that may or may not be relevant to the text content) and subject matter/topic knowledge (i.e., knowledge directly relevant to the text content) play a crucial role in L2 reading, especially in L2 academic reading where the reading materials are relatively difficult and the primary concern is to predict examinees’ performance on reading tasks involved in academic study (Alderson, 2000; Urquhart & Weir, 1998). According to schema theory and the interactive notion of reading, readers apply their preexisting knowledge when processing texts, which influences the way in which new information is recognized and stored and affects text understanding. Background and topical knowledge have been included in many models of the L2 reading processes (e.g., Grabe, 2002; Hudson, 1996).

Cognitive Skills

In addition to knowledge, readers have skills to learn and process new information in the text. Cognitive skills have long been held as important components of the reading process. For example, Thorndike (1917) stated that reading was reasoning. He explained that readers’ skills to construct meaning approximated logical inference and deduction, and that good readers thought clearly. Cognitive skills enable L2 readers to use information in their mind and cues from the text to fill gaps in understanding and monitor their reading processes (Alderson, 2000). Over the last several decades, skills have been a major area of reading research, and various taxonomies of L2 reading skills have been developed (e.g., Carver, 1992; Farhady & Hessamy, 2005; Grabe, 1991; Koda, 1996; Munby, 1978). These skill taxonomies provide a framework for reading test construction. However, critics argue that the skills in many of these taxonomies are ill defined, have enormous overlap with one another, and lack empirical support (Alderson, 2000). Despite the criticisms, skills such as inference, synthesis, and evaluation are frequently included in current models of L2 reading processes (e.g., Enright et al., 2000; Hudson, 1996; Jamieson, Jones, Kirsch, Mosenthal, & Taylor, 2000).

Problem-Solving Strategies

In recent L2 reading literature, the strategies used by readers when processing text have received considerable attention (e.g., Abbott, 2005; Cohen & Upton, 2005; Lumley & Brown, 2004; Phakiti, 2003; Yang, 2000).
The cognitive and metacognitive strategies that L2 readers use during reading include skimming and scanning the text to locate discrete pieces of information, monitoring progress of understanding, planning ahead how to read, selectively attending to text, and, in testing situations, testwiseness strategies (e.g., guessing and attending to the length of options). Cognitive and metacognitive strategies have been important components in many models of the L2 reading processes (Alderson, 2000; Hudson, 1996; Koda, 2005).
Purpose and Context

In addition to knowledge, skills, and strategies, reader purpose and the context in which L2 readers engage in reading are increasingly being emphasized (e.g., Alderson, 2000; Cohen & Upton, 2005; Enright et al., 2000; Hudson, 1996). These researchers stress that reading is usually undertaken for some purpose and in a specific context, which affects the knowledge and skills required, strategies used, and the understanding and recall of the text.

Conceptualizing L2 Academic Reading Ability

The Constructs of L2 Academic Reading Ability

Models of the L2 reading processes suggest a range of constructs of L2 reading ability, which have been operationalized differently in tests of L2 academic reading (e.g., Cohen & Upton, 2005; Douglas, 2000; Enright et al., 2000; Hudson, 1996; Jamieson et al., 2000). It is currently well accepted that word recognition skills, which are critical to fluent reading, need to be tested. Language knowledge is essential for L2 readers’ understanding of academic texts, and hence should be measured. Knowledge of formal discourse structure should be taken into account in testing L2 academic reading. Cognitive skills and cognitive and metacognitive strategies are important for L2 readers to overcome language difficulties, especially when reading difficult academic texts. Hence, L2 academic reading tests should allow examinees to apply their cognitive skills and strategies. Alderson (2000) stresses that in the context of L2 reading, sufficient knowledge of the second or foreign language, cognitive skills, and problem-solving strategies are especially important. Nevertheless, Alderson reminds us that readers’ background knowledge is normally not included in the constructs to be assessed, though its influence on the L2 reading process and product is recognized.

A Theoretical Framework of Communicative Language Competence

According to the MELAB Technical Manual (English Language Institute, 2003), the framework for developing the MELAB is closely related to the framework of communicative language ability (CLA) proposed by Bachman (1990) and later revised by Bachman and Palmer (1996). Bachman (1990) proposed the framework of CLA, which acknowledges both competence in the language and the capacity for using this competence in contextualized language use (see Figures 4.1 and 4.2, pp. 85–87). Specifically, Bachman’s framework of CLA includes language competence, strategic competence, and psychophysiological mechanisms, and describes the interactions of these components with the language user’s knowledge structures and language use context. Language competence includes organizational competence, which consists of grammatical and textual competence, and pragmatic competence, which consists of illocutionary and sociolinguistic competence. Strategic competence performs “assessment, planning, and execution functions in determining the most effective means of achieving a communicative goal” (p. 107). Psychophysiological mechanisms are “the channel (auditory, visual) and mode (receptive, productive) in which competence is implemented” (p. 108).

Bachman and Palmer (1996) extended Bachman’s (1990) framework and clearly defined language use as the dynamic creation of intended meanings in discourse by an individual in a particular situation. According to Bachman and Palmer, purpose and context of language use are crucial in defining language ability.
They stress that to make inferences about language ability based on language test performance, language ability should be defined
in a way that is appropriate for a particular testing situation, a particular group of examinees, and a specific context in which examinees will be using the language outside of the test itself. For this study, language use occurs in the context where English competence of adult nonnative English speakers is assessed for academic purposes. Correspondingly, reading ability includes the language knowledge and strategic competence to solve the test tasks, and the competence to apply that knowledge and competence to academic reading in the real world. In addition to the emphasis on purpose and context, Bachman and Palmer point out that language use involves complex interactions among individual characteristics of language users and the interactions among these characteristics and the characteristics of language use. Hence, “language ability must be considered within an interactional framework of language use” (pp. 61–62). They presented their framework as a theory of factors affecting performance on language tests and proposed that performance on language tests was affected by (1) the interactions among examinees’ language knowledge, topical knowledge, affective schemata, strategic competence or metacognitive strategies, and personal characteristics such as age and native language, and (2) interactions between examinee characteristics and characteristics of the language use, namely, the test task. Subsequently, Bachman (2002) clearly distinguished three sets of factors that affected test performance: examinee attributes, task characteristics, and the interactions between examinee and task characteristics.

The current theoretical framework of CLA (Bachman, 1990; Bachman & Palmer, 1996) is consistent with current understanding of L2 reading ability and its assessment, which acknowledges the interactive nature of reading and the effect of text and item characteristics, reader knowledge, cognitive and metacognitive strategies, and purpose and context of reading on reading test performance (Alderson, 2000; Enright et al., 2000; Hudson, 1996; Jamieson et al., 2000; Koda, 2005).

Research into L2 Reading Test Item Performance

Methods and Issues Concerning This Research

Over the last decade, language testers have been researching item performance in reading tests. This research has yielded a number of factors that appear to affect item performance across a variety of reading tests. However, because of the limited concepts and methods employed, little progress has been made in our understanding of L2 reading test item performance (Bachman, 2002). Conceptually, current theories of reading recognize the interactions between reader and text and emphasize purpose and context of reading (Alderson, 2000). Moreover, current theories of language testing consider task performance as a function of interactions between examinee attributes and test task characteristics (Bachman, 2002). However, many studies of L2 reading test item performance either focus on the characteristics inherent in the text and/or item itself without taking examinees into account, or vice versa. In addition, the varying purposes and contexts of reading tasks have not been given proper attention. Methodologically, many of the studies are limited in the analyses employed. Representative studies of L2 reading test item performance are critically reviewed below.
Studies of Surface Task Features and Item Performance

Studies of surface task features and item performance typically identify a number of text and/or item features and then investigate the effect of these features on item statistics using quantitative methods, such as the commonly used multiple linear regression (e.g., Freedle & Kostin, 1993; 1999). The findings of these studies have clear implications for the
design of L2 reading tests. However, due to overreliance on surface features of texts and items without taking readers into account, the analyses fail to reveal the processes of item solving. In addition, multiple linear regression analysis has its limitations, such as oversensitivity to the presence or absence of an item feature variable (Kasai, 1997) and strict requirements for linearity and the number of items (Keppel & Zedeck, 2001).

Freedle and Kostin (1993) examined the effect of task features on the difficulty of TOEFL reading items, as measured by equated delta (n items = 213; n examinees = 2,000). Based on a review of previous studies predicting the difficulty of multiple-choice reading test items, they hypothesized that 12 categories of 65 text, item, and text-by-item interaction variables might influence the difficulty of TOEFL reading items. Using multiple regression analysis, they found that 58% of the variance in item difficulty was explained by eight categories of text and text-by-item variables: negations, referentials, rhetorical organizers, sentence length, passage length, paragraph length, lexical overlap between text and options, and location of relevant text information. Their investigation of reading item difficulty as a function of text, item, and text-item interaction influenced later research, and their findings have direct implications for text writing and item design. However, the variables used, which were mainly word counts (e.g., the number of words in the correct option), fail to reveal the complex processes of item solving and lack interpretive and diagnostic value (Kasai, 1997).

Carr (2003) examined task features in explaining the difficulty of 146 reading items included in three TOEFL test forms. Based on a review of previous research, he developed a rating instrument consisting of three sets of 311 passage, key sentence, and item variables, most of which were word and sentence counts. He asked five graduate students in applied linguistics to rate the task features using the rating instrument. However, only passage and key sentence variables were included in his analysis, as text features were considered most relevant to fluent reading and most reflective of the target language use domain. Through exploratory and confirmatory factor analyses, he constructed and tested a factor model of the text features and concluded that passage content, syntactic features of key sentences, and vocabulary factors contributed to the difficulty of the TOEFL reading items. Carr provides a thorough list of text variables that may affect the difficulty of L2 reading test items and an alternative method for investigating the effect of task features on reading item difficulty. However, excluding item variables from the analysis does not seem to be warranted, since the complete task of multiple-choice reading tests involves text, question stem, and options, and examinees’ mental processes used to answer multiple-choice items may differ from those used to answer constructed response or essay questions (Kasai, 1997). In addition, like Freedle and Kostin’s (1993) study, a focus on the surface features of text provided limited information regarding examinees’ cognitive processes during item solving.
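To make the regression approach used in these studies concrete, the sketch below fits an ordinary least squares model that predicts an item difficulty index from surface text and item features and reports the proportion of variance explained. The feature names, coefficients, and data are hypothetical and simulated so the example runs end to end; they are not Freedle and Kostin's actual variables.

```python
# A minimal sketch of regressing item difficulty on surface task features
# (hypothetical features and simulated data, not the original studies' data).

import numpy as np

rng = np.random.default_rng(1)
n_items = 200

# Hypothetical surface features for each item.
passage_length = rng.integers(150, 400, n_items)   # words in the passage
sentence_length = rng.normal(20, 5, n_items)        # mean words per sentence
negations = rng.integers(0, 6, n_items)              # negations in relevant text
lexical_overlap = rng.integers(0, 8, n_items)        # key-option words also in text

X = np.column_stack([np.ones(n_items), passage_length, sentence_length,
                     negations, lexical_overlap])

# Simulated difficulty index (e.g., an equated delta) so the fit can be run.
difficulty = (8 + 0.01 * passage_length + 0.1 * negations
              - 0.15 * lexical_overlap + rng.normal(0, 1, n_items))

# Ordinary least squares fit and proportion of variance explained (R^2).
beta, *_ = np.linalg.lstsq(X, difficulty, rcond=None)
predicted = X @ beta
r_squared = 1 - ((difficulty - predicted) ** 2).sum() / ((difficulty - difficulty.mean()) ** 2).sum()
print("coefficients:", np.round(beta, 3))
print("R^2:", round(r_squared, 3))
```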
Studies of Cognitive Demands of Test Items and Item Performance

Studies of cognitive demands of test items and item performance typically identify item features that are essentially cognitive demands hypothesized to affect the performance of a given item (e.g., Alderson, 1990; Alderson & Lukmani, 1989; Bachman, Davidson, & Milanovic, 1996; Bachman, Davidson, Ryan, & Choi, 1995). These studies used “expert” ratings of the test items that included different combinations of the cognitive demands, and then related the ratings to item performance using cross-tabulation or multiple linear regression analysis. “Experts” in these studies have included various individuals such as EFL teachers or administrators and graduate students in applied linguistics or educational psychology. Results
of these studies consistently indicate no systematic relationship between “expert” ratings and item statistics. The equivocal results are likely caused by methodological limitations. For example, no measurement models that incorporate the cognitive elements of items were used in relating the ratings and item statistics. In addition, item statistics calculated using the classical test theory model have little connection with the cognitive structure of an item (Embretson, 1999). Finally, experts may process the test tasks differently from the target examinees (Alderson, 2000; 2005a; Leighton & Gierl, 2005). Nevertheless, these studies begin to pay attention to the effect of cognitive elements of test items on item performance, which anticipates the cognitive processes used by examinees when they answer test items and precedes the study of item performance in light of examinees’ cognitive processes. Expert analysis may reveal both automatic and controlled processes evoked by test items (Leighton, 2004). As automatic processes are inaccessible for description through conscious verbal reports (Cheng, 2003), analysis of the cognitive demands of an item provides valuable sources of data to supplement verbal reports.

Alderson and Lukmani (1989) investigated the cognitive skills required for correctly answering the reading items included in an L2 communication skills test taken by 100 students at Bombay University (India), and related the skill requirements of individual items to item difficulties, as measured by percentage of correct responses. Nine teachers at Lancaster University (Great Britain) were asked to describe what skills were being tested by each of the 41 test items. Results showed little agreement among the judges and little relationship between item difficulty and skill requirements of each item. The lack of a prestructured rating guide and pretraining of the judges is a likely reason for the results. In addition, the judges may not have been familiar with how students processed the test task.

Using a rating instrument containing 14 reading skills, Alderson (1990) conducted a similar study, in which 18 teachers of ESL were asked to identify a single skill being tested by each of 15 short answer questions on two British language proficiency tests. Again, little agreement was reached among the judges and little relationship was found between item difficulty and skill requirements of the items. Two likely reasons for the results are: (1) correctly answering an item may require multiple skills, whereas the judges were allowed to specify only one skill for each item, and (2) there was enormous overlap among the skills provided on the rating instrument, which affected the accuracy of expert rating.

The studies reviewed above question the ability of experts to determine the skills being tested by an item. Other studies have reported high levels of agreement among expert judges by using well-designed and clearly defined rating instruments, extensive discussion, and exemplification (e.g., Bachman et al., 1995; 1996; Carr, 2003). In Bachman et al.’s (1996) study, five trained applied linguists with experience as EFL teachers were asked to analyze the characteristics of 25 vocabulary and 15 reading items and passages on each of the six parallel forms of an EFL test. The number of examinees for each form ranged from 431 to 1,099.
A refined rating instrument, which contained 23 test task features (TTF) and 13 communicative language abilities (CLA) defined using Bachman’s (1990) framework, was presented to the raters. Rater agreement was checked using generalizability analysis and rater agreement proportion. Results showed that the overall rater agreement was very high and that the TTF ratings were more consistent than the CLA ratings. They related the TTF and CLA ratings to the IRT item parameter estimates calibrated using the 2PL model. Stepwise regressions were performed for all items and for vocabulary and reading items separately, by individual form, and for all forms combined. Results showed that neither TTF nor CLA ratings consistently
predicted item difficulty or discrimination across the six forms, though a combination of the TTF and CLA ratings consistently yielded high predictions. Their study demonstrates the possibility of achieving a high level of agreement among expert judges. The use of a rating instrument and rater training appear to play an important part in rater agreement. Their study has several implications. First, more refined definitions of the abilities may increase the consistency of ability ratings. Second, the inconsistent prediction of item parameter estimates across the forms indicates that the item features identified are likely affected by differences among passages and items on different tests. A larger number of tests may need to be examined to identify item features that reliably affect item performance. Finally, as experts may process test tasks differently from the target examinees, it is imperative to examine examinees’ actual processes underlying the correct responses (Alderson, 2000; 2005a; Leighton & Gierl, 2005).

Processes in Task Performance Inferred from Verbal Reports

Concurrent verbal reports (i.e., an individual’s description of the processes he/she is using during task solving) and retrospective verbal reports (i.e., the recollection of how the task was solved) have been established as valid means to obtain valuable sources of data on cognitive processing during task performance (Ericsson & Simon, 1993). Leighton (2004) recommends collecting both forms of verbal reports to triangulate the processes used to solve the tasks, using tasks of moderate difficulty to maximize the verbalization elicited, and analyzing a task’s cognitive demands prior to eliciting verbal reports to anticipate the cognitive processes a respondent will use when solving the task. The last decade has seen an increased use of verbal reports to inspect the processes of L2 readers during test taking (e.g., Abbott, 2005; Allan, 1992; Anderson, Bachman, Perkins, & Cohen, 1991; Block, 1992; Cohen & Upton, 2005; Lumley & Brown, 2004; Phakiti, 2003; Yang, 2000). These studies shed some light on the cognitive processes underlying reading test item performance and suggest a number of processes that appear to predict item statistics. However, as the test tasks differ across the studies, the research results as a whole have been inconsistent.

Anderson et al. (1991) investigated the strategies used by adult EFL learners to complete a standardized reading test and then examined the relationships among strategies, item type, and item performance, using chi-square analyses to triangulate three sources of data: retrospective verbal reports, item type, and item statistics (difficulty p and point-biserial discrimination rpbi). Their results revealed a significant relationship between (1) frequencies of the reported strategies and item type, and (2) item difficulty and the strategies of skimming, paraphrasing, guessing, responding affectively to text, selecting answer through elimination, matching stem with text, selecting answer because stated in text, selecting answer based on understanding text, and making reference to time. In addition, their results showed that more strategies were reported for the items of average difficulty (0.33 ≤ p ≤ 0.67) than for the difficult items (p < 0.33) and easy items (p > 0.67). This finding appears to support the use of moderately difficult items to maximize verbal report data (Leighton, 2004). However, no significant relationship was discerned between item type and item difficulty.
Their study demonstrates a triangulation approach to the construct validation of a standardized reading test. The authors recommend the use of multiple data sources and stress supplementing the traditional psychometric approach with qualitative analysis of item content and verbal reports, which has significant implications for research on standardized reading tests.
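As an illustration of the kind of chi-square analysis Anderson et al. ran, the sketch below builds a contingency table of strategy-report counts across easy, average, and difficult items (using the p-value bands cited above) and tests it for independence. The counts and the two strategy labels are invented for the example.

```python
# A minimal sketch (hypothetical counts) of a chi-square test relating
# reported strategies to item difficulty bands.

from scipy.stats import chi2_contingency

# Rows: two reported strategies; columns: easy (p > .67), average
# (.33 <= p <= .67), and difficult (p < .33) items.
observed = [
    [12, 30, 9],   # "matching stem with text"
    [ 5, 14, 20],  # "guessing"
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.3f}")
```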
Item Modeling with New Concepts and Methods

In response to the call for the union of cognitive psychology and assessment, there is a growing interest in the recent psychometric literature in modeling reading test item performance in light of the cognitive elements of an item (e.g., Gorin, 2002; Huff, 2003; Rupp, Garcia, & Jamieson, 2001; Sheehan, 1997; Sheehan & Ginther, 2001). These studies typically rely on expert identification of cognitively based item features, and then relate these features to item statistics using new measurement models that can incorporate these features. These studies have demonstrated that linking “indicator variables that distinguish the cognitive processes assumed to be involved in item solving” and “observable item performance indices, in particular, item difficulties” can provide invaluable validity information and rich sources of data for understanding the cognitive processing during task performance (Wainer, Sheehan, & Wang, 2000, p. 114).

However, there are several limitations with some of these studies. First, item features are simply judged by experts without being validated by examinees’ actual processes while answering items. As experts may process a task differently from the target examinees, expert judgment may not represent examinees’ actual processes underlying item performance. Second, item parameter estimates calibrated using the 2-PL or 3-PL IRT models are problematic in the case of passage-based testlets. This is because the interrelatedness among the set of items based on a common passage violates the local item independence assumption of IRT, which can cause inaccurate estimation of examinee abilities and item statistics (Lee, 2004; Kolen & Brennan, 2004; Wainer & Lukhele, 1997). Third, due to the gaps among cognitive psychology, measurement, and reading, many of these studies fail to incorporate the most current cognitive theories in reading and fail to justify the item features within a framework of ability constructs. Despite the limitations, psychometric studies on modeling reading item performance with cognitively based item features and tree-based regression (TBR) measurement models offer considerable promise for research into L2 reading test item performance.

Sheehan (1997) conducted one of the first studies modeling item difficulty based on item processing features in order to develop student- and group-level diagnostic feedback. He analyzed examinee responses to 78 verbal items (40 passage-based reading, 19 analogy, and 19 sentence-completion items) on an operational form of the SAT I Verbal Reasoning Test. In his TBR analysis, the criterion variable was the 3-PL IRT item difficulty estimate, and the predictors were hypothesized skills required for item solution. Using a user-specified split, the items were first classified according to four processing strategies: Vocabulary, Main Idea and Explicit Statement, Inference, and Application or Extrapolation. The first split explained 20% of the observed variance in item difficulty. To explain more variance, each strategy node was split into two child nodes based on different skills within each strategy. For example, the Vocabulary strategy was further divided into Standard Word Usage and Poetic/Unusual Word Usage. This split explained about 50% more of the observed variance in item difficulty.
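The sketch below illustrates the tree-based regression idea with a standard regression tree: IRT-based difficulty estimates are regressed on coded cognitive features, the tree's splits define interpretable item groups, and the proportion of variance explained summarizes model fit. The feature names, codes, and simulated difficulty values are hypothetical; this is not Sheehan's data or his exact procedure, only the general logic described above.

```python
# A minimal sketch of tree-based regression of item difficulty on coded
# cognitive features (hypothetical features and simulated values).

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(2)
n_items = 60

# Hypothetical coded features per item, e.g., level of inference required,
# number of plausible distractors to evaluate, and degree of synthesis.
inference_level = rng.integers(0, 4, n_items)   # 0 = literal match ... 3 = beyond text
distractor_eval = rng.integers(0, 4, n_items)   # distractors with lexical overlap
synthesis = rng.integers(0, 3, n_items)         # 0 = none, 1 = low, 2 = high

X = np.column_stack([inference_level, distractor_eval, synthesis])
feature_names = ["inference_level", "distractor_eval", "synthesis"]

# Simulated IRT difficulty estimates, so the example runs end to end.
b = 0.6 * inference_level + 0.3 * distractor_eval + rng.normal(0, 0.4, n_items)

tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=5, random_state=0)
tree.fit(X, b)

# The printed tree shows which coded features drive the splits; the score is
# the proportion of variance in difficulty explained (R^2).
print(export_text(tree, feature_names=feature_names))
print("proportion of variance explained:", round(tree.score(X, b), 3))
```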
In a subsequent study, Sheehan and Ginther (2001) successfully applied TBR to develop an item difficulty model for the Main Idea type reading items on the TOEFL 2000, based on cognitive processing features of the items. They coded the Main Idea items with three variables describing item-passage overlap features: Correspondence between correct response and textual information (0 = No Inference, 1 = Low-Level Inference, 2 = High-Level Inference), Location of Relevant Information (1 = Early, 2 = Middle, 3 = Late, 4 = Entire Passage), and Elaboration of Information (scored as the percent of text that must be processed to correctly answer the item). The resulting cognitive processing model accounted
for 87% of the variance in item difficulty, with Correspondence as the strongest predictor and Elaboration an insignificant predictor.

Rupp et al. (2001) applied TBR to model listening and reading items. Despite a small sample size (84 nonnative English speakers of varying ability levels), two strengths are unique to their study. First, they employed both TBR and multiple linear regression analyses, and thus provided multiple perspectives to more fully interpret the item difficulty model. Second, when defining the predictors, they linked cognitive demands of an item to text and item features by proposing that the processing underlying task performance was associated with text features (e.g., information density), item features (e.g., lexical overlap between correct answer and distractors), and text-by-item interactions (e.g., type of match). A limitation with their study might be the lack of strong evidence for combining the items across the modalities in item modeling. They assumed that reading and listening items could be grouped according to information processing characteristics common to both modalities. A think-aloud or dimensionality analysis may help to clarify whether modeling reading and listening item groups separately would be better in terms of interpretability of the models.

Huff (2003) modeled item difficulty for the new TOEFL using TBR for the purpose of providing descriptive score reports regarding examinees’ English language proficiency. In her application, the data were examinee performances on the Listening and Reading items from two prototypical parallel forms (1,372 examinees for Form 1 and 1,331 for Form 2). Her final models accounted for 56% of the variance in item difficulty for reading items and 48% for listening items. Several features distinguish her analysis. First, both dichotomously and polytomously scored items were involved. Item difficulty parameters were estimated with the 3-PL IRT model for dichotomous items and the graded response model (GRM; Samejima, 1997) for polytomous items. Second, unlike previous TBR studies where items were classified by user specifications, Huff introduced cluster and dimensionality analyses to complement the subjective judgment of item classifications. Her study showed that dimensionality analyses facilitated item grouping and substantive interpretations of item modeling solutions. Third, the predictors used in her TBR analysis were the existing item and passage codes developed by the TOEFL developers. These predictors included item and text features and were defined using Mislevy’s (1994) framework of evidence-centered design and Bachman’s (1990) framework of CLA. However, as these existing codes were not defined specifically for item difficulty research, factors affecting reading/listening item difficulty and the interaction between item and text—that is, what an examinee is required to do and the type of information in the text—were not taken into account.

Defining item features is the fundamental issue in applying TBR to item modeling, as the item features that are included in the model and how they are coded are closely related to model interpretability (Ewing & Huff, 2004; Huff, 2003). In the assessment of reading, assessing examinees’ processes when they read passages and respond to items has been increasingly emphasized, and the methods in cognitive psychology such as task and verbal report analysis have been used to gain insights into examinees’ processes during task performance (Alderson, 2000).
Accordingly, identifying the cognitive processes underlying reading item performance needs to draw on theoretical information, the cognitive features of items, and examinees’ verbal reports about their item-solving processes.
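For reference, since several of the studies above use difficulty estimates from parametric IRT models as the criterion to be explained, a minimal sketch of the three-parameter logistic (3PL) item response function is given below; the parameter values are hypothetical. Setting the pseudo-guessing parameter c to zero gives the 2PL model used in Bachman et al. (1996).

```python
# A minimal sketch of the 3PL item response function (hypothetical parameters).

import numpy as np

def p_correct_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model:
    discrimination a, difficulty b, pseudo-guessing c."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

# A moderately difficult, moderately discriminating four-option item.
theta = np.linspace(-3, 3, 7)   # examinee ability levels
print(np.round(p_correct_3pl(theta, a=1.2, b=0.5, c=0.25), 3))
```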
Method

Description of the MELAB Reading Section

According to the MELAB Technical Manual (English Language Institute, 2003), the reading section is designed to assess examinees’ understanding of college-level reading texts. The reading section consists of four passages, with each followed by five multiple-choice items. Each item consists of a question stem and four options (one key and three distractors). Examinees are instructed to read the passages and select the single best answer based on the information in the passages. All passages are expository texts, and the language is characteristic of English for academic purposes. The readability of the passages, as measured by a standard readability formula, suggests that the vocabulary and structural complexity of the passages are at the college level. The topics of the passages are accessible to all examinees; no prior knowledge is required to understand a passage or solve an item. To counter any possible bias towards examinees of a particular educational or cultural background, ELI-UM selects texts on a range of topics and includes different genres of passages in each test form. According to the ELI-UM item-writing guidelines, the questions following each passage are intended to assess a variety of reading abilities, including recognizing the main idea, understanding the relationships between sentences and portions of the text, drawing text-based inferences, synthesizing, understanding the author’s purpose or attitude, and recognizing vocabulary in context.

In this study, the analyses were performed on the reading sections of two parallel MELAB forms, Form E and Form F, administered during the years 2003 and 2004. The passages included in each form range from 229 to 265 words in length and are on topics in the social science, biological science, physical science, and agriculture subject areas.

Defining the Initial Cognitive Processing Model and Cognitive Variables

Following an analysis of the literature and the constructs assessed by the MELAB reading section, a theoretically supported cognitive processing model was hypothesized to underlie MELAB reading test item performance. The model was considered to have the following 10 general categories of processing components.

1. Recognize and determine the meaning of specific words or phrases using context clues or phonological/orthographic/vocabulary knowledge (PC1).
2. Understand sentence structure and sentence meaning using syntactic knowledge (PC2).
3. Understand the relationship between sentences and organization of the text using cohesion and rhetorical organization knowledge (PC3).
4. Speculate beyond the text, e.g., use background/topical knowledge (PC4).
5. Analyze the function/purpose of communication using pragmatic knowledge (PC5).
6. Identify the main idea, theme, or concept; skim the text for gist (PC6).
7. Locate the specific information requested in the question; scan the text for specific details, which includes (a) match key vocabulary items in the question to key vocabulary items in the relevant part of the text, and (b) identify or formulate a synonym or a paraphrase of the literal meaning of a word, phrase, or sentence in the relevant part of the text (PC7).
8. Draw inferences and conclusions based on information implicitly stated in the text (PC8).
9. Synthesize information presented in different sentences or parts of the text (PC9).
10. Evaluate the alternative choices to select the one that best fits the requirements of the question and the idea structure of the text (PC10).
Based on this theoretical model and empirical studies of processing difficulty for multiple-choice reading test items (Gorin, 2002; Huff, 2003; Jamieson et al., 2000; Kirsch & Mosenthal, 1990; Sheehan & Ginther, 2001), the cognitive processing features of the MELAB reading items to be scored for the TBR statistical model were derived. PC1, PC2, and PC3 were coded as the degree to which the corresponding process was required to solve the item (0 = Low; 1 = Middle; 2 = High). In addition, PC1 involved a variable measured as the percentage of specialized and infrequent words in the part of the text where the necessary information to solve the item is located, based on the hypothesis that text with more specialized and infrequent vocabulary items will be more difficult to process, understand, and recall when answering the questions. This variable was scored using Web VP version 2.0 (Cobb, 2004). PC5 was coded on a scale from 0 to 4, with higher numbers representing more complex processing required to solve the item. PC6 and PC7 were coded on a three-point scale (0 = the question does not request locating specific details in the text; 1 = the information requested in the question can be located in the text by identifying the lexical overlap between the question and the text; 2 = the information requested in the question can be located by identifying a synonym or a paraphrase of the literal meaning of a word, phrase, or sentence in the text). PC4 and PC8 were coded as correspondence between correct response and information in text (0 = Literal or synonymous match; 1 = Low text-based inference; 2 = High text-based inference; 3 = Prior knowledge beyond text). PC9 was coded as the degree to which synthesis was required to solve the item (0 = No synthesis; 1 = Low-level synthesis; 2 = High-level synthesis). PC10 was coded as the number of distractors that contained lexical overlap with text or ideas explicitly/implicitly stated in the text. As it was hard to reach consensus on this variable, it was computed as the average of the ratings by the three raters.
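As a concrete illustration of this coding scheme, the sketch below records one rater's codes for a single item and averages PC10 across raters, as described above. The item, the codes, and the validation helper are hypothetical and are not the study's actual rating data.

```python
# A minimal sketch of recording the coded cognitive variables for one item.
# Scales follow the coding scheme in the text: PC1-PC3 and PC6/PC7/PC9 on 0-2,
# PC4/PC8 on 0-3, PC5 on 0-4, and PC10 as a count of overlapping distractors.

from statistics import mean

ALLOWED = {
    "PC1": range(0, 3), "PC2": range(0, 3), "PC3": range(0, 3),
    "PC4": range(0, 4), "PC5": range(0, 5), "PC6": range(0, 3),
    "PC7": range(0, 3), "PC8": range(0, 4), "PC9": range(0, 3),
    "PC10": range(0, 4),  # at most three distractors can overlap with the text
}

def validate(codes: dict) -> dict:
    """Check one rater's codes for a single item against the coding scheme."""
    for pc, value in codes.items():
        if value not in ALLOWED[pc]:
            raise ValueError(f"{pc}={value} is outside its coding scale")
    return codes

# Hypothetical codes from three raters for a single item.
rater_codes = [
    validate({"PC1": 1, "PC2": 0, "PC3": 1, "PC4": 1, "PC5": 0,
              "PC6": 0, "PC7": 2, "PC8": 2, "PC9": 1, "PC10": 2}),
    validate({"PC1": 1, "PC2": 0, "PC3": 1, "PC4": 1, "PC5": 0,
              "PC6": 0, "PC7": 2, "PC8": 2, "PC9": 1, "PC10": 3}),
    validate({"PC1": 1, "PC2": 1, "PC3": 1, "PC4": 1, "PC5": 0,
              "PC6": 0, "PC7": 2, "PC8": 2, "PC9": 1, "PC10": 2}),
]

# Only PC10 is averaged across raters; the other variables would be resolved
# through the consensus procedure described in the following section.
pc10_average = mean(codes["PC10"] for codes in rater_codes)
print(f"PC10 (averaged across raters): {pc10_average:.2f}")
```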
The raters were asked to (1) answer the sample items, (2) mark their answers using the answer key, and (3) code the sample items using the rating instrument. Upon completion, the coding results and inconsistencies were discussed. Following that, the raters independently coded the reading items included in both forms. To ensure that the procedure was followed exactly, each rater was provided three envelopes, which contained instructions and materials for each step of the coding. Envelope A contained two MELAB reading sections. The raters were instructed to read the passages and 14
answer the items as if they were taking a reading test. Upon completing this task, they were instructed to open Envelope B, which contained the answer keys provided by the ELI-UM. The raters were asked to check and correct their answers. Upon completion, they were instructed to open Envelope C, which contained the rating instrument and instructions for item coding. The raters were asked to complete the entire task in 3 days, and to return their completed work with all the instructions and materials in the original envelopes to the researcher by the end of the third day. The researcher entered the rating data collected from the raters into SPSS 13.0 (SPSS, 2005) and verified for 100% accuracy. Rater consistency was examined using generalizability theory (G-theory). G-theory offers a more comprehensive framework for studying the rater data. With G-theory, rater performance can be studied across a number of different factors, such as cognitive processes and items. Finally, a meeting was held for the raters to look at the coding summary conducted by the researcher by hand and to reach consensus on the item codes for which there was a lack of agreement. Following the meeting, the researcher entered the consensus codes into the Microsoft Excel 2000 and verified to ensure 100% accuracy. Validating the Cognitive Processes through Verbal Reports To ensure that the cognitive model and the associated cognitive variables are faithful descriptors of examinees’ cognitive processes, concurrent and retrospective verbal reports were collected from individual participants as they worked through the MELAB reading items. The participants were 10 Chinese-speaking students (4 male, 6 female) enrolled in an undergraduate or graduate program at the University of Alberta in fall 2005. They ranged in age from 19 to 32, received at least 11 years of basic education and at least eight years of English education in China, and had resided in English-speaking countries for no longer than six months. The participants were randomly assigned to take Form E and Form F, with an equal number of participants for each form. Data collection took the form of administering Form E or Form F of the MELAB reading test and asking participants to report, in Chinese, English, or both, what they were thinking as they answered the items and what they thought while answering the items after completing each item. To avoid the possibility that researcher probes could lead the participants, nonmediated verbal reports were used. Given that the original form containing 20 items was too long for both concurrent and retrospective verbal reports, data from each participant were collected during two separate sessions scheduled on 2 different days within a week, with the first 10 items administered on day one and the second 10 items on day two. On day one, the researcher met with a single participant in an empty office at the university. The researcher and participant sat side by side at a table on which there were a digital audio recorder, a microphone, and a folder containing the experimental materials. These materials included a sheet of directions, two practice questions, and the test form. The researcher first explained the nature and procedure of the task. Given that participants may not have been familiar with the verbal report methods, the researcher provided an opportunity for them to practice verbal reporting skills, using two questions presented in Ericsson and Simon (1993). 
The researcher asked them to report aloud their thinking and the information they were attending to while answering the sample items (concurrent reports). After selecting an answer to an item, the researcher asked them to report their remembrance about their thoughts and the places they were attending to from the time they began to read the question until they selected an answer (retrospective reports). The participants were asked to answer the items as 15
if in a real testing situation and to verbalize whatever was on their minds while and after completing each item. The participants practiced the verbal report procedures using the sample questions. If the participants remained silent for a lengthy time period, they were reminded to keep talking. Once the participants became accustomed to the reporting procedures and had no more questions, they were administered the first two passages with their associated 10 items from Form E or Form F and the digital audio recorder was turned on. Participants were asked to read the passages silently, verbally express their thought processes while responding to the items, and upon completing each item, retrospectively describe aloud what they remembered about their thought processes used to answer the item. If the participants remain silent for a lengthy time period, they were prompted to keep talking. On day two, following the same procedure, the participants completed the remaining two passages with their accompanying 10 items. To maximize the consistency among the sessions, standardized procedures and instructions were followed for each session. The participants’ verbal reports were transcribed verbatim and typed into the computer for analysis. The researcher reviewed all verbal reports and coded them for the cognitive processes used to answer each item. The initial cognitive processing model was used as a starting point for classifying the verbal report data. Statements or phrases in the reports associated with each cognitive process were segmented and assigned a code. Additional processes gleaned from the transcripts were categorized and added to the model. After the verbal report data were coded and the additional processing categories developed, the cognitive processing model was revised as necessary and then used as a coding scheme to recode the previously coded data by the researcher. To evaluate the coding reliability, an independent rater (i.e., a colleague of the researcher, who has comparable expertise as the researcher and no experience with the study) was invited to code 40% of the verbal report data. The independent rater was first trained in the data coding procedures. During the training, the researcher discussed the coding scheme (i.e., the list of 10 processing components in the initial cognitive model with definitions and examples of each) with the rater, demonstrated the coding, and provided a chance to practice using the verbal reports from one of the participants. After the training, the rater independently coded four randomly selected verbal reports, using the coding scheme, and then his codes were compared to those of the researcher. The percent of interrater agreement was calculated to evaluate the consistency of the verbal report coding. To determine the final set of item features, the cognitive processes obtained through the analysis of verbal reports were compared to those obtained through rater coding. Consistent findings were checked, complementary findings combined, and contradictory findings replaced with the cognitive processes inferred from the verbal reports. Based on the results of the comparison, the cognitive model and the rating instrument were refined as necessary. Next, the three previous raters met together with the researcher to review the changes about the model and the rating instrument, and to recode the items using the modified rating instrument. 
Once consensus was reached, the final list of the cognitive processes required to correctly answer each item was formatted into two 20 x k matrices (where 20 is the total number of items on each test form and k is the number of cognitive processes required to correctly answer the items), with one matrix for Form E and the other for Form F.
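To illustrate the structure of these matrices, the following sketch (in Python) builds a small item-by-feature table. The feature names mirror the scored variables described above, but the numeric codes and item labels are invented placeholders rather than the actual consensus codes.

```python
# Illustrative sketch of a 20 x k item-feature matrix: one row per reading
# item, one column per scored cognitive variable. Values are placeholders.
import pandas as pd

features = ["WordRecog", "Speword", "Syntax", "TextOrg", "Pragmatic",
            "Locate", "Inference", "Synthesis", "Distractor"]

# Hypothetical consensus codes for the first three Form E items; a complete
# matrix would contain one row for each of the 20 items on the form.
form_e_codes = pd.DataFrame(
    [[2, 22.0, 1, 0, 2, 1, 1, 0, 2],
     [1, 18.5, 2, 1, 0, 2, 0, 1, 1],
     [1, 30.0, 1, 2, 4, 1, 3, 2, 3]],
    columns=features,
    index=["E01", "E02", "E03"])

print(form_e_codes.shape)  # (number of items, k) -- (20, 9) for a full form
```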
Developing Item Difficulty Model
Data
Two data files containing examinee item responses to reading items on Form E and Form F of the MELAB were provided by the ELI-UM. One data file contained item responses from 1703 examinees on Form E administered from January 2003 through September 2004. The other data file contained item responses from 1044 examinees on Form F administered from January 2003 through October 2004. Neither file contained missing data, because examinees who did not attempt one or more of the items (3.2% of the total number of examinees) had been excluded on the grounds that they may have been simply guessing and thus not engaging in the processes required to solve the items (J. Johnson, personal communication, January 18, 2005).
Data Scoring and Analysis of Psychometric Characteristics
Examinee response data were exported into SPSS 13.0 (SPSS, Inc., 2005). Items were scored to the key, with 0 representing incorrect responses and 1 representing correct responses. Descriptive statistics and reliability estimates were computed for the two reading sections using the computer program Lertap 5 (Nelson, 2000). Given the lack of local item independence due to common passages (Kolen & Brennan, 2004), item parameter estimates were calibrated using the testlet response theory (TRT) model (Wang, Bradlow, & Wainer, 2002). The TRT is a four-parameter dichotomous IRT model that introduces a testlet effect parameter, γ_ig(j). The TRT model is expressed as:
p(y_{ij} = 1 \mid \theta_i) = c_j + (1 - c_j)\,\frac{\exp[a_j(\theta_i - b_j - \gamma_{ig(j)})]}{1 + \exp[a_j(\theta_i - b_j - \gamma_{ig(j)})]},
where yij is the score for examinee i on item j, θi is the ability level of examinee i, p(yij= 1|θi) is the probability that examinee i at the ability level θ correctly answers item j, aj is the discrimination parameter of item j, bj is the difficulty parameter of item j, cj is the pseudoguessing parameter of item j, and γig(j) is the testlet effect parameter indicating the testlet effect for examinee i responding to item j that is nested in testlet g. The TRT model separates the testlet effect from examinee ability by estimating the testlet effect parameter (γ) for each testlet and each examinee during the calibration of the a-, b-, and c-parameters. In this way, the problem of local dependence in passage-based reading tests is attended to and the resulting item parameter estimates are more accurate (Wang et al., 2002). For this study, the parameters of the reading items were estimated separately within Form E and Form F, using the computer program SCORIGHT 3.0 (Wang, Bradlow, & Wainer, 2004). This program is used because it is based on the TRT model and can address the problem of local dependence. The item difficulty parameter estimates calibrated within each form were formatted into two 20 x 1 vectors, with one vector for Form E and the other for Form F. TBR analysis To determine the extent to which the identified cognitive processes accounted for the item difficulty estimates, two sets of TBR analyses were performed using the regression tree module available through SPSS 13.0. One set of TBR analysis was performed on Form E as the principal analysis, and the other performed on Form F to cross-validate the results. In both sets of the TBR analyses, the predictors were the 20 x k cognitive processes matrix, and the
criterion the 20 x 1 vector of the item difficulty estimates for the corresponding form. The analysis of Form E began with the placement of the 20 items in a single node at the top of the tree, where 0% of the variance was explained. The items were successively split into increasingly homogeneous clusters, according to the classification of the cognitive processes required for item solution. At the bottom of the tree, each item was classified into its own cluster, where 100% of the difficulty variance was explained. At each stage of splitting, a recursive partitioning algorithm (Breiman, Friedman, Olshen, & Stone, 1984) was used to evaluate all possible splits of the predictor variables. The best split was the one that resulted in the largest reduction in the deviance between the parent node and the sum of the two child nodes. The smaller the deviance value is, the more homogeneous the items within a node are. Generally, increasing the level and terminal nodes of the tree would lead to an increase of the explained variance in the difficulty. However, in order to determine the levels of the tree and the number of terminal nodes the parsimony and interpretability of the model needs to be taken into account. If adding a new level and more terminal nodes does not contribute to the improvement of the variance explained, the more parsimonious model is preferred (Huff, 2003). Hence, the final stage of the TBR analysis was to increase the parsimony and interpretability of the model by pruning, which involved removing one or more sets of child nodes and collapsing the similar terminal nodes. After completing the principal analysis with Form E, cross-validation was performed using the 20 items on Form F through the same procedure. If the final item difficulty model obtained using the Form E data can be replicated using the Form F data, then the TBR analyses will provide strong empirical support for the cognitive processing model underlying the MELAB reading test items. Results Coding the MELAB Reading Items Rater consistency was examined using a G-study fully crossed item by skill by rater mixed effect design, in which items were treated as the object of measurement, raters a random facet, and cognitive processes a fixed facet. The computer program GENOVA (Crick & Brennan, 1983) was used to obtain the variance components and reliability coefficient, which were displayed in Table 1.
Table 1. Variance Components and Reliability Coefficient from the G-Study
Effect                       Degrees of Freedom    Variance Component    Percent
Item                                 39                  0.1144           11.50
Rater                                 2                  0.0455            4.57
Process                               7                  0.0643            6.46
Item-Rater Interaction               78                  0.0703            7.07
Item-Process Interaction            273                  0.3324           33.42
Rater-Process Interaction            14                  0.0401            4.03
Residual                            546                  0.3276           32.94
Reliability                                                                0.75
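As an illustration of how the Percent column in Table 1 is obtained from the variance components, and of one way a generalizability coefficient of about .75 can be computed for this design, the following sketch can be used. The coefficient shown treats items as the object of measurement and averages over 3 random raters and 8 fixed processes; this is one plausible formulation, and the exact expression used by GENOVA for this mixed design may differ.

```python
# Variance components from Table 1 (item x rater x process G-study).
components = {
    "item": 0.1144, "rater": 0.0455, "process": 0.0643,
    "item_x_rater": 0.0703, "item_x_process": 0.3324,
    "rater_x_process": 0.0401, "residual": 0.3276,
}

# Percent column: each component as a share of the total estimated variance.
total = sum(components.values())
for effect, variance in components.items():
    print(f"{effect:18s} {variance:.4f}  {100 * variance / total:5.2f}%")

# One plausible generalizability coefficient for items as the object of
# measurement, averaging over n_r = 3 random raters and n_p = 8 fixed
# processes (assumption: GENOVA's formula for this design may differ).
n_r, n_p = 3, 8
universe_score = components["item"]
relative_error = (components["item_x_rater"] / n_r
                  + components["residual"] / (n_r * n_p))
print(f"G coefficient: {universe_score / (universe_score + relative_error):.3f}")
# prints approximately 0.755, close to the 0.75 reported in Table 1
```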
Several notable things can be observed from Table 1. First, the reliability coefficient was 0.75, which indicates that the items were consistently coded by the raters. For a more comprehensive understanding of the raters’ performance, the different effects involving raters can be referred to. Among all effects but residual, the effects involving raters accounted for a negligible amount of the total variance. Only 4.57% of the total variance was accounted for by the rater effect, 7.07% by the item-rater interaction, and 4.03% by the rater-process interaction. Hence, it can be concluded that the raters performed consistently across processes and across items. Second, the largest variance component came from the item-process interaction, indicating that different processes were required to solve different items. Third, item effect only accounted for 11.50% of the total variance, which indicates that the average ratings received by the items across different processes were comparable, and that an item receiving a high rating on one process might receive low ratings on other processes. Results of the item coding showed several major points. First, all components of the initial cognitive processing model were involved in solving the reading items on Form E and Form F, and no additional processes were raised by the raters. Second, correctly answering an item requires multiple cognitive processes. Third, correctly answering an item was often associated with text/item features, knowledge of particular lexical items, drawing inferences, and evaluating alternative options. The final set of consensus codes is presented in Appendix A. To compare the cognitive item features across forms, descriptive statistics for the consensus codes were calculated and are presented in Table 2. As the table shows, the distributions of item features were comparable across forms. For the features “applying pragmatic knowledge” and “locating information,” the mean ratings for the items on both forms were identical. For the features “percentage of specialized and infrequent vocabulary in relevant part of the text” and “the number of plausible distractors,” the mean ratings for Form F items were slightly lower than those for Form E items, but for the remaining five features, the mean ratings for Form F items were slightly higher than those for Form E items.
Table 2. Descriptive Statistics for Consensus Item Codes
          Word      % Sp.              Text      Prag-
          Recog.    Words    Syntax    Org.      matic    Locate   Infer.   Synthesis   Distractor
Form E
  N       20.00     20.00    20.00     20.00     20.00    20.00    20.00    20.00       20.00
  Min.     0.00     10.00     0.00      0.00      0.00     0.00     0.00     0.00        0.00
  Max.     2.00     41.67     2.00      2.00      4.00     2.00     3.00     2.00        3.00
  Mean     1.30     22.04     1.45      0.80      2.00     1.20     1.20     0.70        1.90
  SD       0.73      7.62     0.69      0.95      1.56     0.70     0.77     0.80        0.91
Form F
  N       20.00     20.00    20.00     20.00     20.00    20.00    20.00    20.00       20.00
  Min.     1.00      8.70     1.00      0.00      0.00     0.00     0.00     0.00        0.00
  Max.     2.00     31.58     2.00      2.00      4.00     2.00     2.00     2.00        3.00
  Mean     1.50     19.86     1.60      1.05      2.00     1.20     1.25     1.25        1.65
  SD       0.51      5.98     0.50      0.76      1.41     0.70     0.79     0.72        1.14
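The per-form summaries in Table 2 can be reproduced directly from a consensus-code table such as the one sketched earlier; the column names and rows below are illustrative assumptions, not the actual codes.

```python
# Minimal sketch: Table 2-style descriptive statistics, assuming a DataFrame
# with one row per item, a "Form" column, and one column per coded variable.
import pandas as pd

codes = pd.DataFrame({
    "Form":       ["E", "E", "F", "F"],   # placeholder rows only
    "Distractor": [2, 1, 3, 1],
    "Inference":  [1, 0, 2, 1],
})

summary = codes.groupby("Form").agg(["count", "min", "max", "mean", "std"])
print(summary.round(2))  # mirrors the N/Min./Max./Mean/SD rows of Table 2
```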
Verbal Reports of the Cognitive Processes
The reliability of assigning processes to the various processing categories was evaluated. An independent rater coded four verbal reports: two randomly selected from the Form E participants and two randomly selected from the Form F participants. The coding results of the independent rater were compared to those of the researcher. Consistency was defined as the extent to which the verbal report data segments were coded into the same processing categories by both raters. Of a total of 291 processes coded by both raters, 247 agreements occurred. Hence, the percentage of total agreement between the researcher and the independent rater was 85%. The total number of cognitive process codes assigned to each verbal report is presented in Table 3.
Table 3. Cognitive Process Frequencies for Each Participant and Form
Participant   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8   PC9   PC10   Other   Total
E1              9     6     6     1     5     3    18     8     3     12       1      72
E2              4     4     5     0     1     2    17    10     2     11       2      58
E3              2     2     3     0     1     2    15     5     4     15       4      53
E4              6     1     0     1     0     3    12     3     1      3       2      32
E5              1     3     4     0     0     2    15     9     3     14       1      52
Total (E)      22    16    18     2     7    12    77    35    13     55      10     267
F1              7     4     1     2     0     6    15    11     7     13       2      68
F2              6     5     0     2     3     4     9     4     5      8       2      48
F3              3     4     2     0     3     4    16     8     2     13       2      57
F4              4     9     6     0     1     7    15    12     3     13       1      71
F5              6     3     1     1     3     5    12     8     5      9       4      57
Total (F)      26    25    10     5    10    26    67    43    22     56      11     301
Note: PC = processing component in the initial cognitive processing model.
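A tally such as Table 3 can be produced from the coded verbal-report segments with a simple cross-tabulation; the records below are invented placeholders, not the actual coded data.

```python
# Minimal sketch: count coded verbal-report segments by participant and
# processing component (placeholder records).
import pandas as pd

segments = pd.DataFrame({
    "participant": ["E1", "E1", "E1", "E2", "E2"],
    "process":     ["PC7", "PC10", "PC7", "PC8", "Other"],
})

frequencies = pd.crosstab(segments["participant"], segments["process"])
frequencies["Total"] = frequencies.sum(axis=1)
print(frequencies)
```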
Table 3 provides insights into the cognitive processes used by the participants while they were answering the MELAB reading items on Form E and Form F. For both forms, the cognitive process most frequently inferred from the verbal reports was PC7 (scanning for details/matching the question to the relevant information in the text). The participants taking Form E reported this process 77 times in total, and the participants taking Form F reported it 67 times in total. The second and third most frequently reported processes were PC10 (evaluating alternative options) and PC8 (drawing text-based inferences), respectively. More cognitive processes could be inferred from the Form F participants' verbal reports than from the Form E participants' verbal reports. Among the participants taking Form E, E1 reported the highest number of processes (a total of 72), and these processes covered all categories in the proposed cognitive processing model. The processes most frequently reported by this participant included PC7 (scanning details), PC10 (evaluating alternative options), PC1 (identifying word and word meaning), and PC8 (drawing inferences). This participant correctly answered 16 of the 20 items. Among
the participants taking Form F, F4 reported the highest number of processes (a total of 71) and these processes covered all categories in the cognitive model but PC4 (speculating beyond the text). The processes frequently reported by this participant included PC7 (scanning details), PC10 (evaluating alternative options), PC8 (drawing inferences), and PC2 (using syntax knowledge). This participant correctly answered 19 of the 20 items. The participants reporting the lowest number of processes on each form were the ones who scored the lowest in the group of participants taking that form. Participant E4 reported a total of 32 processes and scored 13 correct out of 20 on the Form E reading section, and participant F2 reported a total of 48 processes and scored 11 correct out of 20 on the Form F reading section. Additional processes obtained from the participants’ verbal reports can be classified into three categories. The first category includes metacognitive and metalinguistic strategies, such as deciding an answer after all options are evaluated, translating into Chinese, going back and forth between text and items, marking the text as reading to help locate the information when answering the questions, skipping specialized nouns, answering easier items first, being aware of the processes used, analyzing what the question assesses, and switching to other processes (e.g., evaluating options) to save time when one process doesn’t work (e.g., the required information can’t be found in text). The second category includes construct-irrelevant processes, such as random guessing or guessing based on a constructed situation model or prior knowledge. The third category is related to affect and memory, such as “I find it hard to concentrate at the beginning,” “I like scientific text,” and “I cannot remember where I read this in the text.” While these data provided invaluable insights into examinees’ item solving processes, they were not added to the initial cognitive processing model, given the considerations that (1) the use of these processes in item solving varied from person to person and from item to item, (2) they were hard to code for consideration in statistical models, and (3) they were not included in the constructs assessed by the MELAB reading section. Next, to validate the components of the cognitive model and the item features coded by the raters, the processes used by the participants who correctly answered each item were summarized and compared to the final set of consensus codes obtained from item coding. The results of this comparison are presented in Appendix C. An examination of Appendix C reveals several major points. First, the cognitive processes inferred from the participants’ verbal reports provide evidence that correctly answering an item often requires multiple processes. Second, the processes used to correctly answer the reading items on both forms covered all components of the proposed cognitive processing model. Third, the processes reported by the participants who correctly answered each item supports the coding of the MELAB items in terms of the cognitive processes required to correctly answer each item. Of a total of 160 features coded for the reading items on Form E (20 items x 8 variables), 117 item features (73.1%) are supported by the verbal report data. When the process of using pragmatic knowledge, which was reported infrequently by the participants, is excluded, 112 of the remaining 140 item features (80.0%) are supported by the verbal data. 
Likewise, of a total of 160 features coded for the reading items on Form F (20 items x 8 variables), 111 item features (69.4%) are supported by the verbal data, and when the process of using pragmatic knowledge is excluded, 106 of the remaining 140 item features (75.7%) are supported by the verbal report data. Hence, the item features were considered to be reasonably coded, and no further modifications were made to the final set of consensus item codes.
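The overall support figures reported above (73.1% and 69.4%), as well as the per-variable f and % values in Table 4 below, can be computed by comparing two indicator matrices: one marking the features coded by the raters and one marking the features evidenced in the verbal reports. The sketch below assumes such an indicator matrix with invented values; note that Table 4 pools the 40 items from both forms, whereas this example uses a single 20-item form.

```python
# Minimal sketch (placeholder values): proportion of rater-coded item
# features that are supported by the verbal report data.
import numpy as np

variables = ["WordRecog", "Syntax", "TextOrg", "Pragmatic",
             "Locate", "Inference", "Synthesis", "Distractor"]

rng = np.random.default_rng(0)
# 20 items x 8 variables; True means the verbal reports for that item
# support the raters' coding of that variable (invented data).
supported = rng.random((20, len(variables))) < 0.75

print(f"overall support: {supported.mean():.1%}")    # cf. 73.1% and 69.4%

per_variable = supported.sum(axis=0)                 # cf. the f row of Table 4
for name, f in zip(variables, per_variable):
    print(f"{name:10s} f = {int(f):2d}  ({f / supported.shape[0]:.1%})")
```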
For each cognitive variable coded by the raters, the total number and proportion of items for which the verbal data included that feature are presented in Table 4. Proportion is defined as the number of items for which the coding could be validated by the verbal data divided by a total of 40 items coded on that variable. As Table 4 shows, the highest degree of correspondence between the item coding and the verbal data occurs on the variable Locate. All 40 items on both forms coded for this feature are supported by the verbal data. The lowest degree of correspondence between item coding and verbal data occurs on the variable Pragmatic. Of a total of 40 items coded for this feature, only 10 items could be validated by the verbal data. Generally, there is more overlap between the item codes and verbal data for the variables Locate, Distractor, Inference, and Synthesis than that for the variables Word, Syntax, Text Organization, and Pragmatic.
Table 4. Correspondence between Item Codes and Verbal Report Data
        Word Recog.   Syntax   Text Org.   Pragmatic   Locate   Infer.   Synthesis   Distractor
f           27          21        27          10         40       34        33           35
%           67.5        52.5      67.5        25.0       100.0    85.0      82.5         87.5
Results of Item Difficulty Modeling Psychometric Characteristics The psychometric characteristics of the MELAB reading items on Form E and Form F are summarized in Table 5. As the table shows, the psychometric characteristics of the two sections are comparable, though the reliability of the reading section on Form F is slightly lower than that of the reading section on Form E. The empirical data supports the parallelism of the two sections. Item parameter estimates were calibrated using the testlet response theory (TRT) model (Wang et al., 2002). Item difficulty parameter estimates for the reading items on Form E and Form F are presented in Appendix B.
Table 5. Descriptive Statistics and Reliability for the Two Reading Sections
Form       N    Minimum   Maximum   Median    Mean     SD    Reliability
E       1703      0.00      20.00    11.00   10.94   4.19           0.79
F       1044      1.00      20.00    11.00   10.71   3.65           0.71
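SCORIGHT carries out the actual Bayesian calibration of the TRT model; purely to make the response function given in the Method section concrete, the following sketch evaluates the probability of a correct response for one examinee on 20 items grouped into four five-item testlets. All parameter values are invented for illustration and are not estimates from the MELAB data.

```python
# Illustrative evaluation of the TRT item response function:
#   P(y_ij = 1 | theta_i) = c_j + (1 - c_j) * logistic(a_j * (theta_i - b_j - gamma_ig(j)))
import numpy as np

def trt_probability(theta, a, b, c, gamma):
    """Probability of a correct response under the testlet response model."""
    z = a * (theta - b - gamma)
    return c + (1.0 - c) / (1.0 + np.exp(-z))

n_items = 20
testlet_of_item = np.repeat(np.arange(4), 5)   # four passages, five items each

rng = np.random.default_rng(1)
a = rng.uniform(0.5, 2.0, n_items)             # discrimination (invented)
b = rng.normal(0.0, 1.0, n_items)              # difficulty (invented)
c = np.full(n_items, 0.2)                      # pseudo-guessing (invented)
gamma = rng.normal(0.0, 0.5, 4)                # one examinee's testlet effects
theta = 0.3                                    # that examinee's ability

p_correct = trt_probability(theta, a, b, c, gamma[testlet_of_item])
print(np.round(p_correct, 3))
```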
TBR Analyses
Separate TBR analyses were conducted on Form E and Form F. Both analyses started with nine predictors.
Form E. For Form E, five of the nine predictors entered the tree: Distractor (number of plausible distractors), Pragmatic (pragmatic knowledge), Syntax (syntax knowledge), Text Org. (knowledge of text organization), and Speword (proportion of specialized and infrequent words in the part of the text where the information for answering the question is located) (see Figure 1). Taken together, these five variables accounted for 90.4% of the total variance in item difficulty.
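The analyses reported here used the regression tree module in SPSS; as a hedged illustration of the same idea, the sketch below fits a CART-style regression tree to a 20 x 9 feature matrix and reports the proportion of difficulty variance explained and the relative importance of each predictor. The feature matrix and difficulty values are randomly generated placeholders, and scikit-learn is a stand-in for, not a replica of, the SPSS module.

```python
# Illustrative TBR analysis: regress item difficulty estimates on the coded
# cognitive features with a regression tree (placeholder data).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

predictors = ["WordRecog", "Speword", "Syntax", "TextOrg", "Pragmatic",
              "Locate", "Inference", "Synthesis", "Distractor"]

rng = np.random.default_rng(2)
X = rng.integers(0, 3, size=(20, len(predictors))).astype(float)  # coded features
b = 0.8 * X[:, 8] + 0.5 * X[:, 6] + rng.normal(0, 0.3, 20)        # toy difficulties

tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=2, random_state=0)
tree.fit(X, b)

# Proportion of difficulty variance explained by the fitted tree
# (analogous to the 90.4% reported for the initial Form E tree).
print(f"variance explained: {tree.score(X, b):.1%}")

# Relative importance of each predictor, normalized so the largest is 100%
# (analogous to Table 6).
importance = tree.feature_importances_
for name, value in sorted(zip(predictors, importance), key=lambda t: -t[1]):
    print(f"{name:10s} {value:.3f}  ({100 * value / importance.max():5.1f}%)")
```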
As some important predictors may be masked in the tree-building process, it is crucial to inspect the importance of the predictors to the model (Breiman, et al, 1984). Table 6 presents the importance of individual predictors in the item difficulty model built for Form E. As the table shows, the most important predictor in the model is Distractor. However, the predictor Inference, which did not appear in the model, is the second most important predictor, and is far more important than the remaining predictors. It is highly likely that this predictor was masked in the tree-building process and given the importance of the predictor Inference, it appeared unwarranted to exclude it from the model. Consequently, a new tree-building process was undertaken by successively adding in predictors based on their importance to the model. According to the statistical principle of parsimony (e.g., Kerlinger, 1979), two stopping rules were used: (1) the newly added predictor did not lead to a significant increase in the explained variance of item difficulty, and (2) the total variance in item difficulty was maximally explained. The model that explained the largest amount of variance in item difficulty with the least number of predictors was considered the most parsimonious and therefore used for interpretation. First, the most important predictor, Distractor, was used in the model and this predictor explained 41.4% of the total variance in item difficulty. Next, the predictor Inference was entered into the model. However, this predictor did not lead to any increase in the explained variance and 60% of the variance in item difficulty was left unexplained. It was likely that Inference was again masked in this tree-building process. Subsequently, the next three important predictors, Pragmatic, Speword, and Synthesis, were successively fed into the model. Table 7 displays the contribution of the top five important predictors to the explained variance in item difficulty.
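A sketch of the successive-predictor procedure just described, again with randomly generated placeholder data and a generic regression tree standing in for the SPSS module: predictors are added one at a time in order of their importance, and the gain in explained variance after each addition is tracked so that the two stopping rules can be applied.

```python
# Illustrative forward procedure: add predictors in order of importance and
# track the variance in item difficulty explained after each addition.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
names = ["Distractor", "Inference", "Pragmatic", "Speword", "Synthesis"]
X = rng.integers(0, 4, size=(20, len(names))).astype(float)        # placeholder codes
b = 0.7 * X[:, 0] + 0.4 * X[:, 2] + rng.normal(0, 0.3, 20)         # toy difficulties

previous_r2 = 0.0
for k in range(1, len(names) + 1):
    subset = X[:, :k]                            # the k most important predictors
    tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=2, random_state=0)
    r2 = tree.fit(subset, b).score(subset, b)
    print(f"+{names[k - 1]:10s} total R^2 = {r2:5.1%}  gain = {r2 - previous_r2:5.1%}")
    previous_r2 = r2
# Stopping rules from the text: stop adding predictors once a new predictor
# yields no meaningful gain, preferring the most parsimonious model that
# still explains the largest amount of variance in item difficulty.
```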
Figure 1. Initial Tree Diagram for the Reading Items Included in Form E.
Table 6. Importance of Individual Predictors in the Item Difficulty Model for Form E
Predictor      Importance   Normalized Importance (%)
Distractor        0.422            100.0
Inference         0.329             77.9
Pragmatic         0.207             49.1
Speword           0.169             40.0
Synthesis         0.122             28.9
Syntax            0.120             28.5
Locate            0.100             23.7
Text Org.         0.096             22.8
Word Recog.       0.049             11.7
Table 7. The Contribution of the Predictors to Explained Variance on Form E
Predictor    Total Variance Explained (%)   Unique Variance Explained (%)
Distractor              41.4                            41.4
Inference               41.4                             0.0
Pragmatic               85.8                            44.4
Speword                 90.7                             4.9
Synthesis               91.1                             0.4
As Table 7 shows, the predictor Inference did not increase the explained variance in item difficulty. However, when the predictors Inference and Pragmatic were fed into the model, a drastic increase was achieved in the explained variance (44.4%). Given the importance of the predictors Inference and Pragmatic and the contribution of both predictors to the improvement of the model, this increase in the variance explained likely resulted from the joint contribution of Inference and Pragmatic. The predictor Speword explained an additional 4.9% of the total variance in item difficulty. However, when the predictor Synthesis was fed into the model, there was virtually no increase in the explained variance (0.4%). Therefore, for Form E, the tree was built with four predictors: Distractor, Inference, Pragmatic, and Speword, which accounted for 90.7% of the total variance in item difficulty. Figure 2 presents the tree diagram for Form E with these four predictors, showing the mean, standard deviation, and number of items for each node. As the figure shows, Distractor produced one split at the first level and one at the second level. Both splits indicated that the items with more plausible distractors tended to be more difficult than the items with fewer plausible distractors. At the second level, Pragmatic produced a split, indicating that the items requiring more pragmatic knowledge (e.g., analyzing the author's opinion and extrapolating) tended to be more difficult than the items requiring less of such knowledge (e.g., identifying facts). Similarly, at the third level, Inference produced a split, indicating that the items requiring high text-based inference or speculation beyond the text tended to be more difficult than the items requiring no or low text-based inference. At the third level, Speword produced another split, indicating that the items requiring processing of the part of the text
containing more specialized and infrequent words tended to be less difficult than the items requiring processing of the part of the text containing fewer such words.
Figure 2. Tree Diagram for the Form E Reading Items.
Form F. For Form F, three of the nine predictors entered the tree: Distractor, Inference, and Syntax. Taken together, these three predictors accounted for 94.5% of the total variance in item difficulty. Figure 3 presents the mean, standard deviation, and the number of items for each node. As the figure shows, Distractor produced the first split, separating the 20 items into two groups with different mean item difficulties. This split again indicated that the items with more plausible distractors tended to be more difficult than the items with less plausible distractors. At the second level, a split was made based on the predictor Inference, which again indicated that the items requiring high text-based inference tended to be more difficult than the items requiring no or low text-based inference. The predictor Syntax produced one split at both the second and the third level. These two splits indicated that the items requiring knowledge of complex or infrequent sentence structure tended to be more difficult than the items requiring knowledge of simple sentence structure. The importance of the predictors to the model for Form F was inspected and is presented in Table 8. As the table shows, the three predictors entering the tree model (i.e., Distractor, Inference, and Syntax) were the three most important predictors to the model. Hence, the tree built with the three predictors was taken as the final model for Form F.
Figure 3. Tree Diagram for the Form F Reading Items.
Table 8. Importance of Individual Predictors to the Model for Form F
Predictor      Importance   Normalized Importance (%)
Distractor        0.792            100.0
Inference         0.444             56.0
Syntax            0.157             19.8
Pragmatic         0.145             18.3
Speword           0.135             17.1
Locate            0.082             10.3
Word Recog.       0.040              5.1
Synthesis         0.012              1.5
Text Org.         0.007              0.9
Discussion
In this study, a three-pronged procedure was employed to develop and test a cognitive processing model hypothesized to underlie MELAB reading item performance. First, theoretical information regarding the L2 reading processes and reading ability constructs were reviewed. Next, to provide clear, faithful, and informative definitions of the cognitive processes involved in solving the MELAB reading items, cognitive demands of the items were analyzed and the cognitive processes that examinees might use to correctly answer the items were investigated. Finally, the proposed cognitive processes were validated through empirical evaluation of objective performance on the MELAB reading items using a cognitively based measurement model called tree-based regression. Summary and Discussion of the Findings Research question 1: What cognitive processes are required to correctly answer the MELAB reading items? Three raters independently coded the MELAB reading items on Form E and Form F in terms of the cognitive processes required to correctly answer each item. An examination of the rater consistency using G-theory indicated a fairly high level of rater agreement (ρ = 0.75), given that only three raters were used. Contrary to the results in Alderson and Lukmani (1989) and Alderson (1990), this finding appears to support that raters can reach agreement on the cognitive demands of an item. It appears that the use of rater training, a clearly defined rating instrument, extensive discussion, and exemplification of item coding in this study contributed to the agreement among the raters. Results of item coding show that correctly answering an item requires multiple cognitive processes, which provides evidence that solving a reading item involves simultaneous use of different cognitive components (Gorin, 2002). Results of item coding show that the cognitive processes required to correctly answer the MELAB reading items include word recognition skills, knowledge of syntax and text organization, pragmatic knowledge, skimming the text for gist, scanning the text for specific details, drawing inferences, synthesis, and evaluating alternative options, and that different cognitive processes are involved in solving different items. Results of the item coding in terms of the cognitive processes required to correctly answer the MELAB reading items are consistent with the constructs assessed by the MELAB reading section and support the cognitive processing model proposed in this study.
Research question 2: What cognitive processes are actually used by examinees when they correctly answer the MELAB reading items? How are they related to the findings in response to question 1? Results of the verbal report data analysis show that the cognitive processes that examinees might use to correctly answer the MELAB reading items include the use of word recognition skills, knowledge of sentence and text structure, prior knowledge, pragmatic knowledge, skimming the text for gist, scanning the text for specific details, drawing inferences, synthesis, evaluating alternative options, metalinguistic and metacognitive strategies, and testwiseness. For both MELAB forms, using prior knowledge beyond text and using pragmatic knowledge were the two cognitive processes least frequently reported by the participants. Given that using prior knowledge is irrelevant to the construct assessed by the MELAB reading items, it is no wonder that this process was reported least frequently. 27
A comparison of the cognitive processes coded for each item to the cognitive processes inferred from the verbal reports for each item found a high degree of match between the two sources of data. Of a total of 320 item features coded for both forms (40 items x 8 cognitive variables), 228 item codes (71.3%) matched the cognitive processes inferred from the verbal reports for the corresponding item. The match occurs more frequently on the processes of locating/scanning specific details, evaluating alternative options, inference, and synthesis than on the processes of identifying word meaning, using text organization, syntactic, and pragmatic knowledge. The inconsistencies between the cognitive processes coded by the raters and the cognitive processes inferred from the verbal reports should not lead to hasty judgments about the untrustworthiness of the raters’ coding. Leighton (2004) warned that verbal reports were sensitive to the demands of the task, and that they were difficult to obtain when “the task used to elicit the reports was exceedingly difficult or called upon automatic processes” (p. 12). The participants in this study were advanced-level adult L2 learners who need to use English for university-level academic studies. They were considered to (1) have mastered the basic word recognition skills, vocabulary, sentence structure, text organization, and pragmatic knowledge required for reading academic text in English, and (2) be literate in their L1 and able to use various cognitive strategies already developed from reading in their L1 to facilitate their reading in L2 (Koda, 2005; Urquhart & Weir, 1998). Hence, it is likely that the processes related to basic English language knowledge, such as word recognition skills, sentence and text structure knowledge, and pragmatic knowledge, have become automatic to this group of participants, while the processes related to cognitive skill and problem-solving strategies, such as locating specific information, evaluating alternative options, inference, and synthesis, were consciously used by the participants when answering the MELAB reading items. Given that the controlled processes rather than the automatic processes are accessible for description through verbal reporting, analyzing the cognitive demands of an item before collecting the verbal reports anticipated the automatic and the controlled processes evoked by the test items and provided valuable information to supplement the verbal report data (Ericsson & Simon, 1993; Leighton, 2004). A combination of cognitive analysis of the items and verbal reports in the current study provided an opportunity to triangulate the processes involved in item solving, and to better determine the components of the cognitive processing model and the item features for consideration in the statistical model. Research question 3: To what extent do the cognitive processes used to correctly answer the MELAB reading items explain the empirical indicators of item difficulty? The TBR analysis on the two forms did not converge. For Form E, four predictors explained 90.7% of the total variance in item difficulty. These four predictors were Distractor, Inference, Pragmatic, and Speword, which were, respectively, related to the cognitive processes of evaluating alternative options, drawing inferences, using pragmatic knowledge, and processing academic text with specialized and infrequent words. 
The results of the TBR analysis on Form E indicated that the items requiring higher-level reasoning skills to make decisions regarding the response options, advanced pragmatic knowledge, and processing of texts with fewer specialized and infrequent words tended to be more difficult. The finding about specialized and infrequent words in text appears counterintuitive and needs to be replicated using other test forms. The verbal report data may shed some light on the reason for this finding. The participants E1, E4, E5, and F3 all indicated that specialized and infrequent
words, especially nouns, did not affect their reading or item solving, as such words could be skipped during reading and used as key words to locate the requested information in the text when answering the items. For Form F, three predictors explained 94.5% of the total variance in item difficulty. These three predictors were Distractor, Inference, and Syntax, which were, respectively, related to the cognitive processes of evaluating alternative options, drawing inferences, and using syntax knowledge. The results of the TBR analysis on Form F indicated that items requiring higher level reasoning skills to make decisions regarding the response options and knowledge of complex sentence structures tended to be more difficult. The inconsistent prediction of item difficulty across the forms indicates that item features likely differed among test forms due to the passages used and the nature of items included. This finding speaks for the complexity of item analysis and reminds us that caution needs to be exerted when interpreting reading performance on different test forms. Results of this study showed that while the statistical properties (e.g., descriptive and reliability) of Form E and Form F supported their parallelism, the cognitive processes elicited by the items on the two forms were not identical. Hence, besides analyzing the statistical properties of a test, substantive evidence regarding the nature of constructs assessed by the test needs to be sought to better understand the validity of the test. In addition, to ensure parallelism of test forms, tests may be constructed based on predetermined cognitive processes defined from a cognitive model. As Gorin (2002) recommended, an effective strategy for constructing and evaluating reading test items may be integrating statistical analysis with substantive analysis of the items. While the two TBR analyses conducted in this study produced somewhat divergent results, both TBR models were relevant to the theoretical constructs of the MELAB reading section and accounted for a substantial amount of the variance in item difficulty. In addition, the pattern of agreement between the two analyses shed some light on which of the construct-relevant item features most likely affected the performance on the MELAB reading items, which could be used to guide test development and item analysis. For example, both TBR analyses indicated that the items with more plausible distractors tended to be more difficult than the items with less plausible distractors, and that the items requiring high text-based inference or speculation beyond the text tended to be more difficult than the items requiring no or low text-based inference. Such item features were consistent with the components of evaluating alternative options to decide the one that best fit, drawing text-based inferences, and speculating beyond the text in the cognitive processing model proposed in this study. In this sense, the TBR models provided evidence that this cognitive processing model was capable of describing the cognitive processes underlying the MELAB reading item performance and suggesting cognitively based mechanisms for designing new reading items (Gorin, 2002; Embretson, 1999). Practical Implications The cognitive processing model proposed in this study has implications for the construct validity of MELAB reading as a measure of L2 reading proficiency required for college-level academic study (Embretson, 1998; Gorin, 2002; Huff, 2003). 
In addition, results of this study may guide test developers to design cognitively based reading items (Enright, Morley, & Sheehan, 2002). As Gitomer and Rock (1993) suggest, “improved test design consists of building items that are constructed on the basis of an underlying theory of problem-solving performance” (p. 265). Most importantly, results of this study can be used to
develop descriptive score reports and lay a foundation for the MELAB as a diagnostic measure. The TBR item difficulty model developed in this study produced clusters of items requiring similar cognitive processes. By summarizing examinee performance against item clusters, the TBR item difficulty models can be used to generate group- and examinee-level proficiency profiles (Sheehan, 1997). In this manner, large-scale language testing programs will be able to provide more meaningful feedback to score users about examinees' strengths and weaknesses in particular reading skills, suggest areas for improvement, and target instruction to individual needs (DiBello & Crone, 2001; Huff, 2003; Sheehan, 1997; Wainer, Sheehan, & Wang, 2000).
Limitations and Directions for Future Research
One limitation of this study is that only 40 reading items included in two forms were used to develop the cognitive processing model. To obtain reliable item features affecting MELAB reading item performance, a larger number of items from more test forms need to be examined. A second limitation is that the item features were coded by three raters with experience in teaching reading to adult ESL/EFL learners and validated using a small group of Chinese students enrolled in a university-level program. Hence, the item features obtained may not represent the cognitive processes of examinees from other language backgrounds and proficiency levels. As item difficulty is affected by the interaction between examinee and test task (Bachman, 2002), it is highly likely that item difficulty varies across language groups. Therefore, a promising area of research for the MELAB is to examine item difficulty conditioned on language background to determine whether the items perform differently for different language groups. In addition, a larger sample size for testing the cognitive processes through verbal reports may reveal more meaningful information and increase the correspondence between the item features coded by raters and the item features inferred from verbal reports. A third limitation is that the cognitive model proposed in this study did not include the metacognitive and metalinguistic strategies reported by the verbal report participants, given the difficulty of coding such strategies and including them in statistical models. Future research may examine the relationship between such strategies and reading item difficulty, and include them in the cognitive processing model. Finally, this study validated the proposed construct by examining empirical indicators of item difficulty. Modeling other item statistics, such as item discrimination, in terms of the cognitive processes involved in item solving may reveal more meaningful information for test developers.
Conclusion
In this study, a model of cognitive processes underlying MELAB reading item performance was developed and tested. The model linked substantive theories in the domain of L2 reading to the MELAB reading items. The incorporation of theoretical information regarding L2 reading processes and substantive analysis of the reading items, which is lacking in current research on the MELAB, will make possible theory-based test development and score interpretations. Moreover, the integration of cognitive theories of L2 reading and a cognitively based measurement model contributes to our understanding of the relationship between item features and item difficulty, informs the design of cognitively based reading items, and lays a foundation for the MELAB as a diagnostic measure. Finally, the three-pronged procedure used to develop and validate the cognitive model, that is, analysis of an
item’s cognitive demands to explore the automatic versus controlled processes evoked by test items, collection of verbal reports to investigate the actual cognitive processes used by examinees when answering test items, and TBR to model item performance, promotes the union of cognitive psychology and assessment in the field of second/foreign language testing. Acknowledgements
I would like to express my sincere thanks to the English Language Institute of the University of Michigan for funding this project, providing the data and materials on the MELAB reading, and editing this working paper. My heartfelt appreciation is extended to the raters, the verbal report participants, and my colleague for their tremendous support. References
Abbott, M. L. (2005). English reading strategies: Differences in Arabic and Mandarin speaker performance on the CLBA reading assessment. Unpublished doctoral dissertation, University of Alberta, Edmonton, Alberta, Canada. Alderson, J. C. (1990). Testing reading comprehension skills (part one). Reading in a Foreign Language, 6, 425–438. Alderson, J. C. (2000). Assessing reading. Cambridge, UK: Cambridge University Press. Alderson, J. C. (2005a, July). The challenge of diagnostic testing: Do we know what we are measuring? Plenary presented at the annual meeting of the Language Testing Research Colloquium, Ottawa, Canada. Alderson, J. C. (2005b). Diagnosing foreign language proficiency. London: Continuum. Alderson, J. C., & Lukmani, Y. (1989). Cognition and reading: Cognitive levels as embodied in test questions. Reading in a Foreign Language, 5, 253–270. Allan, A. (1992). EFL reading comprehension test validation: Investigating aspects of process approaches. Unpublished doctoral dissertation, Lancaster University, Lancaster, UK. Anderson, R. C. (1972). How to construct performance tests to assess comprehension. Review of Educational Research, 42, 145–170. Anderson, N., Bachman, L., Perkins, K., & Cohen, A. (1991). An exploratory study into the construct validity of a reading comprehension test: Triangulation of data resources. Language Testing, 8, 41–66. Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press. Bachman, L. F. (2002). Some reflections on task-based language performance assessment. Language Testing, 19, 453–476. Bachman, L. F., Davidson, F., & Milanovic, M. (1996). The use of test method characteristics in the content analysis and design of EFL proficiency tests. Language Testing, 13, 125–150. Bachman, L. F., Davidson, F., Ryan, K., & Choi, I. (1995). An investigation into the comparability of two tests of English as a foreign language. Cambridge, UK: Cambridge University Press. Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.
Appendix A
Consensus Codes for the MELAB Reading Items
Note: Codes are listed column by column; each labeled row below gives the values for items E/1–E/20 followed by F/1–F/20.
Form/Item: E/1 E/2 E/3 E/4 E/5 E/6 E/7 E/8 E/9 E/10 E/11 E/12 E/13 E/14 E/15 E/16 E/17 E/18 E/19 E/20 F/1 F/2 F/3 F/4 F/5 F/6 F/7 F/8 F/9 F/10 F/11 F/12 F/13 F/14 F/15 F/16 F/17 F/18 F/19 F/20
Word Recog.: 2.00 2.00 2.00 1.00 2.00 2.00 1.00 2.00 .00 .00 2.00 1.00 1.00 2.00 1.00 2.00 1.00 .00 1.00 1.00 1.00 2.00 1.00 1.00 1.00 1.00 1.00 2.00 2.00 2.00 1.00 1.00 2.00 2.00 2.00 1.00 1.00 2.00 2.00 2.00
% Dif. Words: 19.10 16.22 21.22 23.81 28.36 18.18 24.00 17.21 16.32 19.12 23.47 25.00 41.67 31.92 32.00 25.81 22.44 10.00 12.87 12.12 27.78 20.37 12.50 31.58 25.00 22.61 20.31 25.58 19.04 20.31 19.48 22.22 17.64 23.81 18.19 13.23 10.00 25.00 13.80 8.70
Syntax: 2.00 2.00 1.00 2.00 2.00 .00 1.00 1.00 1.00 .00 1.00 2.00 1.00 2.00 2.00 2.00 1.00 2.00 2.00 2.00 1.00 2.00 2.00 1.00 1.00 1.00 1.00 2.00 2.00 2.00 1.00 2.00 2.00 2.00 2.00 1.00 2.00 2.00 1.00 2.00
Text Org.: 2.00 .00 2.00 .00 2.00 .00 .00 2.00 .00 .00 .00 .00 2.00 2.00 .00 1.00 .00 1.00 2.00 .00 1.00 1.00 .00 .00 .00 2.00 1.00 2.00 2.00 1.00 2.00 1.00 1.00 2.00 1.00 1.00 .00 1.00 2.00 .00
Pragmatic Knowledge: .00 .00 4.00 3.00 3.00 .00 .00 1.00 .00 4.00 3.00 1.00 3.00 2.00 4.00 3.00 3.00 3.00 .00 3.00 .00 .00 .00 3.00 2.00 .00 3.00 4.00 4.00 3.00 2.00 3.00 3.00 3.00 3.00 2.00 3.00 1.00 1.00 .00
Locate: 1.00 2.00 .00 1.00 1.00 2.00 1.00 .00 1.00 .00 1.00 2.00 1.00 1.00 2.00 2.00 2.00 2.00 1.00 1.00 2.00 1.00 2.00 2.00 1.00 .00 1.00 2.00 1.00 1.00 .00 1.00 2.00 1.00 1.00 .00 2.00 2.00 1.00 1.00
Inference: 2.00 1.00 2.00 1.00 2.00 1.00 1.00 2.00 .00 3.00 .00 1.00 .00 1.00 1.00 1.00 1.00 1.00 2.00 1.00 .00 1.00 .00 1.00 .00 2.00 2.00 1.00 1.00 1.00 2.00 2.00 2.00 2.00 2.00 1.00 .00 2.00 2.00 1.00
Synthesis: 2.00 .00 2.00 .00 1.00 .00 .00 2.00 .00 .00 .00 .00 1.00 1.00 .00 .00 1.00 1.00 2.00 1.00 1.00 1.00 .00 .00 1.00 2.00 1.00 1.00 2.00 1.00 2.00 1.00 2.00 2.00 2.00 2.00 .00 1.00 2.00 1.00
Distractor: 1.33 3.00 1.67 .67 2.00 2.33 1.33 2.33 .00 2.33 .67 2.00 .00 1.33 2.00 2.00 2.00 1.00 2.00 1.67 .00 1.00 1.33 2.67 1.00 2.33 1.67 1.00 1.00 1.00 1.33 3.00 3.00 2.33 2.33 .33 1.67 1.67 1.33 1.67
Appendix B
Item Difficulty Parameter Estimates for the Reading Items on Form E and Form F
Form/Item  Item Difficulty (b)    Form/Item  Item Difficulty (b)
E1    .26    F1   -.71
E2   -.30    F2   -.62
E3   -.62    F3   -.42
E4   -.72    F4   1.11
E5   1.24    F5  -1.16
E6   -.25    F6    .47
E7   -.46    F7    .84
E8    .85    F8   -.86
E9  -1.16    F9    .09
E10  1.54    F10  -.45
E11  -.23    F11   .43
E12   .09    F12  1.39
E13 -1.01    F13  1.48
E14   .47    F14  1.11
E15  1.31    F15  1.78
E16   .65    F16 -1.03
E17  1.88    F17  1.37
E18   .19    F18  1.57
E19   .67    F19   .54
E20   .23    F20  1.53
Appendix C
Comparison of Item Coding and Actual Processes Involved in Correctly Answering Each Item
Note: Codes are listed column by column; each labeled row below gives the values for items E/1–E/20 followed by F/1–F/20.
Form/Item: E/1 E/2 E/3 E/4 E/5 E/6 E/7 E/8 E/9 E/10 E/11 E/12 E/13 E/14 E/15 E/16 E/17 E/18 E/19 E/20 F/1 F/2 F/3 F/4 F/5 F/6 F/7 F/8 F/9 F/10 F/11 F/12 F/13 F/14 F/15 F/16 F/17 F/18 F/19 F/20
Word Recog.: *2.00 *2.00 *2.00 1.00 2.00 *2.00 1.00 2.00 *0.00 *0.00 *2.00 1.00 *1.00 *2.00 *1.00 *2.00 *1.00 *0.00 *1.00 1.00 *1.00 *2.00 *1.00 *1.00 *1.00 *1.00 1.00 *2.00 2.00 *2.00 1.00 *1.00 *2.00 2.00 *2.00 1.00 1.00 *2.00 2.00 *2.00
Syntax: *2.00 2.00 1.00 2.00 2.00 *0.00 *1.00 1.00 1.00 *0.00 *1.00 *2.00 1.00 2.00 *2.00 2.00 1.00 *2.00 *2.00 2.00 *1.00 2.00 *2.00 *1.00 *1.00 1.00 1.00 2.00 *2.00 *2.00 1.00 *2.00 2.00 *2.00 *2.00 1.00 *2.00 *2.00 1.00 *2.00
Text Org.: 2.00 *0.00 *2.00 *0.00 *2.00 *0.00 *0.00 *2.00 *0.00 *0.00 *0.00 *0.00 *2.00 *2.00 *0.00 1.00 *0.00 *1.00 2.00 *0.00 1.00 1.00 *0.00 *0.00 *0.00 2.00 1.00 2.00 *2.00 1.00 *2.00 *1.00 1.00 2.00 1.00 *1.00 *0.00 *1.00 2.00 *0.00
Pragmatic Knowledge: 0.00 0.00 4.00 *3.00 3.00 0.00 0.00 *1.00 *0.00 *4.00 3.00 1.00 3.00 2.00 4.00 3.00 3.00 3.00 *0.00 3.00 0.00 0.00 0.00 3.00 2.00 *0.00 3.00 4.00 4.00 3.00 *2.00 3.00 3.00 3.00 *3.00 *2.00 3.00 1.00 *1.00 0.00
Locate: *1.00 *2.00 *0.00 *1.00 *1.00 *2.00 *1.00 *0.00 *1.00 *0.00 *1.00 *2.00 *1.00 *1.00 *2.00 *2.00 *2.00 *2.00 *1.00 *1.00 *2.00 *1.00 *2.00 *2.00 *1.00 *0.00 *1.00 *2.00 *1.00 *1.00 *0.00 *1.00 *2.00 *1.00 *1.00 *0.00 *2.00 *2.00 *1.00 *1.00
Inference: *2.00 1.00 *2.00 *1.00 *2.00 1.00 1.00 *2.00 *0.00 *3.00 *0.00 1.00 *0.00 *1.00 *1.00 *1.00 *1.00 *1.00 *2.00 *1.00 *0.00 *1.00 *0.00 *1.00 *0.00 *2.00 *2.00 *1.00 *1.00 *1.00 *2.00 *2.00 *2.00 *2.00 2.00 1.00 *0.00 *2.00 *2.00 *1.00
Synthesis: *2.00 *0.00 *2.00 *0.00 1.00 *0.00 *0.00 *2.00 *0.00 *0.00 *0.00 *0.00 *1.00 1.00 *0.00 *0.00 *1.00 *1.00 *2.00 *1.00 *1.00 *1.00 *0.00 *0.00 *1.00 *2.00 1.00 *1.00 *2.00 1.00 *2.00 1.00 2.00 *2.00 *2.00 *2.00 *0.00 *1.00 *2.00 1.00
Distractor: *1.33 *3.00 *1.67 *0.67 2.00 *2.33 1.33 *2.33 *0.00 *2.33 *0.67 *2.00 *0.00 *1.33 2.00 *2.00 *2.00 *1.00 *2.00 *1.67 *0.00 *1.00 1.33 2.67 *1.00 *2.33 *1.67 *1.00 *1.00 *1.00 *1.33 *3.00 *3.00 *2.33 *2.33 *0.33 *1.67 *1.67 *1.33 *1.67
* Cognitive processes reported by the participants who correctly answered the item.
Validation and Invariance of Factor Structure of the ECPE and MELAB across Gender Shudong Wang Harcourt Assessment Inc.
The purpose of this study is twofold: (1) to validate the internal structure of the Examination for the Certificate of Proficiency in English (ECPE) and the Michigan English Language Assessment Battery (MELAB), and (2) to examine the invariance of the factor structures of both the MELAB and the ECPE across gender. For both the MELAB and the ECPE, a one-factor, or one-dimensional model was postulated and tested. The results for both tests support one-factor models. The study results also show that the internal structure of the MELAB and the ECPE are equivalent across male and female examinees, which implies that the two tests are fair across gender groups. This study supports the claim that the total score of the MELAB measures “proficiency in English as a second language for academic study” (English Language Institute, 2003) and the claim that the total score of the ECPE measures English language proficiency for admission to North American colleges and universities.
The construct underlying a test is a theoretical representation of the underlying trait, concept, attribute, process, or structures that the test is designed to measure (Cronbach, 1971; Messick, 1989). Factorial validity (Guilford, 1946), or the investigation of the factor structure underlying a test, can be a valuable component of validity evidence (Messick, 1995). Validity, according to the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999), is the most important consideration in test development and evaluation. Fairness is also required by the Standards; “Regardless of the purpose of testing, fairness requires that all examinees be given a comparable opportunity to demonstrate their standing on the construct(s) that test is intended to measure” (p. 74). In seeking evidence of test fairness, the researcher should address whether the test measures the same construct in all relevant subgroups of the populations. Fairness is closely related to the factor structure validity of the test. Factorial structure analysis can be used not only to evaluate the dimensionality of an exam, but also to provide evidence of fairness. Similarity of factor structure across gender groups, for example, suggests that the test measures the same construct(s) for males and females. Different factor structures could imply that different constructs are being measured for the two groups. If evidence of differential factor structures is found, further investigation is needed. Differential factor structures for subgroups of examinees per se cannot tell which group’s scores are more valid, nor can they explain why group differences occur. They can only serve as a flag to identify where psychological constructs may be structured differently over different subpopulations.
One of the goals of construct validation of test scores is to capture the important aspects of the internal construct. Several studies (Jiao, 2004; Saito, 2003; Wagner, 2004) have been conducted to examine the internal construct validity or dimensionality of the Examination for the Certificate of Proficiency in English (ECPE) and of the Michigan English Language Assessment Battery (MELAB). However, these studies only focused on partial sections of the tests, such as the Cloze section (Jiao, 2004; Saito, 2003), the Listening section (English Language Institute, 1994; 2003; Wagner, 2004), and the Grammar/Cloze/ Vocabulary/Reading (GCVR) section (English Language Institute, 1994; 2003; Jiao, 2004). Because both the ECPE and the MELAB report the total/average/final test (scale) scores as major evidence for their uses and interpretations (in fact, the ECPE is awarded only to those who obtain passing scores on all five sections), it is very important to gather internal structure validity evidence to support the claim that the total score of MELAB really measures the “proficiency in English as a second language for academic study” (English Language Institute, 2003) and the claim that the total score of ECPE measures English language proficiency for admission to North American colleges and universities. Despite substantial investment in test development and the establishment of content validity of both the ECPE and the MELAB, there is surprisingly little published research describing factorial or internal construct validity of the whole tests. Previous studies (English Language Institute, 1994; 2003; Saito, 2003; Wagner, 2004) did report factor analysis results of the Listening and GCVR sections in the MELAB and of Listening and Cloze sections in the ECPE. However, the analysis was done at either item level or subtest level using testlet or component scores. This study makes full use of information from multiple subtests and examines English language proficiency construct validity taking these subtests as a whole, and thus offers a new perspective to evaluate construct validity. Furthermore, this study tests the degree of construct equivalence across gender groups. The purposes of this study are first to validate the internal structure of the ECPE and the MELAB, and then to examine the invariance of factor structure of both the MELAB and the ECPE across gender. Method Sample and Instrument MELAB The MELAB data used in this study are from 216 examinees who took one particular combination of Listening and GCVR test forms, referred to here as Form X and Form Y. The testing for both forms took place from April 4, 2003, to June 6, 2004. There are 19 possible Composition scores, ranging from 1 to 19 for analysis purposes in this study. In addition to a Composition item (essay), there are 150 multiple-choice (MC) items for measuring other language skills. MC items 1 through 50 are 3-option Listening items. Among these Listening items, there are 10 short question items, 16 short statement items, 8 emphasis items, 7 lecture comprehension items (from one lecture), and 9 conversation comprehension items (from one conversation), administered in sequence. The remaining 100 4-option MC items measure Grammar, Cloze, Vocabulary, and Reading (GCVR). Among the 100 items, there are 30 Grammar items, 20 Cloze items (from one passage), 30 Vocabulary items (the first 14 are synonym type, the next 16 are completion type), and 20 Reading comprehension items (5 each from four passages).
ECPE The ECPE data were collected from 2011 examinees during the 2005 administration (data with missing values were deleted). The data were from test centers mostly located in North and South America, while the largest group of examinees, tested in Greece, was not included in this study. There is one speaking item with a rating scale from 1 to 4. Among the total 150 MC items, 50 are Listening items and 100 are GCVR items. Included in the Listening items are 14 short conversation items, 21 short question items, and 15 radio interview items. Included in the 100 GCVR items are 30 Grammar items, 20 Cloze items, 30 Vocabulary items, and 20 Reading comprehension items. MELAB and ECPE Data Analysis To investigate the factor structure of the MELAB and the ECPE and the equivalence of the factor structure for each test across gender groups, a series of analyses were conducted, as follows. First, descriptive statistics, internal consistency, and intercorrelations of raw scores of subtests/tests were used to provide general information about the test scores. Second, a series of exploratory factor analyses (EFA) using classical factor analysis procedures was conducted for the internal structural validity study. For the EFA of the MELAB, the potential models include the measurement models that use subtests (Writing, Listening, Grammar, Cloze, Vocabulary, Reading, and Speaking) and sub-subtests (Writing, short question, short statement, emphasis, lecture, Grammar, Cloze, synonym completion, reading 1–4, and Speaking) as observed variables. For the EFA of the ECPE, the potential models include the measurement models that use subtests (Speaking, Listening, Listening Interview, Grammar, Cloze, Vocabulary, and Reading) and sub-subtests (Speaking, short conversations, short questions, listening radio interview 1–3, Grammar, Cloze, Vocabulary, and reading comprehension passages 1–4) as observed variables. Third, after identifying a potential model that best explains the data in terms of theory and model fit, a confirmatory factor analysis (CFA) using structural equation modeling (SEM) was used to test the invariance of the factorial model. For the purpose of cross-validation, subjects were randomly split into two samples to form a calibration and a validation sample (Byrne, 2001). One of the purposes for using a cross-validation strategy is to assess the reliability of model fit. Having chosen a SEM model that is best for a particular sample of examinees, it is not proper to automatically assume that this SEM model can be reliably applied to other samples of the same population. However, the model that fits the data using the calibration sample can be further validated by using another sample from the same population. In order to evaluate the adequacy of the factor models to fully account for the relationships among observed variables, a series of SEMs with the maximum likelihood estimation was conducted on the calibration sample. Once model fit for each calibration sample was determined, the invariance of the model structure for the validation samples was investigated across gender. All tests of model invariance begin with a global test of the equality of covariance structures across groups (Joreskog, 1971). The data for all groups were analyzed simultaneously to obtain efficient estimates (Bentler, 1995). Then, a series of nested constraints was equally applied to the same parameters across gender groups in order to detect the configuration and factor pattern difference across gender groups. 
The constraints used include, from weaker to stronger: (1) model structure, (2) model structure and factor loadings, and (3) model structure, factor loadings, and unique variance.
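As a concrete illustration of the cross-validation step described above, the sketch below shows one way the random calibration/validation split could be carried out before the models are fitted in SEM software. The array shape, the seed, and the 50/50 split are illustrative assumptions, not details taken from the study itself.

```python
import numpy as np

rng = np.random.default_rng(seed=0)        # fixed seed so the split is reproducible
scores = rng.normal(size=(216, 6))         # placeholder for a 216-examinee x 6-subtest score matrix

n = scores.shape[0]
shuffled = rng.permutation(n)              # random ordering of examinee indices
calibration = scores[shuffled[: n // 2]]   # calibration sample
validation = scores[shuffled[n // 2:]]     # validation sample

# Each half would then be fitted separately in the SEM software (AMOS in this study):
# the one-factor model on the calibration sample first, and then the same model,
# plus its increasingly constrained versions for the gender comparison, on the validation sample.
```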
Evaluation of Model Fit
Changes in goodness-of-fit statistics were examined to detect differences in structure parameters. Several well-known goodness-of-fit indices were used to evaluate model fit: the chi-square (χ2), the comparative fit index (CFI), the unadjusted goodness-of-fit index (GFI), the normed fit index (NFI), the Tucker-Lewis index (TLI), the root mean square error of approximation (RMSEA), and the standardized root mean square residual (SRMR). Goodness-of-fit (GOF) indices provide “rules of thumb” for the recommended cutoff values used to evaluate data-model fit. Hu and Bentler (1999) recommend using combinations of GOF indices to obtain a robust evaluation of model fit. The criterion values they list for a model with good fit in structural equation modeling are CFI > 0.95, TLI > 0.95, RMSEA < 0.06, and SRMR < 0.08. Hu and Bentler offer cautions about the use of GOF indices, yet current practice seems to have incorporated their guidelines without sufficient attention to the limitations they noted. Moreover, some researchers (Beauducel & Wittmann, 2005; Fan & Sivo, 2005; Marsh, Hau, & Wen, 2004; Yuan, 2005) believe that these cutoff values are too rigorous and that Hu and Bentler's results may have limited generalizability to the levels of misspecification experienced in typical practice. A more common “good enough” or “rough guideline” approach is that absolute and incremental fit indices (such as CFI, GFI, NFI, and TLI) should be above 0.90 (the 0.90 benchmark), and that values of indices based on the residuals matrix (such as RMSEA and SRMR) below 0.10 or 0.05 are usually considered adequate. For the group comparisons with increased constraints, the χ2 value provides the basis of comparison with the previously fitted model. A nonsignificant difference in χ2 values between nested models indicates that all equality constraints hold across the groups; the measurement model therefore remains invariant across groups as the constraints are increased. Sample size must be taken into account, however, in interpreting a significant χ2: a significant χ2 does not necessarily indicate a departure from invariance when the sample size is large. All analyses were conducted using AMOS 4.0 (Arbuckle & Wothke, 1999) and SAS. All models were identified by fixing the variance of the single factor at 1.0.
Results
Summary Descriptive Statistics
Table 1 summarizes the n-counts, median, minimum, maximum, range, and the first four moments describing the distributions of subtest and test raw scores for the MELAB by group (total, male, and female). The four moments are the mean, standard deviation, skewness, and kurtosis. Table 2 provides the same information for the ECPE. The n-counts are unequal across gender for both the MELAB and the ECPE; for the MELAB, female examinees have slightly higher mean test scores than male examinees, while for the ECPE, the mean test score of male examinees is slightly higher than that of female examinees. For both tests, female examinees show less variation in test scores than male examinees.
Reliability of Subtests and Test Scores
Internal consistency coefficients were computed for the subtests and the total test scores for both the MELAB and the ECPE, and are shown in Table 3. The coefficient alpha can be considered the mean of all possible split-half coefficients. All reliability coefficients of subtests and test scores range from moderate (0.85) to high (0.95).
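To show how an internal consistency coefficient of this kind is obtained, the sketch below implements the standard coefficient alpha formula in Python on a simulated item-score matrix. It is not the procedure or software used in the study, and the data are purely illustrative.

```python
import numpy as np

def coefficient_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for an (examinees x items) matrix of item scores."""
    k = item_scores.shape[1]                          # number of items
    item_vars = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Illustrative data: 216 simulated examinees answering 50 dichotomous items
# that all depend on a single simulated ability.
rng = np.random.default_rng(1)
ability = rng.normal(size=(216, 1))
items = (ability + rng.normal(size=(216, 50)) > 0).astype(int)

print(round(coefficient_alpha(items), 2))
```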
Linear Correlations among Subtest Scores
It is expected that all subtest scores within each test would show some degree of correlation with one another, based on the assumption that the subtests measure general language proficiency. On the other hand, since each subtest measures different skills, it would be expected that the intercorrelations of subtests would not be very high. Pearson's correlation coefficient was used to analyze the relationship between subtest scores. Table 4 reports the intercorrelations among the subtests of the MELAB, and Table 5 summarizes the intercorrelations among the subtests of the ECPE. For the MELAB, the correlations between Composition scores and the rest of the subtest scores are very low due to the restriction of the scale range for the Composition score.
Exploratory Factor Analysis
Exploratory factor analysis without rotation (orthogonal solution) was used to extract the language proficiency factor underlying both MELAB and ECPE test items. Figures 1 and 2 show the scree plots of eigenvalues for the MELAB and ECPE, respectively, based on subtest scores. A similar pattern was observed for both tests. In each plot there was one large break following factor 1, and the plot then flattens out beginning with factor 2. This indicates that only factor 1 was dominant and accounted for meaningful variance, and that only this factor should be retained. The eigenvalues from the EFA for both the MELAB and the ECPE are given in Tables 6 and 7, respectively. For the MELAB, the first factor had an eigenvalue of 3.54 and accounted for approximately 60% of the common variance. For the ECPE, the first factor had an eigenvalue of 2.30 and accounted for more than 90% of the common variance. Hattie (1985) suggests evaluating unidimensionality with the difference of eigenvalues between the first and second factors divided by the difference of eigenvalues between the second and third. If the ratio is large (usually larger than 3), the first factor is relatively strong. The EFA results show that this ratio is high for both tests: 5.69 for the MELAB and 37.72 for the ECPE. Lord (1980) argues that a rough procedure for determining unidimensionality is the ratio of the first to the second eigenvalue, together with inspection of whether the second eigenvalue is not much larger than any of the others. Based on both criteria, the results in Tables 6 and 7 support the statement that there is only one meaningful, dominant factor in both the MELAB and the ECPE data.
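Both rules of thumb are straightforward to compute once the eigenvalues of the subtest correlation matrix are available. The sketch below uses the MELAB subtest intercorrelations reported in Table 4; note that eigenvalues of the unreduced correlation matrix, as computed here, will not exactly reproduce the common-factor eigenvalues in Table 6, so this is only an illustration of the two ratio criteria.

```python
import numpy as np

# MELAB subtest intercorrelations from Table 4 (CO, L, G, C, V, R).
R = np.array([
    [1.00, 0.17, 0.17, 0.16, 0.14, 0.20],
    [0.17, 1.00, 0.67, 0.62, 0.52, 0.58],
    [0.17, 0.67, 1.00, 0.69, 0.74, 0.60],
    [0.16, 0.62, 0.69, 1.00, 0.65, 0.70],
    [0.14, 0.52, 0.74, 0.65, 1.00, 0.58],
    [0.20, 0.58, 0.60, 0.70, 0.58, 1.00],
])

eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]                 # eigenvalues, largest first
hattie_ratio = (eigvals[0] - eigvals[1]) / (eigvals[1] - eigvals[2])
lord_ratio = eigvals[0] / eigvals[1]

print("eigenvalues:", np.round(eigvals, 2))
print("Hattie ratio (E1-E2)/(E2-E3):", round(hattie_ratio, 2))
print("first/second eigenvalue ratio:", round(lord_ratio, 2))
```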
Table 1. Descriptive Statistics of MELAB Total Test and Subtest Scores for All, Female, and Male Students
Note: Values in each row are for the Composition, Listening, Grammar, Cloze, Vocabulary, Reading, and Total Test scores, in that order.
All (N = 216):
Mean: 11.11 32.31 16.28 10.35 18.22 10.92 88.07
Std Dev: 2.68 8.32 6.36 3.79 6.54 4.00 24.45
Median: 11.00 33.00 16.00 10.00 19.00 10.00 86.00
Minimum: 4.00 5.00 3.00 1.00 2.00 2.00 28.00
Maximum: 19.00 49.00 30.00 20.00 30.00 20.00 147.00
Range: 15.00 44.00 27.00 19.00 28.00 18.00 119.00
Skewness: 0.47 -0.24 0.22 0.00 -0.12 0.12 0.19
Kurtosis: 0.36 -0.16 -0.64 -0.48 -0.84 -0.49 -0.39
Female (N = 147):
Mean: 10.95 32.71 16.45 10.61 18.36 11.27 89.40
Std Dev: 2.78 7.81 6.35 3.71 6.35 3.91 23.39
Median: 11.00 33.00 16.00 10.00 19.00 11.00 86.00
Minimum: 4.00 12.00 4.00 1.00 5.00 2.00 30.00
Maximum: 19.00 49.00 30.00 20.00 30.00 20.00 142.00
Range: 15.00 37.00 26.00 19.00 25.00 18.00 112.00
Skewness: 0.42 -0.29 0.32 -0.05 -0.08 0.07 0.27
Kurtosis: 0.06 -0.29 -0.66 -0.36 -0.82 -0.47 -0.42
Male (N = 69):
Mean: 11.46 31.45 15.91 9.78 17.93 10.16 85.23
Std Dev: 2.45 9.31 6.43 3.93 6.97 4.12 26.53
Median: 11.00 30.00 16.00 10.00 19.00 10.00 85.00
Minimum: 6.00 5.00 3.00 2.00 2.00 2.00 28.00
Maximum: 19.00 49.00 29.00 19.00 30.00 20.00 147.00
Range: 13.00 44.00 26.00 17.00 28.00 18.00 119.00
Skewness: 0.78 -0.09 0.03 0.12 -0.16 0.28 0.16
Kurtosis: 1.38 -0.09 -0.62 -0.60 -0.92 -0.40 -0.42
Table 2. Descriptive Statistics of ECPE Total Test and Subtest Scores for All, Female, and Male Students
Note: Values in each row are for the Speaking, Listening, Grammar, Cloze, Vocabulary, Reading, and Total Test scores, in that order.
All (N = 2011):
Mean: 3.20 39.23 21.80 12.56 17.45 15.47 97.14
Std Dev: 0.62 6.85 4.61 3.77 4.35 3.32 15.50
Median: 3.00 40.00 22.00 13.00 17.00 16.00 98.00
Minimum: 1.00 14.00 7.00 0.00 5.00 1.00 42.00
Maximum: 4.00 50.00 30.00 20.00 30.00 20.00 132.00
Range: 3.00 36.00 23.00 20.00 25.00 19.00 90.00
Skewness: -0.25 -0.77 -0.40 -0.32 0.27 -1.08 -0.39
Kurtosis: -0.16 0.19 -0.32 -0.45 -0.09 1.07 -0.03
Female (N = 1179):
Mean: 3.23 39.30 21.91 12.17 17.14 15.28 96.86
Std Dev: 0.61 6.67 4.50 3.74 4.31 3.28 15.30
Median: 3.00 40.00 22.00 12.00 17.00 16.00 98.00
Minimum: 1.00 16.00 8.00 0.00 5.00 3.00 42.00
Maximum: 4.00 50.00 30.00 20.00 30.00 20.00 132.00
Range: 3.00 34.00 22.00 20.00 25.00 17.00 90.00
Skewness: -0.23 -0.74 -0.40 -0.27 0.28 -1.00 -0.40
Kurtosis: -0.19 0.08 -0.23 -0.44 0.00 0.76 0.11
Male (N = 832):
Mean: 3.15 39.13 21.65 13.12 17.88 15.74 97.55
Std Dev: 0.64 7.09 4.75 3.74 4.37 3.36 15.78
Median: 3.00 40.00 22.00 14.00 18.00 17.00 99.00
Minimum: 1.00 14.00 7.00 1.00 5.00 1.00 46.00
Maximum: 4.00 50.00 30.00 20.00 30.00 20.00 132.00
Range: 3.00 36.00 23.00 19.00 25.00 19.00 86.00
Skewness: -0.25 -0.80 -0.38 -0.41 0.25 -1.21 -0.38
Kurtosis: -0.15 0.28 -0.44 -0.42 -0.20 1.59 -0.20
Table 3. Internal Consistency of MELAB and ECPE Tests and Subtests (Coefficient Alpha)
Subtest/Test    MELAB    ECPE
Listening        .87      .85
GCVR             .94      .90
Total Test       .95      .92
Table 4. Intercorrelations of Raw Scores of MELAB Subtests for Total Sample
Subtest            CO    L     G     C     V     R
Composition (CO)  1.00
Listening (L)     0.17  1.00
Grammar (G)       0.17  0.67  1.00
Cloze (C)         0.16  0.62  0.69  1.00
Vocabulary (V)    0.14  0.52  0.74  0.65  1.00
Reading (R)       0.20  0.58  0.60  0.70  0.58  1.00

Table 5. Intercorrelations of Raw Scores of ECPE Subtests for Total Sample
Subtest          S     L     G     C     V     R
Speaking (S)    1.00
Listening (L)   0.37  1.00
Grammar (G)     0.43  0.61  1.00
Cloze (C)       0.29  0.52  0.62  1.00
Vocabulary (V)  0.29  0.39  0.58  0.53  1.00
Reading (R)     0.20  0.53  0.46  0.51  0.38  1.00
Figure 1. MELAB Factor Scree Plot (eigenvalues plotted against factor number).

Figure 2. ECPE Factor Scree Plot (eigenvalues plotted against factor number).
Table 6. Eigenvalues and Common Variance Explained by the Factors of MELAB Test
Factor   Eigenvalue   Difference*   % of Variance   Cumulative %
1        3.54         2.55          59.05            59.05
2        0.99         0.46          16.50            75.55
3        0.53         0.08           8.83            84.38
4        0.45         0.20           7.56            91.94
5        0.26         0.03           4.30            96.24
6        0.23                        3.76           100.00
*Ratio of difference of eigenvalues: (E1-E2)/(E2-E3) = 5.69.

Table 7. Eigenvalues and Common Variance Explained by the Factors of ECPE Test
Factor   Eigenvalue   Difference*   % of Variance   Cumulative %
1        2.30         2.20          91.68            91.68
2        0.10         0.06           4.04            95.73
3        0.04         0.00           1.61            97.34
4        0.04         0.02           1.44            98.78
5        0.02         0.01           0.83            99.60
6        0.01                        0.40           100.00
*Ratio of difference of eigenvalues: (E1-E2)/(E2-E3) = 37.72.
Confirmatory Factor Analysis
Evaluation of Model Fit
First, for the purpose of validating the factorial structure of each test, a CFA model was investigated. Second, for the purpose of cross-validation, subjects were randomly split into two groups to form a base calibration sample and a validation sample. Figures 3 and 4 present the one-factor linear models tested using AMOS for the MELAB and the ECPE across the original and validation samples. The model-fit statistics for the different samples are summarized in Tables 8 (MELAB) and 9 (ECPE). For the full and cross-validation samples, the majority of values satisfy the Hu and Bentler criteria for the four fit statistics CFI, GFI, NFI, and TLI. All values satisfy the 0.90 benchmark criterion except the TLI for the ECPE base calibration sample. All SRMR values indicate that the data fit the model, while the RMSEA values indicate that model fit is not good. All chi-square statistics are significant. Taking all of the model fit indices together, the MELAB and the ECPE models fit quite well and are quite comparable for the base calibration and validation samples.
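For readers who want to reproduce the residual-based index from the reported chi-square values, the sketch below computes RMSEA from χ2, degrees of freedom, and sample size using the standard point-estimate formula, RMSEA = sqrt(max(χ2 − df, 0) / (df (N − 1))). The numbers plugged in are the full-sample values reported in Tables 8 and 9; the other indices (CFI, GFI, NFI, TLI) cannot be recovered this way because they also require the independence-model chi-square.

```python
import math

def rmsea(chi_sq: float, df: int, n: int) -> float:
    """Point estimate of the root mean square error of approximation."""
    return math.sqrt(max(chi_sq - df, 0.0) / (df * (n - 1)))

# Full-sample values reported in Tables 8 (MELAB) and 9 (ECPE).
print(round(rmsea(31.97, 9, 216), 2))    # MELAB full sample -> approximately .11
print(round(rmsea(250.11, 9, 2011), 2))  # ECPE full sample  -> approximately .12
```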
Figure 3. Structure of MELAB Tested with Full Sample. (One-factor model; standardized loadings on the Proficiency factor: Composition .21, Listening .75, Grammar .86, Cloze .83, Vocabulary .79, Reading .76.)

Figure 4. Structure of ECPE Tested with Full Sample. (One-factor model; standardized loadings on the Proficiency factor: Speaking .46, Listening .72, Grammar .84, Cloze .76, Vocabulary .66, Reading .61.)
Table 8. Summary of Fit Indices of One-Factor Model of MELAB for Full and Cross-validation Samples
Sample              N    df   χ2      CFI   GFI   NFI   TLI   RMSEA   SRMR
All                 216  9    31.97   .96   .95   .95   .94   .11     .03
Base Calibration    108  9    15.38   .98   .95   .95   .97   .08     .05
Validation          108  9    25.82   .93   .91   .96   .91   .11     .05
Table 9. Summary of Fit Indices of One-Factor Model of ECPE for Full and Cross-validation Samples
Sample              N     df   χ2       CFI   GFI   NFI   TLI   RMSEA   SRMR
Full                2011  9    250.11   .95   .96   .94   .91   .12     .04
Base Calibration    1006  9    160.51   .93   .95   .93   .88   .13     .05
Validation          1005  9    102.91   .96   .97   .95   .93   .10     .04
Test of Factorial Structure Equivalence across Gender Samples
The goodness-of-fit indices for a series of nested tests of different degrees of equivalence of the factorial structure across gender under a one-factor model are presented in Tables 10 and 11 for the MELAB and the ECPE, respectively. The specified parameters for each condition were constrained to be equal for both genders. The equivalence of the factor loadings and variances of three factor models (parallel, τ-equivalent, and congeneric) was tested by placing different constraints (equal loadings or variances) on the compared models. Two tests are said to be psychometrically parallel if they have equal factor loadings and equal specific variances. If two tests have the same factor loadings but different variances, they are τ-equivalent. Congeneric tests have similar factor loadings and variances, but not necessarily to the same degree (Byrne, Shavelson, & Muthén, 1989; Jöreskog & Sörbom, 1979; Loehlin, 2004; Lord, 1957). Some fit values satisfy the Hu and Bentler criteria for the four fit statistics CFI, GFI, NFI, and TLI, and some do not, but all values satisfy the 0.90 benchmark criterion. Both the RMSEA and SRMR indices show that the data fit the model based on the 0.10 and 0.05 criteria. None of the χ2 differences between nested models is statistically significant. When selecting among the three models tested, a statistically nonsignificant difference in χ2 suggests that the stronger (more constrained) model holds. The parallel model showed the best fit to the data, which demonstrates that the models for male and female students are equivalent in structure, factor loadings, and variances.
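The nested-model comparisons above rest on a chi-square difference test: the difference in χ2 between two nested models is itself distributed as χ2 with degrees of freedom equal to the difference in df. The sketch below checks this for the MELAB values reported in Table 10; it is a hedged illustration using scipy rather than the AMOS output itself.

```python
from scipy.stats import chi2

def chi_sq_difference(chi_more, df_more, chi_less, df_less, alpha=0.05):
    """Chi-square difference test for two nested models (more vs. less constrained)."""
    d_chi = chi_more - chi_less
    d_df = df_more - df_less
    p = chi2.sf(d_chi, d_df)
    return d_chi, d_df, round(p, 3), p > alpha   # True -> the added equality constraints hold

# MELAB values from Table 10: congeneric (54.32, df 19), tau-equivalent (56.82, df 24),
# parallel (58.87, df 30).
print(chi_sq_difference(56.82, 24, 54.32, 19))   # tau-equivalent vs. congeneric
print(chi_sq_difference(58.87, 30, 56.82, 24))   # parallel vs. tau-equivalent
```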
Table 10. Test of Factorial Equivalence of One-Factor Model for MELAB across Gender
Sample               N    df   χ2      CFI   GFI   NFI   TLI   RMSEA   SRMR
Congeneric (I)       216  19   54.32   .95   .93   .92   .91   .09     .07
Tau-equivalent (II)  216  24   56.82   .95   .93   .92   .94   .08     .07
Parallel (III)       216  30   58.87   .95   .92   .91   .96   .07     .07
The levels of model constraints that were constrained to be equal across gender are: I. Model structure and latent variable variance. II. Model structure, latent variable variance, and factor loading. III. Model structure, latent variable variance, factor loading, and unique variance.
Table 11. Test of Factorial Equivalence of One-Factor Model for ECPE across Gender
Sample               N     df   χ2       CFI   GFI   NFI   TLI   RMSEA   SRMR
Congeneric (I)       2011  19   248.00   .95   .96   .94   .92   .08     .04
Tau-equivalent (II)  2011  24   248.79   .95   .96   .94   .94   .07     .04
Parallel (III)       2011  30   266.78   .95   .96   .94   .95   .06     .04
The levels of model constraints that were constrained to be equal across gender are: I. Model structure and latent variable variance. II. Model structure, latent variable variance, and factor loading. III. Model structure, latent variable variance, factor loading, and unique variance.
Summary
This study examined the internal construct of the MELAB and ECPE tests. For the MELAB, although the speaking section data were not available at the time of this study, the results on the overall internal structure are informative and provide insights into the construct validity of the test. In spite of the missing writing section data, the results also show a clear picture of the internal structure of the ECPE. The one-factor, or one-dimensional, model postulated and tested here supports the claim that the total score of the MELAB measures “proficiency in English as a second language for academic study” (English Language Institute, 2003) and the claim that the total score of the ECPE measures English language proficiency for admission to North American colleges and universities. The results also show that the internal structures of the MELAB and the ECPE are equivalent across male and female examinees, which implies that the two tests are fair across gender groups. In summary, this study underscores the importance of empirical validation of language tests and provides evidence supporting the validity and fairness of the widely used MELAB and ECPE language exams. It carries the validation process beyond the content-related evidence that often serves as the sole documented support of validity for language exams.
References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: AERA, APA, NCME.
Arbuckle, J. L., & Wothke, W. (1999). Amos 4.0 user's guide. Chicago: SmallWaters Corporation.
Beauducel, A., & Wittmann, W. (2005). Simulation study on fit indices in confirmatory factor analysis based on data with slightly distorted simple structure. Structural Equation Modeling, 12(1), 41–75.
Bentler, P. M. (1995). EQS: Structural equations program manual. Encino, CA: Multivariate Software, Inc.
Byrne, B. M. (2001). Structural equation modeling with AMOS. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Byrne, B. M., Shavelson, R. J., & Muthén, B. (1989). Testing for equivalence of factor covariance and mean structures: The issues of partial measurement invariance. Psychological Bulletin, 105(3), 456–466.
Cronbach, L. (1971). Validity. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443–597). Washington, DC: American Council on Education.
English Language Institute, University of Michigan. (1994). The Michigan English language assessment battery technical manual: 1994. Ann Arbor, MI: English Language Institute, University of Michigan.
English Language Institute, University of Michigan. (2003). The Michigan English language assessment battery technical manual: 2003. Ann Arbor, MI: English Language Institute, University of Michigan.
Fan, X., & Sivo, S. A. (2005). Sensitivity of fit indices to misspecified structural or measurement model components: Rationale of two-index strategy revisited. Structural Equation Modeling, 12(3), 343–367.
Guilford, J. P. (1946). New standards for test evaluation. Educational and Psychological Measurement, 6, 427–439.
Hattie, J. (1985). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9, 139–164.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55.
Jiao, H. (2004). Evaluating the dimensionality of the Michigan English language assessment battery. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 2, 27–52. Ann Arbor, MI: English Language Institute, University of Michigan.
Jöreskog, K. G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36, 409–426.
Jöreskog, K. G., & Sörbom, D. (1979). Advances in factor analysis and structural equation models. Cambridge, MA: Abt Books.
Lord, F. M. (1957). A significance test for the hypothesis that two variables measure the same trait except for error of measurement. Psychometrika, 22, 20.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Marsh, H. W., Hau, K.-T., & Wen, Z. (2004). In search of golden rules: Comment on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and Bentler’s (1999) findings. Structural Equation Modeling, 11(3), 320–341.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: American Council on Education, Macmillan.
Messick, S. (1995). Standards-based score interpretation: Establishing valid grounds for valid inferences. In Proceedings of the joint conference on standard setting for large-scale assessments of the National Assessment Governing Board (NAGB) and the National Center for Education Statistics (NCES) (Vol. 2, pp. 291–305). Washington, DC: National Assessment Governing Board and National Center for Education Statistics.
Saito, Y. (2003). Investigating the construct validity of the cloze section in the Examination for the Certificate of Proficiency in English. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 1, 39–82. Ann Arbor, MI: University of Michigan English Language Institute.
Wagner, E. (2004). A construct validation study of the extended listening sections of the ECPE and MELAB. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 2, 1–25. Ann Arbor, MI: University of Michigan English Language Institute.
Yuan, K. H. (2005). Fit indices versus test statistics. Multivariate Behavioral Research, 40(1), 115–148.
Evaluating the Use of Rating Scales in a High-Stakes Japanese University Entrance Examination Christopher Weaver Tokyo University of Agriculture and Technology
Rating scales provide test writers and graders with a unique opportunity to provide partial credit to test takers who are en route to targetlike performances. The actual performance of rating scales, however, is an often overlooked aspect of testing. Using Rasch measurement theory, this paper investigates the psychometric properties of three sets of rating scales used in the English section of a high-stakes university entrance examination in Tokyo, Japan. This investigation found that reducing the number of categories in two of the three rating scales resulted in improved performance. Different grading criteria and the demands of different question types were also found to mediate rating scale performance. An analysis of the collective performance of these rating scales revealed not only how well these rating scales defined test takers’ level of communicative competence, but also the extent to which revised rating scales could improve the overall performance of this subsection of the entrance examination. The paper closes with some recommendations on how to optimize rating scales as well as some suggestions for future investigations focusing upon the use of rating scales in assessment situations.
Valid and reliable test scores rely upon a number of different elements ranging from quality test items to conscientious and consistent grading procedures. An often overlooked element is, however, the construction and the performance of rating scales used to assess test takers’ abilities. Rating scales provide test designers and examination graders with a unique opportunity to award partial credit to test takers who are en route to targetlike performances. The challenge of using rating scales is determining how many steps compose a meaningful continuum from nonperformance to targetlike performance. Once a range of scores has been chosen, evaluating the actual performance of the rating scales is imperative. This line of inquiry not only investigates how well the rating scales detect differences amongst test takers, but also offers suggestions on how to improve rating scale performances in order to provide more reliable and valid test scores. This paper exemplifies this process with a set of the rating scales used in the English section of the entrance examination at a national university in Tokyo, Japan. The performance of these rating scales is important for a number of reasons. Because national universities in Japan have substantially lower tuition costs than private universities, the number of applicants exceeds the number of positions available at these institutions. As a result, the scores arising from rating scales used in the English section of the entrance examination play a significant role in determining which applicants will receive postsecondary education subsidized by the Japanese government. In addition, the Japanese Ministry of Education, Culture, Sports, Science, and Technology (2003) has directed universities to place a higher priority on assessing applicants’ English communicative competence in entrance examinations.
Question types designed to meet this directive often rely upon rating scales to determine applicants’ current level of communicative competence. Thus, it is imperative that rating scales used in university entrance examinations perform at an optimal level. Focusing upon the internal characteristics of a rating scale (i.e., the number of points allocated to a scale and its actual performance in terms of defining applicants’ level of English communicative competence) provides a perspective distinct from previous research. Many investigators (e.g., Kondo-Brown, 2002; Myford & Wolfe, 2003) have focused their attention upon the effects of rater severity and/or bias. Few, however, have considered the actual performance of the scales in their investigations. In addition, there has been little consideration regarding how well rating scales define constructs such as English communicative competence. This area of research is extremely important considering that rating scales not only play an important part in determining test takers’ current level of communicative competence, but they are also used to define the cut point for university admissions. This investigation is also unique in that it analyzes the English section of a university entrance examination. While English entrance examinations have had a longstanding role in admission policies in Japanese postsecondary education, there have been few empirical studies (e.g., Ito, 2005; Shizuka, Takeuchi, Yashima, & Yoshizawa, 2006) investigating the validity and reliability of these examinations. Rasch measurement theory can provide a number of valuable insights when investigating rating scale performance. Usually the focus of a Rasch analysis centers upon the relationship between person ability and item difficulty. A Rasch analysis facilitates this type of comparison because both person ability estimates and item difficulty estimates are placed upon the same logit scale. This common scale in turn makes it possible to formulate probabilistic statements concerning test takers’ chances of correctly answering a particular item on a test. This information, of course, can be informative when determining cut points for an entrance examination (Weaver, 2005). When a test utilizes rating scales, a Rasch analysis can also provide a more comprehensive account of person ability with parameter estimates. These estimates define person ability in terms of the probability of a test taker passing through the different thresholds of a rating scale. In other words, Rasch measurement theory examines rating scale performance from two interrelated perspectives: (a) the number of points on a scale, and (b) the thresholds that separate these points. Figure 1 is a graphical representation of a well-performing 4-point rating scale. Before progressing further, it should be noted that the different points on a rating scale can also be referred to as categories. Thus, a rating scale where test takers can receive a score ranging from 0 to 3 points has four categories: 0, 1, 2, and 3. Another important detail is that the value of the different points on a rating scale does not necessarily need to be equal. In some cases, the points on a rating scale may have different point values in order to reward test takers who progress further along the rating scale. In this situation, certain points on the rating scale represent performance thresholds that have weighted point values (Wright & Masters, 1982).
To simplify the present investigation of rating scales used in a high-stakes entrance examination, the different points on the rating scales are treated as equal to the point values awarded to test takers.
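To make the point about the shared logit scale concrete, the minimal sketch below computes the Rasch probability of a correct answer on a dichotomous item, exp(B − D) / (1 + exp(B − D)), for a few illustrative ability-minus-difficulty gaps. The gap values are invented and are not estimates from the examination analyzed here.

```python
import math

def p_correct(ability_logit: float, difficulty_logit: float) -> float:
    """Rasch probability of a correct response on a dichotomous item."""
    gap = ability_logit - difficulty_logit
    return math.exp(gap) / (1 + math.exp(gap))

# Illustrative gaps between person ability and item difficulty (in logits).
for gap in (-2.0, 0.0, 1.0, 2.0):
    print(f"ability - difficulty = {gap:+.1f} logits -> P(correct) = {p_correct(gap, 0):.2f}")
```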
Figure 1. A Well-Performing 4-Point Rating Scale.
The numbers above the different probability curves in Figure 1 show that this scale ranges from 0 to 3 points. The vertical dashed arrows indicate the scale’s three thresholds. Each threshold represents the 50% probability of test takers receiving one score over another (Wright & Masters, 1982). The first threshold for this scale is located at -2.2 logits on the x-axis. Test takers at this location, which means that the item’s level of difficulty exceeds their person ability estimate by 2.2 logits, have a 50% chance of receiving 0 or 1 on that item. If test takers’ person ability estimates exceed an item’s level of difficulty by 0.2 logits, then they are at the second threshold, where there is a 50% chance of receiving either 1 or 2 points. The final threshold requires test takers to have a person ability estimate that exceeds the item’s level of difficulty by 2 logits to have a 50% chance of receiving 2 or 3 points on this scale. The monotonic ordering of the different probability curves and the thresholds suggests that this particular scale is performing at an optimum level. In other words, Rasch measurement theory asserts that test takers’ chances of receiving a higher score on a rating scale increase as the test takers’ person ability estimate exceeds the item’s level of difficulty, as is the case in Figure 1.
Deriving the probabilistic relationship between person ability, item difficulty, and the different thresholds in a rating scale relies upon Andrich’s (1978) rating scale model. Defined within the context of the English section of a university entrance examination, the rating scale model estimates the probability of test takers receiving a particular score on any examination question using a rating scale as a function of the test takers’ level of communicative competence (B_n) and the difficulty of the test item (D_i) at a given threshold (F_k). This probability (P_nik), expressed as a formula, is:
P_{nik} = e^{(B_n - D_i - F_k)} / [1 + e^{(B_n - D_i - F_k)}]
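The threshold formulation above can be turned into category probability curves like those in Figure 1. The sketch below uses the standard Andrich rating scale model parameterization, in which category probabilities are built from cumulative sums of (B − D − F_k) terms, with the three Figure 1 threshold values (-2.2, 0.2, and 2.0 logits) plugged in as illustrative F_k estimates. It is a hedged reimplementation for illustration, not the WINSTEPS computation reported in this paper.

```python
import numpy as np

def category_probabilities(b_minus_d: float, thresholds: list[float]) -> np.ndarray:
    """Andrich rating scale model: probability of each category 0..m
    for a given person-ability minus item-difficulty value (in logits)."""
    # Category 0 has exponent 0; category k adds (B - D - F_j) for each threshold j up to k.
    exponents = np.concatenate(([0.0], np.cumsum(b_minus_d - np.asarray(thresholds))))
    numerators = np.exp(exponents)
    return numerators / numerators.sum()

thresholds = [-2.2, 0.2, 2.0]   # the three thresholds read off Figure 1

# At each threshold, the two adjacent categories are equally likely
# (each about 50%, conditional on being in one of those two categories).
for gap in (-3.0, -2.2, 0.2, 2.0, 3.0):
    probs = category_probabilities(gap, thresholds)
    print(f"B - D = {gap:+.1f}:", np.round(probs, 2))
```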
Similar to other data-model techniques, the rating scale model along with other Rasch models utilizes an iterative process to generate estimates of person ability, item difficulty, and category thresholds. The iteration process begins with an initial approximation of the item difficulties. These estimates then inform the first round of person ability estimates. This iterative process is repeated again in order to reduce the size of the residuals existing between the data and the model expectations of what the data should look like. The WINSTEPS program (Linacre, 2004c) continues this iteration process until the residuals are brought to a level of 0.5, which translates into the largest change of person ability and item difficulty estimates being less than 0.01 logits (Bond & Fox, 2001, p. 183). The remaining residuals inform the fit statistics, which are quality control measures to assess the extent to which the data collected conforms to the expectations of the Rasch measurement model. Research Questions Using Rasch measurement theory, this paper investigates the psychometric properties of three sets of rating scales used in the English section of a university entrance examination. This focal point can provide valuable insights when designing, implementing, evaluating, and in some cases revising rating scales. Moreover, it can lead to an interesting analysis that focuses upon how different types of rating scales collectively define test takers’ level of communicative competence. The end goal of this investigation is to optimize rating scale performance in order to produce more valid and reliable entrance examination scores. The following research questions guide this investigation: 1. To what extent do the three sets of rating scales conform to the expectations of the Rasch measurement model? 2. How well do the three sets of rating scales collectively define test takers’ level of communicative competence? Methods Participants The participants were 1108 out of 1665 Japanese students who attempted a particular subsection of questions on a university entrance examination. The average Japanese student ready to enter a university has 6 years of foreign language instruction, with opportunities to use English occurring largely inside the classroom. Foreign language instruction in high school predominately focuses upon receptive language skills; however, communicativeorientated language use is beginning to appear in the high school curriculum. The Items The three sets of rating scales under investigation belong to one subsection of an English entrance examination. This subsection featured a figure accompanied by three sets of questions. The first set of questions had 3 items (items 1.1, 1.2, and 1.3) that required test takers to provide a short written response to three questions about the figure. Test takers’ responses were graded according to their communicative value on a 3-point rating scale. The second set of questions involved 4 items (items 2.1, 2.2, 2.3, and 2.4) that featured a four-line two-way dialogue related to the figure. Test takers had to read the dialogue and write a short response in English based on the dialogue and the figure. These items were graded using a 4point rating scale. Test takers were penalized for grammatical and spelling errors. The third
set of questions required test takers to write a series of English sentences using information from the figure. These sentences (items 3.1, 3.2, and 3.3) were graded with a 5-point scale. Test takers were penalized for misinterpreting the figure, grammatical errors, and spelling mistakes. Test takers’ level of communicative competence is thus defined by their ability to comprehend the different questions and dialogues in this subsection of the entrance examination as well as their ability to interpret and write about the figure in English, which was then assessed using a variety of different grading criteria. Analyses As previously mentioned, the Rasch approach to measurement has a number of expectations concerning rating scale performance. The monotonic progression of the probability curves and the ordering of the thresholds illustrated in Figure 1 are just two of these expectations. This paper’s first research question, the extent to which the three sets of rating scales conform to the expectations of the Rasch measurement model, involves a systematic investigation of the different properties of a rating scale’s performance. These properties in turn not only provide a detailed analysis of individual rating scales, but also suggest how they might be revised for optimal performance in future entrance examinations. Category Frequencies and Distributional Properties of a Rating Scale The first property of a well-performing rating scale is the frequency with which the different categories are used. Each category should be used at least 10 times (Linacre, 1999). This benchmark ensures that there are enough occurrences to produce stable threshold values. Categories with infrequent occurrences suggest redundancy in the scale that can be addressed by combining the scale into fewer categories. The second property concerns the distributional properties of the scale. Ideally, the distribution should be uniform across the different categories in the scale (Bond & Fox, 2001). Other meaningful distributions include unimodal distributions peaked at the central point of the scale or bimodal distributions peaking at extreme categories (Linacre, 2004a). Distributions that are highly skewed usually indicate categories with low frequency counts that in turn threaten the chances of achieving stable threshold estimates. The distributional properties of the scale also give researchers a rough idea about the relative difficulty that a set of items posed for test takers. If the distribution is negatively skewed with many ratings occurring at the top categories of the scale, test takers did not experience much difficulty with the items. The opposite is true for a set of items that has many ratings occurring at the bottom of the scale. Displaying the frequency counts as overall percentages also provides a quick means of assessing the frequency of the different categories in a rating scale. Average Observed Measure The third property focuses on the average observed measure across the different categories in a rating scale. The average observed measure is the average ability measure for all test takers who receive a particular score on a rating scale (Linacre, 2004c). A concrete example may illustrate how this particular property can be illuminating when assessing rating scale performance. Let’s consider a 3-point rating scale with possible scores of 0, 1, and 2.
If the observed average measure for the lowest category of a scale is -0.17, that -0.17 is the average ability estimate for all the test takers who scored 0 out of 2. This information allows researchers to reflect on the relationship between the level of difficulty of each step in
a rating scale and test takers’ overall level of ability, e.g., communicative competence. Thus, test takers who score 0 have an ability estimate that is slightly below the average difficulty of all the items on the entrance examination (set at 0 logits). This starting point is reassuring because it is consistent with the Rasch measurement assertion that test takers with low levels of communicative competence should receive lower scores on a rating scale. The Rasch model also expects that the observed average measure for the next category, a score of 1, should be higher. This hierarchical ordering reflects the underlying argument of Rasch measurement theory that as test takers move along the continuum of communicative competence, their chances of receiving a higher score increase. Disruptions in the expected monotonic progression thus give researchers a focal point to further investigate a rating scale’s performance. Continuing with the example of a 3-point scale, let’s say the observed average measure for category 1 is 0.87 logits; however, the observed average measure for category 2 is 0.27. This disordering should prompt researchers to further investigate why test takers with a higher average level of communicative competence received fewer points than test takers with a lower average level of communicative competence. One possible explanation may be that the item features a grammatical rule that higher ability test takers often overgeneralize. Another possibility may be that raters are more severe with more proficient test takers (e.g., Schaefer, 2004). Once again, collapsing the scale is one possible means of addressing this issue. However, it is advisable to investigate the potential source(s) of the disordering in order to gain a deeper understanding of how the rating scale is performing before beginning to combine the categories. Category Thresholds The fourth property involves a number of interrelated matters concerning category thresholds. Similar to average observed measures, the category thresholds should also increase monotonically along the continuum of communicative competence. In other words, test takers’ progress along the communicative competence continuum should be mirrored by an increased possibility of their passing through the different thresholds toward higher scores on the scale. Once this monotonic progression is assured, the spacing between the different category thresholds is the next focal point. This spacing is important when determining whether or not the different categories in the scale capture distinct steps along the communicative competence continuum. Linacre (1999) suggests that the spacing should surpass 1.4 logits, but not exceed 5 logits. In the latter case, vast distances between thresholds may create holes when defining the construct of communicative competence. The spacing between thresholds should also be roughly equivalent in order to provide a consistent assessment of test takers’ level of communicative competence. Category Fit The fifth property relates to the fit statistics accompanying each category in a rating scale. This quality control measure provides an indication of the fit between the expectations of the Rasch model and the different scores awarded on a rating scale. Linacre (1999) suggests that categories with an outfit mean square exceeding 2 contribute more noise than information to the measurement process.
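These category-level checks lend themselves to simple scripting once the rating data and Rasch output are in hand. The sketch below is a hypothetical illustration (the function, variable names, and toy data are assumptions, not output from this examination): it tabulates category counts and percentages, computes the average observed measure for each category from person ability estimates, and flags low-frequency categories, nonmonotonic average measures, and outfit mean squares above 2; the final lines check that adjacent thresholds advance by at least 1.4 but less than 5 logits.

```python
from collections import Counter

def rating_scale_diagnostics(scores, abilities, outfit_mnsq, min_count=10, max_outfit=2.0):
    """Screen one rating scale against several of the properties discussed above.

    scores      : list of category scores awarded (one per response)
    abilities   : list of person ability estimates (logits), parallel to `scores`
    outfit_mnsq : dict mapping category -> outfit mean square (from a Rasch program)
    """
    counts = Counter(scores)
    total = len(scores)
    categories = sorted(counts)
    report = {}
    for k in categories:
        persons = [a for s, a in zip(scores, abilities) if s == k]
        report[k] = {
            "count": counts[k],
            "percent": round(100 * counts[k] / total, 1),
            "avg_observed_measure": round(sum(persons) / len(persons), 2),
            "low_frequency": counts[k] < min_count,              # property 1
            "misfitting": outfit_mnsq.get(k, 0.0) > max_outfit,  # property 5
        }
    # Property 3: average observed measures should increase with the category score.
    avgs = [report[k]["avg_observed_measure"] for k in categories]
    report["monotonic_averages"] = all(x < y for x, y in zip(avgs, avgs[1:]))
    return report

# Toy example for a 3-point (0-2) scale with hypothetical person measures.
scores = [0, 1, 2, 2, 1, 2, 0, 2, 2, 1, 2, 2]
abilities = [-1.1, -0.2, 0.9, 1.4, 0.1, 0.7, -0.6, 1.8, 1.1, -0.4, 0.8, 1.5]
print(rating_scale_diagnostics(scores, abilities, outfit_mnsq={0: 2.5, 1: 0.4, 2: 1.0}))

# Property 4: adjacent thresholds should advance by at least 1.4 but less than 5 logits.
thresholds = [-0.47, 0.47]          # structure calibrations from a Rasch program
gaps = [b - a for a, b in zip(thresholds, thresholds[1:])]
print([1.4 <= g < 5 for g in gaps])
```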
Dealing with Nonperforming Rating Scales Rating scale performances in actual testing situations such as university entrance examinations may fail to exhibit some of the properties discussed above. In this situation, there are two strategies to help the data fit the expectations of the Rasch model. The first strategy involves deleting data from categories that had low frequency counts (Linacre, 1999; 2004a). In the context of university entrance examinations, omitting data is simply not a viable option for a number of reasons. The second and more practical strategy is combining categories together in order to create a more robust structure of high frequency categories. This strategy can thus help a rating scale come closer to the expectations of the Rasch model. There are, however, no steadfast rules, but rather general principles and practical considerations that guide how and which categories should be collapsed together (Bond & Fox, 2001). Thus, any results derived from combined categories should be verified with another administration of the test items using the revised rating scales (Smith, Wakely, deKruif, & Swartz, 2003). The Combined Effect of Rating Scales Thus far the focus of this paper has been upon the performance of individual rating scales. However in many testing situations, a collection of different rating scales may be utilized to assess test takers’ abilities. The combined effect of different rating scales is the central concern of this paper’s second research question, the extent to which the three sets of rating scales collectively define test takers’ level of communicative competence. The performance of different sets of rating scales can be assessed from two perspectives. The first perspective is the range of person ability that the different rating scales collectively define. Ideally, the rating scales should cover a large enough range so that it is possible to differentiate statistically distinct groups within the population of test takers. The second perspective involves the concentration of categories around important levels of person ability such as the probable cut-score for university admissions. A higher concentration ensures that the standard error estimates accompanying person ability estimates and item difficulty estimates around the probable cut-score are as small as possible. The end result is thus a more concise picture of measurement accuracy derived from test takers’ responses on the entrance examination. Results Analysis of the First Set of Questions and the 3-Point Rating Scale The first set of items utilized a 3-point rating scale to score test takers’ responses to three questions about a figure. Test takers’ responses were graded on the basis of whether or not specific features on the figure could be identified from what they wrote. Table 1 shows how this rating scale performed using this communicative-orientated grading criterion. The observed counts of the different categories reveal that 73% of the test takers received full marks on this set of items. The observed averages for each category also show a misordering (noted with an asterisk) for category 1. Thus, test takers’ who received a score of 1 had a lower person ability estimate than test takers who received a score of 0. In addition, the category thresholds are misordered with the first threshold occurring at 0.14 logits and the second one at -0.14 logits. This reversed order is related to the fact that the category of 1 is
never the most probable score for test takers. Figure 2 illustrates this finding, with the probability curve for a score of 1 never peaking above the other categories in the scale.

Table 1. The Performance of the 3-Point Rating Scale using a Communicative-Orientated Grading Criterion

category label    observed count    percent    observed averages    structure calibration    outfit MNSQ
0                 349               11%        0.75                 none                     2.50
1                 552               17%        0.61*                0.14                     0.40
2                 2384              73%        1.87                 -0.14                    1.01
Figure 2. The Probability Curves for the 3-Point Rating Scales
The differences between the category thresholds also do not exceed the benchmark of 1.4 logits, suggesting that the different steps in the rating scale do not reflect distinguishable levels of communicative competence. Finally, the outfit mean square for the first category exceeds 2, with a value of 2.50. Figure 3 illustrates how the observed occurrences for category 0 depart from what is expected. The Rasch model projected that test takers whose ability level either equaled or exceeded this set of items’ level of difficulty by one logit would receive a score of zero. However, these test takers had a greater chance of receiving some credit on these items. This finding suggests that there is more misinformation than information being introduced into the measurement of these test takers’ level of communicative competence.
Figure 3. The Mismatch between the Expected and Observed Scores for Category 0. The poor performance of this 3-point rating scale, however, may be the result of the different grading criteria used in this subsection of the entrance examination. The first set of questions utilized a communicative grading criterion, whereas the other two sets of questions also considered the grammatical accuracy of test takers’ responses. As a result, the performance of the 3-point rating scale may have been unduly influenced by this difference in grading criteria. One area of concern is the calculation of the observed measures for the different categories. As it stands now, the observed average measure for each category is based upon all of the items in the subsection of the entrance examination. Thus, it is possible that a test taker may have received higher scores on the first set of items, which uses the communicative grading criterion, but lower scores on the other two sets of items because they also take into consideration grammatical accuracy. This inconsistency between the different grading criteria might thus explain the nonhierarchical nature of the observed average measures in the 3-point rating scale. In order to investigate this possibility, the 3-point rating scale was analyzed using only the test takers’ responses from the first set of questions. Table 2 shows the performance of this rating scale independent of the other two rating scales used in this subsection of the entrance examination.
Table 2. The Performance of the 3-Point Rating Scale Analyzed Separately from the Two Other Rating Scales

category label    observed count    percent    observed averages    structure calibration    outfit MNSQ
0                 340               16%        -0.26                none                     1.71
1                 552               26%        0.24                 -0.47                    0.41
2                 1259              59%        2.25                 0.47                     1.07
Analyzing the 3-point rating scale in isolation reveals a much improved performance. Table 2 shows a drop in the number of times that the test takers received full scores for their responses. The observed average measures of the test takers now increase in a monotonic manner starting at -0.26 logits and continuing to 2.25 logits. The category thresholds are also monotonically ordered with the probability curves for each category now having its own distinct peak (as illustrated in Figure 4). Although the distance between the category thresholds does not reach the benchmark of 1.4 logits, the category thresholds are 0.96 logits apart, an improvement of 0.68 logits. Finally, all of the categories now have outfit mean squares below 2.
Figure 4. The Probability Curves for the 3-Point Rating Scales Analyzed Separately from the Other Two Rating Scales.
These results, however, need to be interpreted cautiously because the number of test takers excluded from the analysis dramatically increases when the 3-point rating scale is analyzed in isolation. This exclusion relates to the fact that the Rasch model requires at least one item to be beyond or below the test takers’ current level of ability to yield a person ability estimate (Linacre, 2004b). An independent analysis of the 3-point scale thus results in an exclusion of 391 test takers with 383 of them receiving a perfect score on the first set of questions and the remaining eight test takers receiving no credit for their responses. This number is in sharp contrast to the 15 test takers excluded when the analysis involves all three sets of questions with their accompanying rating scales. The exclusion of test takers is problematic because the person ability estimates used to calculate the probability curves in this independent analysis are less representative of the larger population of test takers. Thus, the seemingly improved performance of the 3-point scale must be weighed against the fact that a fewer number of test takers are included in the analysis. The relatively poor performance of the 3-point rating scale in relation to the other two rating scales suggests the possibility of combining the scale’s categories together in an attempt to improve its performance. Table 3 shows two possible strategies for rescoring the 3-
point rating scale into a dichotomous rating scale. The first column of Table 3 is the rescoring code used in the WINSTEPS (Linacre, 2004c) command file to combine the different categories together. The second column is the person separation index. This measure is roughly equivalent to traditional test reliability, with low values indicating a narrow range of person measures (Linacre, 1997). Thus, in the context of evaluating different rescoring strategies, a greater person separation index means that the revised rating scale improves the capability of this subsection of the entrance examination to distinguish test takers along the communicative competence continuum.
Table 3. Rescoring Strategies for the 3-Point Rating Scale

rescaled categories    person separation index    category label    observed count    percent    observed averages    structure calibration    outfit MNSQ
011                    1.28                       0                 249               11         1.71                 none                     1.10
                                                  1                 2924              89         2.63                 0.00                     1.00
001                    1.28                       0                 901               27         0.18                 none                     1.12
                                                  1                 2384              73         2.06                 0.00                     1.06
The first rescoring strategy combines categories 1 and 2 together so that test takers receive full credit even if their responses did not clearly identify a specific feature on the figure. The second rescoring strategy combines categories 0 and 1 together, resulting in a more severe grading criterion where test takers receive credit only if their response clearly identifies a specific feature on the figure. Both rescaling strategies result in improved rating scale performances with the observed average measures for each category increasing in a monotonic fashion. The outfit statistics for the different categories are also within an acceptable level. In addition, both rescoring strategies improve the person separation index for this subsection of the entrance examination from 1.26 to 1.28. In summary, the 3-point scale for the first set of items initially featured misordered observed averages between the first two categories, which in turn caused misordered category thresholds with category 1 never being the most probable score awarded to test takers. The outfit mean square of category 0 also exceeded the benchmark of 2. These departures from the Rasch model, however, may be due to the fact that this rating scale had a different grading criterion compared to the other rating scales used in this subsection of the entrance examination. A separate analysis independent of the other two rating scales revealed a much improved 3-point rating scale. This gain in performance, however, comes at the expense of excluding 35% of the test takers. Ultimately, transforming the 3-point scale into a dichotomous rating scale resulted in improved rating scale performance. Combining category 1 with either category 0 (i.e., the severe grading criterion) or category 2 (i.e., the more lenient grading criterion) helped this rating scale meet the expectations of the Rasch model. In addition, these rescoring strategies slightly improved the person separation index for this subsection of the entrance examination.
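In practice, collapsing categories is simply a recoding of the raw ratings before the Rasch analysis is rerun. The short sketch below mirrors the "011" and "001" rescoring codes in Table 3; the score vector is invented for illustration, and the actual rescoring in this study was specified in the WINSTEPS command file rather than in Python.

```python
def rescore(scores, recode):
    """Collapse rating scale categories according to a recode map."""
    return [recode[s] for s in scores]

original = [0, 1, 2, 2, 1, 0, 2]

# Lenient strategy ("011"): partial credit is promoted to full credit.
lenient = rescore(original, {0: 0, 1: 1, 2: 1})
# Severe strategy ("001"): only responses that clearly identify the feature earn credit.
severe = rescore(original, {0: 0, 1: 0, 2: 1})

print(lenient)   # [0, 1, 1, 1, 1, 0, 1]
print(severe)    # [0, 0, 1, 1, 0, 0, 1]
```

Any comparison of the strategies then rests on refitting the model to the recoded data and inspecting the resulting category statistics and person separation index.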
Analysis of the Second Set of Questions and the 4-Point Rating Scale The second set of four questions utilized a 4-point scale to score test takers’ written responses to a four-line two-way dialogue referring to the figure. Test takers were penalized for any grammatical and spelling errors. Table 4 shows the performance of this 4-point rating scale.
Table 4. The Performance of the 4-Point Rating Scale

category label    observed count    percent    observed averages    structure calibration    outfit MNSQ
0                 760               17%        0.06                 none                     1.56
1                 62                1%         0.31                 2.64                     0.82
2                 456               10%        0.63                 -1.54                    0.77
3                 3102              71%        1.02                 -1.11                    1.02
Similar to the first set of questions, the number of full points awarded to test takers is high at 71%. Yet, the observed average measure for each category increases monotonically from 0.06 to 1.02 logits. This ordering suggests that the different categories are detecting differing levels of communicative competence amongst the test takers. These differences, however, do not produce monotonically ordered category thresholds. Figure 5 shows that categories 1 and 2 are never the most probable scores given to test takers. The distances between the disordered category thresholds also never reach the benchmark of 1.4 logits. On a positive note, the outfit mean squares for the different categories do not exceed 2.
Figure 5. The Probability Curves for the 4-Point Rating Scale.
Although the observed average measures for categories 1 and 2 suggest that different levels of communicative competence exist, the infrequent use of these categories creates a scale with disordered category thresholds. Thus, similar to the previous analysis of the 3-point scale, it is worthwhile considering whether combining less frequently used categories with more frequently used categories would result in improved scale performance.

Table 5. Possible Rescoring Strategies for the 4-Point Scale and Their Results

rescaled categories    person separation index    category label    observed count    percent    observed averages    structure calibration    outfit MNSQ
0112                   1.36                       0                 760               17         0.21                 none                     1.30
                                                  1                 518               12         0.66                 0.80                     0.87
                                                  2                 3102              71         1.29                 -0.80                    1.02
0111                   1.40                       0                 760               17         0.9                  none                     1.03
                                                  1                 3616              83         2.24                 0.00                     1.04
0011                   1.40                       0                 822               19         0.87                 none                     1.13
                                                  1                 3554              81         2.08                 0.00                     1.04
0001                   1.42                       0                 1270              29         0.4                  none                     0.99
                                                  1                 3102              71         1.37                 0.00                     0.98
The first rescoring strategy combines categories 1 and 2 together in an attempt to strengthen the step between receiving no credit and receiving full credit. Although this approach does increase the number of occurrences at the midpoint of the scale, category 1 never becomes the most probable score awarded to test takers (as illustrated in Figure 6). This rescoring strategy, however, improves the person separation index from 1.28 to 1.36 for this subsection of the entrance examination.
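The person separation index referred to throughout this section is taken from the Rasch software output, but its logic can be illustrated directly: it is the ratio of the "true" spread of the person measures to their average measurement error, and it maps onto a reliability-like coefficient. The sketch below uses hypothetical person measures and standard errors, not values from this examination.

```python
import numpy as np

def person_separation(measures, std_errors):
    """Person separation: true spread of person measures over their average measurement error."""
    observed_var = np.var(measures, ddof=1)
    error_var = np.mean(np.square(std_errors))
    true_var = max(observed_var - error_var, 0.0)
    return np.sqrt(true_var) / np.sqrt(error_var)

# Hypothetical person ability estimates (logits) and their standard errors.
measures = np.array([-1.2, -0.4, 0.0, 0.3, 0.8, 1.5, 2.1])
std_errors = np.array([0.55, 0.50, 0.48, 0.48, 0.50, 0.55, 0.62])

sep = person_separation(measures, std_errors)
reliability = sep**2 / (1 + sep**2)   # the familiar reliability analogue
print(round(float(sep), 2), round(float(reliability), 2))
```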
Figure 6. The Probability Curves for the First Rescoring Strategy.
The probability curves shown in Figure 6 suggest the possibility that the second set of questions might be best treated as a set of dichotomous items. Thus, the remaining three rescoring strategies use increasingly severe grading criteria to combine the 4-point rating scale into a dichotomous rating scale. The second strategy provides test takers with full credit regardless of spelling or grammatical errors. The third strategy allows test takers to make one spelling or grammatical error without being penalized. The fourth strategy only rewards correct responses with no spelling and/or grammatical errors. Table 5 shows the performance of these three rescaling strategies. One unsurprising trend is that as the grading criterion becomes more severe, the number of perfect scores awarded decreases. This relationship thus highlights the substantial role grading criteria play in determining the level of difficulty of this set of items. The dichotomous rescoring strategies also produce better person separation indexes, with the most severe grading criterion providing the greatest gain: 1.42. In summary, the 4-point rating scale used with the second set of items initially did not have many observed counts at the midpoints of the scale. The infrequent use of these categories suggested the possibility of combining them together in order to establish a more meaningful step between receiving no credit and receiving full credit. However, the results of this rescoring strategy revealed that a dichotomous grading scale may be a more suitable approach. Transforming the 4-point scale into a dichotomous rating scale involved experimenting with three rescoring strategies that differed according to their severity. All three improved the entrance examination’s person separation index, with the most severe grading criterion contributing the greatest gain. Analysis of the Third Set of Questions and the 5-Point Scale The third set of items utilized a 5-point rating scale to assess test takers’ ability to write a series of sentences using information from the figure. Test takers were penalized for misinterpreting the figure, grammatical errors, and spelling mistakes. Table 6 shows the performance of this 5-point rating scale for this set of questions.
Table 6. The Performance of the 5-Point Rating Scale

category label    observed count    percent    observed averages    structure calibration    outfit MNSQ
0                 1142              35%        -0.91                none                     0.76
1                 435               13%        -0.47                0.29                     0.82
2                 694               21%        -0.22                -0.88                    0.62
3                 605               18%        0.00                 0.05                     0.92
4                 409               12%        0.44                 0.58                     0.89
Unlike the two previous sets of questions, the distribution of the observed counts for this rating scale is positively skewed with category 0 being the most frequent score given to test takers. In other words, the third set of items in this subsection of the entrance examination posed a greater challenge for test takers. The observed average measures for the different categories are also hierarchically ordered ranging from -0.91 to 0.44 logits. This range suggests that the different categories are able to detect differences in test takers’ level of
communicative competence. The category thresholds, however, are not monotonically ordered, with category 1 never being the most probable score awarded to test takers (as shown in Figure 7). Because the distances between the different thresholds do not reach the benchmark of 1.4 logits, the differences between the categories may also not be distinct. Finally, all of the outfit statistics are within an acceptable range.
Figure 7. The Probability Curves for the 5-Point Rating Scale.
The improbable score of 1 suggests that it might be fruitful to combine this category with the one above it in an attempt to make every step of the scale the most probable score for test takers at varying levels of communicative competence. This rescoring strategy thus requires recoding the data from 01234 to 01123. Table 7 shows the performance of this new 4-point scale.
Table 7. The Performance of the Rescored 4-Point Scale

category label    observed count    percent    observed averages    structure calibration    outfit MNSQ
0                 1142              35%        -1.04                none                     0.83
1                 1129              34%        -0.47                -0.72                    0.72
2                 605               18%        -0.11                0.26                     0.92
3                 409               12%        0.46                 0.35                     0.94
This rescoring strategy produces a scale where all of the categories are the most probable score somewhere along the communicative competence continuum (as shown in Figure 8). The distances between the different category thresholds, however, remain a concern. This is especially true of the last two thresholds where the distinction between good and superior
levels of English communicative competence is less defined. In addition, the new 4-point rating scale reduces the person separation index from 1.46 to 1.34. This substantial drop is a result of eliminating the differences that existed between categories 1 and 2. Thus, this rating scale might best be left as a 5-point scale that detects different levels of communicative competence despite category 1 never being the most probable score given to test takers.
Figure 8. The Probability Curves for the Rescored 4-Point Rating Scale.
In summary, the 5-point rating scale initially seemed suited to be combined into a 4-point scale because of the relatively small distances between category thresholds and its disordered category threshold structure. Combining categories 1 and 2 together, however, reduced the performance of the scale in terms of its contribution to the person separation index. In addition, the monotonically ordered category thresholds compressed the distances between the last two thresholds, which is an important part of the scale that distinguishes good from superior levels of communicative competence. As a result, the original 5-point scale provides the optimal performance despite category 1 never being the most probable score given to test takers. The Combined Effect of the Different Rating Scales So far this investigation has focused upon the performances of the different scales in isolation. However, these scales work together in this subsection of the university entrance examination to define test takers’ level of communicative competence. Thus, it is imperative to investigate the collective performance of these scales. This analysis utilizes two interrelated perspectives. The first examines the range of person ability estimates that the scales collectively define. The second investigates the extent to which the rating scales target the test takers’ level of communicative competence so that there is a concentration of categories around the probable cut point for university admissions. Figure 9 shows the initial performance of the three rating scales in relation to the test takers’ person ability estimates. Before continuing on to this analysis, it is worthwhile to first review the different pieces of information contained in Figure 9 and how they relate to
each other. Understanding these relationships will enable a clearer understanding of the collective performance of the three rating scales. On the right side of the Figure 9, there is a list of the items that compose this subsection of the English entrance examination. These items are organized according to their level of difficulty. Item 3.2 is the most difficult question because test takers needed a person ability estimate of 1.85 logits to have a 50 per cent chance of receiving the maximum score of 4 (the long double-headed arrow shows this relationship). In contrast, test takers needed a person ability estimate of 0.71 to receive the maximum score of 2 on the easiest question, item 1.1 (illustrated by the shorter double-headed arrow). At the bottom of Figure 9, there is a list of the test takers organized according to their person ability estimate. The number of test takers at each level should be read vertically. For example, there are 41 test takers whose level of communicative competence is at 0 logits. It should also be noted that 0 logits represents the average level of item difficulty for this subsection of the examination. This score thus provides an important referent point when comparing the average item difficulty with the test taker’s average ability. The M at the bottom of Figure 9 represents the mean score for the test takers. Once again reading vertically, 68 test takers are at this level of communicative competence. Returning to the comparison between item difficulty and test taker ability, we can conclude that the test takers’ ability exceeded these items’ average level of difficulty. The S and T marks at the bottom of Figure 9 signify one and two standard deviations above and below the average test takers’ person ability estimate.
Figure 9. The Collective Performance of the Original Rating Scales. The Collective Performance of the Original Rating Scales The different categories of the three rating scales cover a range of 2.12 logits, from -0.27 logits for items 1.1, 2.1, and 2.3, to 1.85 logits for item 3.2. This range of categories provides a substantial overlap with the test takers’ differing levels of communicative competence. Test takers located on the ends of the continuum, however, are not covered. In the context of an English university entrance examination, the lack of categories at the bottom end of the communicative competence continuum is not problematic considering that admission decisions are not made at this end of the continuum. At the other extreme, the lack of
categories for test takers whose person ability estimate exceeds 1.85 logits creates a potential ceiling effect where there are no possible scores to differentiate the top 43 test takers. The lack of categories at the upper levels, however, may also not be a serious matter considering that the cut point for university admissions will probably be located somewhere between the mean and one standard deviation above the mean (represented by the shaded box in Figure 9). Within this range of person ability, all of the different rating scales, with the exception of item 1.1, have a category that test takers have a 50% possibility of receiving. The Collective Performance of the Revised Rating Scales The evaluation of the individual rating scales suggested that two of them could be revised to improve their performance. These changes involved transforming the 3-point and 4-point scales into dichotomous rating scales. In the case of the 3-point scale, both rescoring strategies led to improved scale performance and an increased person separation index; whereas, for the 4-point rating scale, the most severe grading criterion provided the greatest gains. The two rescoring strategies for the 3-point rating scale, however, differed in terms of their impact upon the collective performance of all rating scales used in this subsection of the entrance examination. The lenient rescoring strategy produced a slightly better person separation index of 1.57, compared to the 1.56 achieved with the severe rescoring strategy. As a result, the collective analysis of the rating scales uses the lenient grading criterion for the first set of items and the severe grading criterion for the second set of items. Figure 10 shows the collective performance of the revised rating scales. The range that the different categories cover increases to 3.34 logits, from -0.28 logits for item 1.1 to 3.06 logits for item 3.2. Although this increased coverage predominately occurs at the higher end of the communicative competence continuum, the number of test takers whose person ability estimates exceed the last category of item 3.2 is only slightly reduced. The revised scales, however, have a higher concentration of categories within the targeted area (once again represented by the shaded box). This concentration in turn helps provide more stable person ability estimates in an area of the communicative competence continuum which is important for university admission decisions.
Figure 10. The Collective Performance of the Revised Rating Scales.
In summary, the collective performances of the original rating scales and the revised rating scales differed in terms of the range of person ability estimates they covered. This difference improves the person separation index for this subsection of the entrance examination from 1.26 to 1.57. As a result, the items in this subsection do a better job identifying test takers’ location along the communicative competence continuum. The revised scales’ enlarged area of coverage, however, only resulted in the inclusion of a few more test takers with high levels of communicative competence. The revised scales most significant contribution thus lies in the increased concentration of categories in the target area between the mean person ability estimate and one standard deviation above it. This higher concentration helps ensure that the person ability estimates in this important area of the communicative competence continuum are as stable as possible. Discussion It is important to note that these analyses were not completed to determine test takers’ scores on the English section of the university entrance examination. The focus of this investigation rests solely on the performance of the different rating scales and the types of items utilized in the entrance examination to assess test takers’ level of communicative competence. The end goal is to optimize rating scale performances for future examination administrations by providing test writers with empirical information that will assist them in designing more sensitive rating scales. This information can also help raters to reflect upon their use of different rating scales as they grade test takers’ responses. With these goals in mind, the present investigation provides the following insights and suggestions for further improvement. The First Set of Items and the 3-Point Rating Scale The combination of misordered observed averages and disordered category thresholds in the original three-point rating scale highlights the issue of consistency between different grading criteria used to assess test takers’ level of communicative competence. In isolation, this 3-point rating scale met many of the Rasch model’s expectations. Yet when it is analyzed with the other rating scales, the 3-point rating scale performs better as a dichotomous rating scale. This finding is not to argue that all questions on an entrance examination must be graded according to the same criterion. However, different grading criteria not only influence rating scale performance, but also the location of test takers’ on the communicative competence continuum. As a result, it is important that whatever criterion is used the rating scales perform at their optimal level. The Second Set of Items and the 4-Point Rating Scale Another important issue arising from this investigation is the degree of compatibility between the output required of an item and the rating scale. For example, the second set of questions required test takers to provide a minimal written response to demonstrate that they could interpret the figure with the aid of the four line two-way dialogue. Differentiating test takers’ ability to do so with a 4-point rating scale resulted in disordered category thresholds. The application of the different rescoring strategies ultimately led to the conclusion that this 4-point rating scale should be collapsed into a dichotomous rating scale to maximize its performance. In short, the revised dichotomous rating scale reflects the true demands of the
second set of items with the most severe grading criterion providing the best person separation index. The observed average measure for each category in the original 4-point scale, however, detected a continuum of communicative competence. This finding thus suggests that differences between test takers might be more salient if the demands of the item were increased. Previous investigations (Weaver & Romanko, 2005; Weaver & Sato, 2005) examining this aspect of item design have found that the demands that an item places on test takers influences its level of difficulty on university entrance examinations. One way to increase the demands for this set of questions might involve requiring test takers not only to correctly interpret the figure, but also to write a descriptive account that corresponds to the information provided in the four line two-way dialogue. This added demand would require test takers to produce longer responses that in turn would provide more opportunities to differentiate test takers’ abilities with a 4-point scale. Thus, it is imperative when designing rating scales to consider the compatibility between the demands of an item and the number of categories that a scale is to have in order to maximize rating scale performance. The Third Set of Items and the 5-Point Rating Scale Another important issue emerging from this investigation relates to the implications of combining categories to meet the expectations of the Rasch model. In the case of the 5-point scale, category 1 was never the most probable score awarded to test takers. Although combining categories 1 and 2 resulted in monotonically ordered category thresholds, the person separation index dropped considerably resulting in a less sensitive measure of test takers’ current level of communicative competence. This loss of sensitivity is especially apparent with the close proximity of the last two thresholds, which distinguish the upper levels of communicative competence. The close proximity of these thresholds is also of concern if one considers the measurement error associated with exact location of the different category thresholds. The original 5-point rating scale thus seems to offer the best performance despite not meeting some of the Rasch model’s expectations. In other words, the Rasch model provides numerous sources of information concerning rating scale performance, which sometimes needs to be prioritized in order to meet to the needs of a particular testing context. The Collective Performance of the Different Rating Scales Analyzing the collective performance of the different rating scales provides keen insights into how this subsection of an English entrance examination defines test takers’ level of communicative competence. This type of analysis provides test writers with an opportunity to examine the range of person ability estimates that different rating scales individually and collectively cover. This information can be a valuable resource when designing rating scales for future examinations so that there are enough rating scale categories around the level of communicative competence that may serve as the cut point for university admissions. A contrastive analysis between the original and the revised rating scales also provides test writers with an indication of the extent to which rating scale optimization can improve the performance of an entrance examination. 
For example, this investigation found that the number of test takers whose person ability estimate was more than one standard deviation above the mean did not differ significantly between the original and revised scales. Thus if there is a need to differentiate this level of test takers, the minimal difference between the original and the revised scales suggests that the best approach would be to either increase the
difficulty level of the preexisting questions or include another set of items targeting this level of communicative competence. In short, rating scale optimization is only one component of sound measurement. Limitations of this Investigation and Areas for Future Study The results arising from this investigation need to be considered carefully. Insights gained from analyzing and revising the different rating scales, for example, cannot directly be applied to the next administration of the English entrance examination because the questions on these examinations are never reused. After each administration, the university publicly discloses the questions on the examination, which then appear in numerous commercially produced test preparation books. Information pertaining to rating scale optimization is, however, a valuable resource when developing rating scales for future entrance examinations that feature question types similar to the ones analyzed in this investigation. An area for future research might thus be the examination of the degree to which the benefits of rating scale optimization transfer across similar item types and over multiple administrations of an entrance examination. Recommendations for rating scale optimization must also consider the extent to which the test takers, whose responses are scored with the different rating scales, represent the larger population of test takers. The present investigation, for example, is based upon 1108 test takers’ responses. However, this sample size does not include the 561 test takers’ who took the entrance examination but did not attempt these three sets of questions. Thus, another interesting line of research might be to investigate the extent to which different populations of test takers influence rating scale performance. The act of transforming the 3-point and 4-point rating scales into dichotomous rating scales reveals not only the impact that different grading criteria have upon rating scale performance, but also the qualitative decisions involved in this process. This investigation, for example, constantly attempted to maximize this subsection of the entrance examination’s person separation index. Future investigations might examine how different criteria, such as maximizing the distance between the category thresholds, influence which categories are combined to optimize rating scale performances. Conclusion Optimal rating scale performance is an important component of sound measurement practice that helps produce valid and reliable test scores in high-stakes assessment situations such as a university entrance examination. This investigation’s focus on the internal properties of rating scales illustrates how different grading criteria and question-type demands mediate rating scale performance. Moreover, this investigation demonstrates how rating scale performance can be analyzed individually or collectively, which in turn provides insights into the interesting interaction between these two perspectives. Finally, this investigation reveals the need to sometimes prioritize which expectations of the Rasch measurement model are essential as well as the extent to which rating scale optimization can improve the performance of a set of test items. Taken together these insights provide a foundation for further rating scale development and refinement with the ultimate goal of optimized performance in order to meet the needs of a particular assessment situation, whatever they might be.
Acknowledgements I would like to thank the English Language Institute of the University of Michigan for funding this research. I am deeply grateful to Yoko Sato for critical comments and helpful insights on this analysis. I would like to express my thanks to Rick Romanko, Masanori Funakura, and many other professors at Tokyo University of Agriculture and Technology who have helped me in one way or another with this project. Also heartfelt appreciation goes to Hidefumi Kobatake, Katsuaki Sato, Seizo Miyata, Noatoshi Kanda, and Masakuni Matsuoka. Without their commitment to this project, it would have never happened.
References
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573.
Bond, T., & Fox, C. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Lawrence Erlbaum.
Ito, A. (2005). A validation study on the English language test in a Japanese nationwide university entrance examination. Asian EFL Journal, 7(2), 91–117.
Kondo-Brown, K. (2002). An analysis of rater bias with FACETS in measuring Japanese L2 writing performance. Language Testing, 19(1), 1–29.
Linacre, J. (1997). KR-20 or Rasch reliability: Which tells the “truth”? Rasch Measurement Transactions, 11(3), 580–581.
Linacre, J. (1999). Investigating rating scale category utility. Journal of Outcome Measurement, 3(2), 103–122.
Linacre, J. (2004a). Optimizing rating scale category effectiveness. In E. Smith & R. Smith (Eds.), Introduction to Rasch measurement (pp. 258–278). Maple Grove, MN: JAM Press.
Linacre, J. (2004b). Rasch model estimation: Further topics. Journal of Applied Measurement, 5(1), 95–110.
Linacre, J. (2004c). WINSTEPS (Version 3.57.1) [Computer software]. Chicago: Winsteps.com.
Ministry of Education, Culture, Sports, Science, and Technology. (2003). Regarding the establishment of an action plan to cultivate “Japanese with English abilities.” Retrieved April 1, 2004, from www.mext.go.jp/english/topics/03072801.htm
Myford, C., & Wolfe, E. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part 1. Journal of Applied Measurement, 4(4), 386–422.
Schaefer, E. (2004). Multi-faceted Rasch analysis and native-English-speaker ratings of Japanese EFL essays (Doctoral dissertation, Temple University, Japan, 2004). Dissertation Abstracts International, AAT (UMI Number: 3125553).
Shizuka, T., Takeuchi, O., Yashima, T., & Yoshizawa, K. (2006). A comparison of three- and four-option English tests for university entrance selection purposes in Japan. Language Testing, 23(1), 35–57.
Smith, E., Wakely, M., deKruif, R., & Swartz, C. (2003). Optimizing rating scales for self-efficacy (and other) research. Educational and Psychological Measurement, 63(3), 369–391.
Weaver, C. (2005, May). How university entrance examination scores can inform more than just admission decisions. Paper presented at the Japanese Association of Language Teachers Testing and Evaluation special interest group annual conference, Tokyo, Japan.
Weaver, C., & Romanko, R. (2005). Assessing oral communicative competence in a university entrance examination. The Language Teacher, 29(1), 3–9.
Weaver, C., & Sato, Y. (2005). Kobetsu-gakuryoku-kensa (eigo) no Rasch bunseki [A Rasch analysis of the English section of a university entrance examination]. The Journal of Research on University Entrance Exams, 15, 147–153.
Wright, B., & Masters, G. (1982). Rating scale analysis. Chicago: MESA Press.
Detecting DIF across Different Language and Gender Groups in the MELAB Essay Test using the Logistic Regression Method Taejoon Park Teachers College, Columbia University
This study has investigated differential item functioning (DIF) on the writing subtest of the Michigan English Language Assessment Battery (MELAB). A total of ten writing prompts were examined using a three-step logistic regression procedure for ordinal items. An English Language Ability (ELA) variable was created by summing the standardized MELAB Grammar, Cloze, Vocabulary, Reading (GCVR), and Listening scale scores. This ELA variable was used to match examinees of different language (Indo-European vs. non–Indo-European) and gender (male vs. female) groups. The results of this study showed that although a few prompts were initially flagged because of statistically significant group effects, the effect sizes were far too small for any of those flagged prompts to be classified as having an important group effect.
That a test should not be biased is an important consideration in the selection and use of any psychological test. That is, it is essential that a test is fair to all applicants, and is not biased against a subgroup of the test-taker population. Bias can result in systematic errors that distort the inferences made from test scores. Test bias occurs when items contain sources of difficulty that are irrelevant or extraneous to the construct or ability being measured, and these extraneous or irrelevant factors affect performance (Zumbo, 1999). This issue of test bias has been the subject of a great deal of recent research in the field of educational measurement, and a statistical procedure called Differential Item Functioning (DIF) analysis has become an indispensable part of test development and validation. In the context of the impromptu essay test, when a prompt or topic is biased against a certain subgroup(s) of examinees, it could potentially distort the meaning of the essay score for different examinee subgroups and become a potential threat to the validity of score interpretations (Sheppard, 1982). For this reason, DIF screening must become an indispensable part of test development and validation. MELAB Essay Test Direct assessments of writing ability such as the MELAB essay test have considerable appeal because, unlike their less direct, multiple-choice counterparts, they actually require examinees to write, not merely to recognize the conventions of standard written English. Along with their appeal, however, these tests carry a special burden that does not encumber traditional multiple-choice tests. That is, they typically consist of a single writing task, which limits the generalizability of the results and may disadvantage test takers who happen to have little interest or background in the assigned topics (Weigle, 2000). Thus, it is incumbent upon test developers to devise essay prompts that are fair to all examinees and as comparable to one another as possible. The aim is that no test takers will be unfairly disadvantaged by being
administered a prompt whose content is so unfamiliar or unengaging as to hinder the demonstration of their writing ability. In the MELAB essay test, which is the first part of the battery, examinees are given 30 minutes to write on a single topic they choose from a set of two topics. At any given time approximately 25 sets of topics (or 50 individual topics) are being used as MELAB prompts. Since only one essay prompt is administered per examinee in the composition section, it becomes a very important issue to ensure that each prompt is as fair as possible to any subgroups of examinees who take the same writing prompt from the pool of prompts. The MELAB program already does much to ensure the fairness of the writing prompts. According to the MELAB technical manual (English Language Institute, 2003, pp. 33–34): Topic writers are advised to consider various constraints of the test situation, expectations about the writing outcome, and various characteristics of the examinees when formulating possible MELAB writing topics. Because it is expected that the text produced by the examinee will be at least 150 words long, prompt writers are advised to develop topics that are broad enough that someone can write at length. Certain topics are avoided, specifically those that might elicit formulaic or previously prepared responses . . . . Characteristics of the examinees that are considered when devising topics are that the examinees come from a range of linguistic and cultural backgrounds, have various educational backgrounds, and are different ages. Topics are developed to be accessible and attractive to a range of young adult and adult examinees. Topics are avoided that could be considered politically or culturally objectionable or limited, or that require the examinee to draw on specialized knowledge of a culture, a field, or a discipline. Topics may call upon the personal experience, attitudes, or general knowledge of the examinees. Differential Item Functioning (DIF) DIF addresses the issue of differential validity across groups by examining the performance of two groups of interest after the groups have been matched on some criterion (Dorans & Holland, 1993). According to O’Neill and McPeek (1993), “The fundamental principle of DIF is simple: Examinees who know the same amount about a topic should perform equally well on an item testing that topic regardless of their sex, race, or ethnicity” (p. 256). DIF occurs when examinees of equal ability, but with different group membership, have unequal probabilities of success on an item (Angoff, 1993; Clauser & Mazor, 1998). As described above, in order to minimize the likelihood of this situation, test developers attempt to craft prompts that are as nearly equivalent as possible and thus, to the extent possible, ensure all essay prompts function similarly for all test takers. In this way, any between-group difference in performance on a prompt will be due to construct-relevant factors rather than to influences that are irrelevant to the assessment of writing ability. In addition, though, it is desirable to use statistical techniques (i.e., DIF procedures) for investigating potential bias, especially in high-stakes testing situations. The identification of a satisfactory DIF procedure for essay prompts is not an easy undertaking for a number of reasons. For multiple-choice items, well-established methods exist for detecting items that are differentially difficult for certain subgroups of test takers. 
However, there are currently no entirely satisfactory, well-researched comparable procedures
for determining whether essay prompts are differentially difficult for matched subgroups of test takers. As pointed out by Lee, Breland, and Muraki (2002), one of the most important challenges in conducting DIF on essay tests is to find an appropriate variable to use for matching examinees of two different language groups on their writing ability. In the DIF procedure, this overall matching must be accomplished before between-subgroup performance comparisons can be made for individual items. For standardized multiple-choice measures, the total score on the test typically serves this function. For direct writing assessments, a comparable internal matching criterion is not usually available because most high-stakes writing tests such as the MELAB essay test consist of only a single writing prompt. In such cases, the usual strategy is to use an external matching criterion such as scores on multiple-choice tests that measure similar knowledge, skills, or abilities (Lee, Breland, & Muraki, 2002). In the present study, an external matching variable was created by summing the standardized scale scores from the two multiple-choice subtests of the MELAB (Grammar, Cloze, Vocabulary, Reading [GCVR], and Listening) based on a recommendation by Penfield and Lam (2000). The underlying assumption is that if examinees have high ELA measured by two parts of the test combined, they should perform well overall on the essays, and vice versa (more detailed information on the matching variable is provided later in the Method section). Uniform and Nonuniform DIF An item shows uniform DIF when the performance of one group is always superior to another group for each ability level. All DIF procedures are capable of detecting uniform DIF. An item shows nonuniform DIF when the performance of one group is dependent upon ability level. Thus, nonuniform DIF displays an interaction between ability level and group membership. Because of this interaction, nonuniform DIF is much more difficult to interpret. The identification of nonuniform DIF in polytomous items may be more important than the identification of nonuniform DIF in dichotomous items (Spray & Miller, 1994). There are many possible group-by-response-by-score interactions that can manifest in the polytomous case. For a polytomous DIF detection method to be useful, the power for detecting such interactions should be sufficiently large. Many DIF procedures are not capable of detecting nonuniform DIF, and thus one criterion for judging the usefulness of DIF procedures is its ability to detect nonuniform DIF. DIF Detection Methods for Polytomous Items Although there are several statistical methods available for the detection of differentially functioning items that are scored dichotomously (e.g., IRT approaches, the Mantel-Haenszel statistic, and SIBTEST), these methods are not directly applicable to polytomously scored items. Recently, various methods of investigating DIF in polytomous items have been developed, including logistic regression (Zumbo, 1999), logistic discriminant function analysis (Miller & Spray, 1993), Polytomous IRT (Muraki, 1999), the Generalized Mantel-Haenszel procedure (Zwick, Donoghue, & Grima, 1993), the polytomous SIBTEST procedure (Chang, Mazzeo, & Roussos, 1995), and the standardization method (Dorans & Schmitt, 1991). Among these methods, however, those requiring an internal criterion (e.g., polytomous IRT and polytomous SIBTEST) are not feasible for this study. 
Moreover, methods such as the Generalized Mantel-Haenszel procedure and the standardization method
lack the power to detect nonuniform DIF (Miller & Spray, 1993), which may be even more important when dealing with polytomous items because of the multiple ways in which item scores can interact with the total score (Spray & Miller, 1994). For the present study, the logistic regression DIF method was selected because of its ability (1) to detect both uniform and nonuniform DIF and (2) to supplement the statistical test with a measure of the practical significance of DIF. The logistic regression method employed in this study is described in detail in the Method section. The primary purpose of this study is to investigate DIF across different language and gender groups on the MELAB essay test, using the logistic regression method.

Method

Sample

The original sample of data available for this study comprised 5,991 examinees who responded to 75 different topics. Sixty-five prompts with small sample sizes (n < 140) were dropped from the analysis. Of the 2,269 examinees used for the logistic regression analyses, 686 were male and 1,583 were female. A total of 575 examinees were categorized as belonging to an Indo-European language group and 1,694 to a non-Indo-European language group. The most frequently reported first language was Tagalog/Filipino (27.90%), followed by Cantonese/Mandarin (13.84%), Korean (10.31%), Farsi/Persian (4.72%), Malayalam (4.58%), Russian (4.23%), Spanish (2.69%), Arabic (2.42%), Urdu (2.12%), Japanese (2.03%), English (1.85%), Somali (1.85%), Hindi (1.63%), Tamil (1.41%), other Asian (1.37%), Romanian (1.32%), Vietnamese (1.28%), Punjabi (1.15%), Indonesian (1.06%), other African (0.93%), Bengali (0.88%), Gujarati (0.88%), Amharic (0.71%), French (0.71%), Polish (0.62%), Turkish (0.53%), Thai (0.48%), Portuguese (0.44%), Albanian (0.40%), Bulgarian (0.35%), German (0.31%), Tibetan (0.31%), Ukrainian (0.31%), Croatian (0.22%), Oromo (0.22%), Serbian (0.22%), and others.

Instruments

The data analyzed included scores on the Writing, Grammar, Cloze, Vocabulary, Reading (GCVR), and Listening subtests of the MELAB. The MELAB Writing score is based on two independent readings and holistic ratings of the essay response on a ten-step scale (53, 57, 63, 67, 73, 77, 83, 87, 93, or 97). The Writing test score is basically the average of the two reader ratings; a third reader is used when the two initial ratings differ by more than one scale point (see Appendix A for descriptions of each score level). The Listening and GCVR subtests have score ranges of 30 to 100 and 15 to 100, respectively.

A matching variable, named the ELA score, was created by (a) taking all the examinees who responded to the same writing prompt; (b) standardizing the scale scores of the Listening and GCVR subtests separately, based on the total examinee sample for that prompt; and (c) summing the standardized scores of the two subtests for each examinee. One might argue that the GCVR score alone (i.e., without the Listening score) would be a more valid matching variable for writing ability. However, when the scale scores from the two subtests (i.e., GCVR and Listening) were standardized and combined, the correlation between the essay score and the matching criterion was maximized. Thus, a decision was made to create the matching variable by summing the standardized scale scores from the two subtests for each of the prompts analyzed in this study.
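To make the construction of the matching variable concrete, the following sketch shows the within-prompt standardize-and-sum logic in Python with pandas (the study itself used Stata and Excel). The column names are illustrative assumptions rather than the actual field names of the MELAB data files.

```python
import pandas as pd

def add_ela_score(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with an 'ela' matching score added.

    Within each writing prompt, the GCVR and Listening scale scores are
    z-standardized and the two z-scores are summed for each examinee.
    Assumes columns 'prompt', 'gcvr', and 'listening' (illustrative names).
    """
    def zscore(s: pd.Series) -> pd.Series:
        return (s - s.mean()) / s.std(ddof=1)

    out = df.copy()
    out["z_gcvr"] = out.groupby("prompt")["gcvr"].transform(zscore)
    out["z_listening"] = out.groupby("prompt")["listening"].transform(zscore)
    out["ela"] = out["z_gcvr"] + out["z_listening"]
    return out
```

Standardizing within each prompt, rather than across the whole sample, mirrors the decision described above to base the matching variable on the examinees who responded to a given prompt.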
Computer Programs

Stata version 8.0 was used to conduct all the statistical analyses reported in this study (e.g., descriptive statistics, independent samples t-tests, and ordinal logistic regression). Microsoft Excel was used for data management.

Data Analyses

T-tests

Independent samples t-tests were computed using examinees' essay raw scores as the dependent variable and language background (Indo-European vs. non-Indo-European) or gender (male vs. female) as the independent variable. The assumption of homogeneous variance was checked with the Levene statistic: if the Levene statistic was significant, the t-test result assuming unequal variances was interpreted; if it was not significant, the t-test result assuming equal variances was interpreted. For all independent samples t-tests, α = 0.05 was used and two-tailed results were examined. Practical significance was evaluated by computing Cohen's d (standardized mean difference effect size), using cutoffs of 0.2, 0.5, and 0.8 for small, medium, and large effects, respectively (Cohen, 1992).

Logistic Regression for DIF

French and Miller (1996) and Zumbo (1999) demonstrated that the logistic regression procedure (Hosmer & Lemeshow, 1989) can be extended to detect DIF in polytomous items. As pointed out by Lee, Breland, and Muraki (2002), logistic regression has two main advantages over linear regression. The first is that the dependent variable does not have to be continuous, unbounded, and measured on an interval or ratio scale; in the case of the MELAB essay data, the dependent variable (the essay score) is discrete and bounded between 53 and 97. The second is that it does not require a linear relationship between the dependent and independent variables. Thus, logistic regression allows the effect of group membership on the dependent variable to be investigated whether the relationship between the dependent and independent variables is linear or nonlinear. When a dependent variable is discrete and bounded while the independent variable is continuous, a nonlinear relation is likely to exist among the variables (Lee, Breland, & Muraki, 2002). For these reasons, the logistic regression method is well suited to this study.

Because each examinee's MELAB essay is scored on an ordinal scale, ordinal logistic regression was used. Ordinal logistic regression models the cumulative probability of the response categories and describes the relationships among the variables in the model. For the detection of DIF, the full ordinal logistic regression model used in this study is

logit[P(Y ≤ k)] = αk + b1(ELA) + b2(Group) + b3(ELA*Group),

where logit[P(Y ≤ k)] is the natural log of the cumulative odds of an essay score Y falling at or below category k (k = 0, 1, 2, . . . , m, where m is the number of categories in the ordinal scale), ELA is the matching variable used in this study, and ELA*Group is the interaction between the matching variable and group membership. More specifically, a three-step modeling process based on ordinal logistic regression (Zumbo, 1999) was used as the main method of analysis.
That is, the ordinal logistic regression analysis was conducted in three steps: in step 1, only the matching (conditioning) variable (i.e., the ELA score) was entered into the regression equation, as in logit[P(Y ≤ k)] = αk + b1(ELA); in step 2, the group membership variable was added, as in logit[P(Y ≤ k)] = αk + b1(ELA) + b2(Group); and in step 3 (the full model), the interaction term (ELA-by-Group) was added, as in logit[P(Y ≤ k)] = αk + b1(ELA) + b2(Group) + b3(ELA*Group).

Using this three-step modeling process, the logistic regression method compares the fit of each augmented model (entering additional variables hierarchically) with that of the more compact model. If the first augmented model, which includes the Group variable, fits the data better, the prompt shows DIF due to group membership; that is, if the null hypothesis b2 = 0 is rejected while b3 = 0 is tenable, the prompt shows uniform DIF. Likewise, if the fully augmented model, which includes the ELA*Group interaction, fits the data better, both ELA and group membership contribute to DIF; in other words, the null hypothesis b3 = 0 is rejected, and nonuniform DIF is present in the prompt.

To gauge the magnitude of any group difference, p-values for the chi-square tests were used along with R2 effect size estimates, which provide information about the practical significance of DIF (Zumbo, 1999). In the present study, the uniform R2 effect size is the increase in R2 obtained by adding the dummy group variable to the ELA-only (step-1) model. The nonuniform effect size is the increase in R2 obtained by adding the interaction term to the step-2 model. The total effect size is the sum of the uniform and nonuniform effects. There is little agreement on what constitutes a negligible, moderate, or large effect. Cohen (1988) considered R2 effect sizes of 0.02, 0.13, and 0.26 to be "small," "medium," and "large," respectively, which can be linked to standardized group mean differences (Cohen's d) of 0.20, 0.50, and 0.80. Zumbo (1999) suggested that, for an item to be classified as displaying DIF (an aggregate of uniform and nonuniform DIF), the two-degree-of-freedom chi-square test between steps 1 and 3 must have a p-value less than or equal to 0.01 and the R2 difference between the two models must be at least 0.13; this criterion corresponds to a "medium" effect size by Cohen's standard. Jodoin and Gierl (2001) suggested that R2 differences below 0.035, between 0.035 and 0.070, and greater than 0.070 be considered "negligible," "moderate," and "large" effects, respectively.

Results

Descriptive Statistics

Descriptive statistics for the ten prompts analyzed in this study are provided below to give a general overview of the score information for the comparison groups. Table 1 presents overall means and standard deviations of the raw essay and ELA scores for the Indo-European and non-Indo-European language groups, together with the standardized mean differences (Cohen's d) between the two groups. As shown in Table 1, both the ELA and the observed essay scores were higher for the Indo-European language group than for the non-Indo-European language group.
The standardized mean difference in the MELAB essay score, 0.49, would be viewed as a “small” effect size
(Cohen, 1988). The standardized mean difference in the ELA score, 0.24, may also be viewed as a “small” effect size.
Table 1. Means, Standard Deviations, and Standardized Group Mean Differences (Cohen's d) for Essay and ELA Scores for Indo-European (IE) and Non-Indo-European (NIE) Language Groups

Variable/Language group    Sample size    Mean     SD      d
MELAB essay score
  IE group                 575            77.39    6.82    0.49
  NIE group                1694           74.40    5.87
English language ability
  IE group                 575            0.18     1.03    0.24
  NIE group                1694           -0.06    0.97
Table 2 presents overall means and standard deviations of the observed essay and ELA scores for the male and female examinees. Standardized mean differences (i.e., Cohen’s d) between the two groups are also provided. As shown in Table 2 below, the observed essay scores were higher for the female group than for the male group. The standardized mean difference in the MELAB essay score, 0.12, would be viewed as a “small” effect size according to Cohen’s (1988) standard. The standardized mean difference in the ELA score was extremely small, as indicated by the almost identical group means.
Table 2. Means, Standard Deviations, and Standardized Group Mean Differences (Cohen's d) for Essay and ELA Scores for Male and Female Examinees

Variable/Gender    Sample size    Mean     SD      d
MELAB essay score
  Male             686            74.65    6.46    0.12
  Female           1583           75.39    6.16
English language ability
  Male             686            0.00     1.91    0.01
  Female           1583           0.00     1.75
T-Tests

As a preliminary analysis, independent samples t-tests were conducted to compare the means of the comparison groups. The results are shown in Tables 3 and 4 for the different language and gender groups, respectively.
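Before turning to those tables, the following sketch illustrates how each prompt-level comparison could be computed. It is written in Python with SciPy purely for illustration (the analyses reported here were run in Stata), and the variable names are assumptions.

```python
import numpy as np
from scipy import stats

def compare_groups(scores_a, scores_b, alpha=0.05):
    """Levene's test for equal variances, the corresponding independent
    samples t-test (pooled or Welch), and Cohen's d as the effect size."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)

    # Levene's test determines which t-test variant is interpreted.
    _, levene_p = stats.levene(a, b)
    equal_var = levene_p >= alpha
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=equal_var)  # two-tailed

    # Cohen's d based on the pooled standard deviation.
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    d = (a.mean() - b.mean()) / pooled_sd
    return {"t": t_stat, "p": p_value, "d": d, "equal_variances": equal_var}
```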
Table 3. Means, Standard Deviations, and Independent Samples T-Test Results by Language Group

         IE group               NIE group
Prompt   Mean (n)        SD     Mean (n)         SD     t       d
18       77.58 (n=57)    6.24   74.31 (n=109)    5.61   3.42*   0.55
25       78.15 (n=67)    5.87   76.04 (n=166)    5.50   2.59*   0.37
46       77.44 (n=81)    5.44   75.49 (n=284)    5.00   3.03*   0.38
48       77.37 (n=67)    5.97   74.99 (n=154)    5.72   2.80*   0.40
49       79.88 (n=78)    7.85   75.05 (n=236)    5.82   5.80*   0.75
51       76.77 (n=31)    7.93   74.30 (n=117)    5.98   1.90    0.38
52       75.93 (n=42)    6.77   73.44 (n=152)    6.13   2.27*   0.39
54       74.62 (n=47)    8.08   73.40 (n=172)    6.90   1.03    0.17
55       76.42 (n=31)    6.66   73.32 (n=130)    5.23   2.81*   0.55
56       77.19 (n=74)    7.11   72.49 (n=174)    6.14   5.25*   0.72
*p < 0.05, two-tailed.
Table 4. Means, Standard Deviations, and Independent Samples T-Test Results by Gender

         Male group             Female group
Prompt   Mean (n)        SD     Mean (n)         SD     t       d
18       75.52 (n=44)    6.45   75.40 (n=122)    5.89   0.11    0.02
25       76.41 (n=72)    5.36   76.75 (n=161)    5.83   -0.41   0.06
46       74.69 (n=87)    5.47   76.31 (n=278)    5.01   -2.58*  0.31
48       75.75 (n=81)    5.66   75.69 (n=140)    6.03   0.07    0.01
49       74.91 (n=112)   7.20   76.99 (n=202)    6.31   -2.66*  0.31
51       75.87 (n=39)    6.48   74.44 (n=109)    6.48   1.18    0.22
52       73.70 (n=47)    7.70   74.07 (n=147)    5.86   -0.34   0.06
54       75.51 (n=41)    7.65   73.93 (n=178)    7.04   -1.14   0.19
55       74.34 (n=93)    5.87   73.32 (n=68)     5.31   1.13    0.18
56       72.13 (n=70)    6.63   74.58 (n=178)    6.73   -2.59*  0.36
*p < 0.05, two-tailed.
As shown in Table 3, for the two language groups, significant mean differences were found for eight of the ten prompts analyzed. In terms of the standardized group mean difference index (Cohen's d), one, five, and four prompts had effect sizes considered small, medium, and large, respectively (Cohen, 1988). For the gender comparison, as presented in Table 4, significant mean differences were found for three of the ten prompts. Six prompts (18, 25, 48, 52, 54, and 55) had small effects, and the four remaining prompts (46, 49, 51, and 56) had medium effects. It should be noted that simple differences in mean scores on an item across examinee subgroups are not, by themselves, evidence of bias or unfairness. In some cases, examinees from two groups actually differ in the ability of interest, and differences in item performance are then to be expected; such differences are referred to as item impact (Clauser & Mazor, 1998). The real fairness issue is the extent to which DIF is present in any of the MELAB writing prompts. The results obtained from the logistic regression DIF method are provided below.
Logistic Regression DIF

As described in the Method section, a three-step modeling process based on ordinal logistic regression (Zumbo, 1999) was used as the main method of analysis. For example, the language group DIF results for prompt 25 can be summarized as follows:

Step 1. Model with ELA only: χ2(1) = 112.96, R2 = 0.1102
Step 2. Uniform DIF: χ2(2) = 113.97, R2 = 0.1112
Step 3. Uniform and nonuniform DIF: χ2(3) = 114.52, R2 = 0.1118

Taking the difference between steps 1 and 3 as the DIF test, the resulting statistics are χ2(2) = 1.57, p = 0.4568, and R2Δ = 0.0016. Given the nonsignificant p-value of 0.4568 (p > 0.01), this prompt does not demonstrate DIF across language groups. Moreover, the R2 effect size measure (R2Δ), which indicates the practical significance of DIF (Zumbo, 1999), is negligible even by Jodoin and Gierl's (2001) more conservative standard (R2Δ < 0.035). The results of the logistic regression DIF analysis for the language and gender comparison groups are summarized in Tables 5 and 6, respectively. As Tables 5 and 6 show, although a few prompts had statistically significant group differences (p < 0.01), the R2 effect sizes (R2Δ in Tables 5 and 6) were far too small for any prompt to be classified as showing a group effect, either across language groups or across gender.
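The nested-model comparisons illustrated above can be sketched as follows. The original analyses were run in Stata 8.0; this Python/statsmodels version is only an illustration of the three-step logic. It assumes a data frame containing the ordinal essay score, the ELA matching score, and a numeric 0/1 group indicator (all column names are illustrative), and it uses McFadden's pseudo-R2, which may not be the same R2 measure reported in Tables 5 and 6.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2
from statsmodels.miscmodels.ordinal_model import OrderedModel

def dif_three_step(df, score_col="essay", ela_col="ela", group_col="group"):
    """Fit the three nested ordinal (cumulative) logit models for one prompt
    and return likelihood-ratio chi-square tests with pseudo-R2 differences.
    The group column is assumed to be coded numerically (e.g., 0/1)."""
    cats = np.sort(df[score_col].unique())
    y = df[score_col].astype(pd.CategoricalDtype(categories=cats, ordered=True))
    x1 = df[[ela_col]]                                        # step 1: ELA only
    x2 = df[[ela_col, group_col]]                             # step 2: + group
    x3 = x2.assign(interaction=df[ela_col] * df[group_col])   # step 3: + ELA*group

    def fit(x):
        return OrderedModel(y, x, distr="logit").fit(method="bfgs", disp=False)

    m1, m2, m3 = fit(x1), fit(x2), fit(x3)

    # The thresholds-only (null) log-likelihood equals the multinomial
    # log-likelihood of the observed score-category proportions.
    counts = y.value_counts()
    ll_null = float((counts * np.log(counts / counts.sum())).sum())

    def mcfadden(m):
        return 1.0 - m.llf / ll_null

    def compare(reduced, full, df_diff):
        lr = 2.0 * (full.llf - reduced.llf)
        return {"chi2": lr, "df": df_diff, "p": chi2.sf(lr, df_diff),
                "R2_delta": mcfadden(full) - mcfadden(reduced)}

    return {"1 vs. 3": compare(m1, m3, 2),   # uniform + nonuniform DIF
            "1 vs. 2": compare(m1, m2, 1),   # uniform DIF only
            "2 vs. 3": compare(m2, m3, 1)}   # nonuniform DIF only
```

For each prompt, the three entries returned correspond to the "1 vs. 3," "1 vs. 2," and "2 vs. 3" comparisons reported in Tables 5 and 6.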
Table 5. Model Comparisons for the 10 MELAB Prompts across Different Language Groups

Prompt  Model  -2 Log likelihood  R2      Model comparison  χ2     df  p        R2Δ
18      1      83.61              0.1125  1 vs. 3           8.38   2   0.0151   0.0113
        2      91.98              0.1238  1 vs. 2           8.37   1   0.0038*  0.0113
        3      91.99              0.1238  2 vs. 3           0.01   1   0.9161   0.0000
25      1      112.96             0.1102  1 vs. 3           1.57   2   0.4568   0.0016
        2      113.97             0.1112  1 vs. 2           1.01   1   0.3151   0.0001
        3      114.52             0.1118  2 vs. 3           0.56   1   0.4551   0.0006
46      1      150.93             0.0987  1 vs. 3           4.98   2   0.0831   0.0032
        2      153.86             0.1006  1 vs. 2           2.92   1   0.0872   0.0019
        3      155.91             0.1019  2 vs. 3           2.05   1   0.1521   0.0013
48      1      105.79             0.1069  1 vs. 3           3.38   2   0.1842   0.0034
        2      109.03             0.1102  1 vs. 2           3.24   1   0.0720   0.0033
        3      109.17             0.1103  2 vs. 3           0.15   1   0.7015   0.0001
49      1      174.48             0.1185  1 vs. 3           14.95  2   0.0006*  0.0101
        2      181.74             0.1234  1 vs. 2           7.28   1   0.0070*  0.0049
        3      189.43             0.1286  2 vs. 3           7.68   1   0.0056*  0.0052
51      1      76.86              0.1148  1 vs. 3           12.11  2   0.0023*  0.0181
        2      87.75              0.1311  1 vs. 2           10.89  1   0.0010*  0.0163
        3      88.97              0.1329  2 vs. 3           1.21   1   0.2707   0.0018
52      1      95.69              0.1085  1 vs. 3           10.06  2   0.0065*  0.0114
        2      105.34             0.1194  1 vs. 2           9.64   1   0.0019*  0.0109
        3      105.75             0.1199  2 vs. 3           0.41   1   0.5209   0.0005
54      1      104.62             0.1024  1 vs. 3           5.11   2   0.0778   0.0049
        2      109.13             0.1068  1 vs. 2           4.51   1   0.0337   0.0044
        3      109.73             0.1073  2 vs. 3           0.60   1   0.4388   0.0005
55      1      75.29              0.1070  1 vs. 3           6.33   2   0.0422   0.0009
        2      80.64              0.1146  1 vs. 2           5.35   1   0.0207   0.0076
        3      81.62              0.1160  2 vs. 3           0.98   1   0.3212   0.0014
56      1      156.95             0.1336  1 vs. 3           16.08  2   0.0003*  0.0137
        2      168.22             0.1432  1 vs. 2           11.27  1   0.0008*  0.0096
        3      173.03             0.1473  2 vs. 3           4.81   1   0.0283   0.0041
*p < 0.01. Model 1 predictor: ELA; Model 2 predictors: ELA, Language group; Model 3 predictors: ELA, Language group, ELA*Language group.
Note. Model comparison 1 vs. 3: uniform + nonuniform DIF; model comparison 1 vs. 2: uniform DIF only; model comparison 2 vs. 3: nonuniform DIF only. Effect size measure (Jodoin & Gierl, 2001): R2Δ < 0.035, "negligible" effect; 0.035 < R2Δ < 0.070, "moderate" effect; R2Δ > 0.070, "large" effect.
Table 6. Model Comparisons for the 10 MELAB Prompts across Gender

Prompt  Model  -2 Log likelihood  R2      Model comparison  χ2     df  p        R2Δ
18      1      83.61              0.1125  1 vs. 3           0.45   2   0.7968   0.0006
        2      84.06              0.1131  1 vs. 2           0.45   1   0.5004   0.0006
        3      84.06              0.1131  2 vs. 3           0.00   1   0.9868   0.0000
25      1      112.96             0.1102  1 vs. 3           2.36   2   0.3075   0.0023
        2      115.31             0.1125  1 vs. 2           2.35   1   0.1251   0.0023
        3      115.31             0.1125  2 vs. 3           0.01   1   0.9341   0.0000
46      1      150.93             0.0987  1 vs. 3           14.15  2   0.0008*  0.0092
        2      163.62             0.1070  1 vs. 2           12.69  1   0.0004*  0.0083
        3      165.08             0.1079  2 vs. 3           1.46   1   0.2270   0.0009
48      1      105.79             0.1069  1 vs. 3           1.59   2   0.4505   0.0016
        2      106.76             0.1079  1 vs. 2           0.97   1   0.3247   0.0010
        3      107.39             0.1085  2 vs. 3           0.63   1   0.4292   0.0006
49      1      174.48             0.1185  1 vs. 3           11.65  2   0.0030*  0.0079
        2      181.07             0.1229  1 vs. 2           6.59   1   0.0102   0.0044
        3      186.12             0.1264  2 vs. 3           5.06   1   0.0245   0.0035
51      1      76.86              0.1148  1 vs. 3           2.91   2   0.2335   0.0044
        2      79.47              0.1187  1 vs. 2           2.61   1   0.1064   0.0039
        3      79.77              0.1192  2 vs. 3           0.30   1   0.5823   0.0005
52      1      95.69              0.1085  1 vs. 3           4.62   2   0.0994   0.0052
        2      96.54              0.1095  1 vs. 2           0.85   1   0.3564   0.0010
        3      100.31             0.1137  2 vs. 3           3.77   1   0.0523   0.0042
54      1      104.62             0.1024  1 vs. 3           10.64  2   0.0049*  0.0104
        2      109.65             0.1073  1 vs. 2           5.02   1   0.0250   0.0049
        3      115.26             0.1128  2 vs. 3           5.61   1   0.0178   0.0055
55      1      75.29              0.1070  1 vs. 3           1.91   2   0.3840   0.0028
        2      75.97              0.1080  1 vs. 2           0.68   1   0.4108   0.0010
        3      77.20              0.1098  2 vs. 3           1.24   1   0.2659   0.0018
56      1      156.95             0.1336  1 vs. 3           20.68  2   0.0000*  0.0177
        2      176.31             0.1501  1 vs. 2           19.36  1   0.0000*  0.0165
        3      177.63             0.1513  2 vs. 3           1.32   1   0.2507   0.0012
*p < 0.01. Model 1 predictor: ELA; Model 2 predictors: ELA, Gender; Model 3 predictors: ELA, Gender, ELA*Gender.
Note. Model comparison 1 vs. 3: uniform + nonuniform DIF; model comparison 1 vs. 2: uniform DIF only; model comparison 2 vs. 3: nonuniform DIF only. Effect size measure (Jodoin & Gierl, 2001): R2Δ < 0.035, "negligible" effect; 0.035 < R2Δ < 0.070, "moderate" effect; R2Δ > 0.070, "large" effect.

Discussion and Conclusion

The number of tasks that can feasibly be administered in direct assessments of writing, such as essay tests, is usually small because such formats require extended responses and are time consuming to administer and score (Powers & Fowles, 1999). Often only one writing prompt is administered to each examinee, as in the MELAB composition part. Under such circumstances, it is very important to ensure that each prompt is as fair as possible to examinee subgroups.

The primary purpose of this study was to investigate DIF across different language and gender groups on the MELAB essay test, using the logistic regression method. The three-step modeling process based on ordinal logistic regression (Zumbo, 1999) employed in this study proved efficient for investigating simultaneously both uniform and nonuniform group effects related to native language and gender. In the tradition of logistic regression DIF tests, the term DIF is used here as synonymous with the simultaneous test of uniform and
nonuniform DIF, tested with a two-degree-of-freedom chi-square test. The statistical significance tests were supplemented with a measure of the corresponding effect size via R2.

The results showed that although a few prompts were initially flagged by the three-step logistic regression method because of significant uniform and/or nonuniform group effects, their effect sizes were far too small for any of them to be classified as DIF essay items. That is, the essay score differences between the groups compared in this study appear to reflect "item impact" rather than group differences attributable to a construct-irrelevant factor inherent in the writing prompts. A clear distinction is usually made between item impact and DIF in the item bias literature (Clauser & Mazor, 1998; Penfield & Lam, 2000; Zumbo, 1999). Item impact may be present when examinees from different groups have different probabilities of success on an item because the groups actually differ in the ability of interest. In such circumstances, group differences in performance on the item are to be expected because of "true" differences between the groups in the underlying ability measured by the item.

In general, the logistic regression procedures appeared to work well in this study. However, some limitations should be noted. One limitation is that the ELA variable may not be an ideal matching variable. A better matching variable would have been a measure similar to the free-response writing prompts being studied; however, because each examinee responds to a single writing prompt in the MELAB composition part, no such matching variable was available. The use of a multiple-choice-based measure such as the ELA score as a matching variable assumes that examinees with high ELA, as measured by the two parts of the test combined, should also perform well on the essays, and vice versa. An important question is whether similar effect sizes would have been obtained if a more direct measure of writing had been used as the matching variable; future research could address this question. A second limitation concerns the sample sizes used in the present study. In the ordinal logistic regression procedure, accurate parameter estimation depends on an adequate sample size in each score category. The results of this study should therefore be interpreted with caution, and it remains an open question whether similar results would have been obtained with larger samples per group (e.g., n > 200).

In conclusion, although "judgmental" methods that rely on expert judges' opinions can be used to identify potentially biased prompts, such review is a relatively subjective procedure. It is therefore recommended that "statistical" techniques (i.e., DIF procedures) also be routinely implemented to identify differentially functioning items for various comparison groups. Prompt developers can benefit from routinely screening prompts through statistical quality control and then reviewing those that are flagged as extreme.

Acknowledgement

I would like to extend my sincere gratitude to the English Language Institute of the University of Michigan for funding this research project.
References

Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 3–23). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Chang, H., Mazzeo, J., & Roussos, L. A. (1995). Detecting DIF for polytomously scored items: An adaptation of the SIBTEST procedure (ETS Research Report No. 95–5). Princeton, NJ: Educational Testing Service.
Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17, 31–44.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.
Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 137–166). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Dorans, N. J., & Schmitt, A. J. (1991). Constructed response and differential item functioning: A pragmatic approach (ETS Research Report No. 91–47). Princeton, NJ: Educational Testing Service.
English Language Institute, University of Michigan. (2003). MELAB technical manual. Ann Arbor, MI: English Language Institute, University of Michigan.
French, A. W., & Miller, T. R. (1996). Logistic regression and its use in detecting differential item functioning in polytomous items. Journal of Educational Measurement, 33, 315–332.
Hosmer, D. W., & Lemeshow, S. (1989). Applied logistic regression. New York: Wiley.
Jodoin, M. G., & Gierl, M. J. (2001). Evaluating type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14, 329–349.
Lee, Y., Breland, H., & Muraki, E. (2002, April). Comparability of TOEFL CBT essay prompts for different native language groups. Paper presented at the annual conference of the National Council on Measurement in Education, New Orleans, LA.
Miller, T. R., & Spray, J. A. (1993). Logistic discriminant function analysis for DIF identification of polytomously scored items. Journal of Educational Measurement, 30, 107–122.
Muraki, E. (1999). Stepwise analysis of differential item functioning based on multiple group partial credit model. Journal of Educational Measurement, 36, 217–232.
O'Neill, K. A., & McPeek, W. M. (1993). Item and test characteristics that are associated with differential item functioning. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 255–276). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Penfield, R. D., & Lam, T. C. M. (2000). Assessing differential item functioning in performance assessment: Review and recommendations. Educational Measurement: Issues and Practice, 19, 5–15.
Powers, D. E., & Fowles, M. E. (1999). Test-takers' judgments of essay prompts: Perceptions and performance. Educational Assessment, 6, 3–22.
Sheppard, L. A. (1982). Definition of bias. In R. Berk (Ed.), Handbook of methods for detecting test bias (pp. 9–30). Baltimore, MD: Johns Hopkins University.
Spray, J., & Miller, T. (1994). Identifying non-uniform DIF in polytomously scored test items (Research Report No. 93–1). Iowa City, IA: American College Testing Program.
Weigle, S. C. (2000). Test review: The Michigan English language assessment battery (MELAB). Language Testing, 17, 449–455.
Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, Ontario, Canada: Directorate of Human Resources Research and Evaluation, Department of National Defense.
Zwick, R., Donoghue, J. R., & Grima, A. (1993). Assessment of differential item functioning for performance tasks. Journal of Educational Measurement, 30, 233–251.
Appendix A

MELAB Composition Rating Scale

97  Topic is richly and fully developed. Flexible use of a wide range of syntactic (sentence level) structures, accurate morphological (word forms) control. Organization is appropriate and effective, and there is excellent control of connection. There is a wide range of appropriately used vocabulary. Spelling and punctuation appear error free.

93  Topic is fully and complexly developed. Flexible use of a wide range of syntactic structures. Morphological control is nearly always accurate. Organization is well controlled and appropriate to the material, and the writing is well connected. Vocabulary is broad and appropriately used. Spelling and punctuation errors are not distracting.

87  Topic is well developed, with acknowledgement of its complexity. Varied syntactic structures are used with some flexibility, and there is good morphological control. Organization is controlled and generally appropriate to the material, and there are few problems with connection. Vocabulary is broad and usually used appropriately. Spelling and punctuation errors are not distracting.

83  Topic is generally clearly and completely developed, with at least some acknowledgement of its complexity. Both simple and complex syntactic structures are generally adequately used; there is adequate morphological control. Organization is controlled and shows some appropriacy to the material, and connection is usually adequate. Vocabulary use shows some flexibility, and is usually appropriate. Spelling and punctuation errors are sometimes distracting.

77  Topic is developed clearly but not completely and without acknowledging its complexity. Both simple and complex syntactic structures are present; in some "77" essays these are cautiously and accurately used while in others there is more fluency and less accuracy. Morphological control is inconsistent. Organization is generally controlled, while connection is sometimes absent or unsuccessful. Vocabulary is adequate, but may sometimes be inappropriately used. Spelling and punctuation errors are sometimes distracting.

73  Topic development is present, although limited by incompleteness, lack of clarity, or lack of focus. The topic may be treated as though it has only one dimension, or only one point of view is possible. In some "73" essays both simple and complex syntactic structures are present, but with many errors; others have accurate syntax but are very restricted in the range of language attempted. Morphological control is inconsistent. Organization is partially controlled, while connection is often absent or unsuccessful. Vocabulary is sometimes inadequate, and sometimes inappropriately used. Spelling and punctuation errors are sometimes distracting.

67  Topic development is present but restricted, and often incomplete or unclear. Simple syntactic structures dominate, with many errors; complex syntactic structures, if present, are not controlled. Lacks morphological control. Organization, when apparent, is poorly controlled, and little or no connection is apparent. Narrow and simple vocabulary usually approximates meaning but is often inappropriately used. Spelling and punctuation errors are often distracting.

63  Contains little sign of topic development. Simple syntactic structures are present, but with many errors; lacks morphological control. There is little or no organization, and no connection apparent. Narrow and simple vocabulary inhibits communication, and spelling and punctuation errors often cause serious interference.

57  Often extremely short; contains only fragmentary communication about the topic. There is little syntactic or morphological control, and no organization or connection are apparent. Vocabulary is highly restricted and inaccurately used. Spelling is often indecipherable and punctuation is missing or appears random.

53  Extremely short, usually about 40 words or less; communicates nothing, and is often copied directly from the prompt. There is little sign of syntactic or morphological control, and no apparent organization or connection. Vocabulary is extremely restricted and repetitively used. Spelling is often indecipherable and punctuation is missing or appears random.

N.O.T. (Not On Topic) indicates a composition written on a topic completely different from any of those assigned; it does not indicate that a writer has merely digressed from or misinterpreted a topic. N.O.T. compositions often appear prepared and memorized. They are not assigned scores or codes.
Bias Revisited

Liz Hamp-Lyons
University of Hong Kong

Alan Davies
University of Edinburgh
The increasing use of international tests of English proficiency (for example TOEFL, TOEIC, IELTS, and MELAB), indicative of the continuing worldwide spread of the English language, has been condemned on the grounds that such tests are biased or unfair (unfair in the sense that a test favouring boys could be said to be unfair to girls). Commentators on the global spread of English disagree as to which norms should be employed, whether they should be exonormative or endonormative. An International English (IE) view recognises only one norm, that of the educated native speaker of English (acknowledging, of course, that there are somewhat distinct norms for British, American, Australian, Irish, Canadian, New Zealand, and South African varieties). A strong World Englishes (WEs) view maintains that to impose IE on users of WEs may be discriminatory against non-native English speakers, arguing that local standards are already in place (for example, in India and Singapore). Davies, Hamp-Lyons, and Kemp (2003) raise three bias-related questions in relation to the IE/WEs question: (1) how possible is it to distinguish between an error and a token of a new type? (2) if we could establish bias, how much would it really matter? and (3) does an international English test privilege those with a metropolitan Anglophone education? In this report we describe an empirical investigation, funded by a Spaan Fellowship, in which we compare ratings of written essays forming part of the MELAB test in order to investigate the above questions. Six groups of essays (each N = 10), each group belonging to a different first language (L1), were provided, already MELAB scored, by the University of Michigan Testing and Certification Division. The researchers then identified six pairs of raters, each pair having the same L1 as one of the groups under test. Each pair of raters rated 30 essays (their own and two other language sets). Our preliminary results are ambiguous with regard to bias, and we have identified the need to refine, replicate, and expand our study.
Holders of both the International English (IE) view and the World Englishes (WEs) view accept the existence of variation within English but disagree on the role and status of language norms (Bartsch, 1988; Davies, 1999). The IE view is strengthened by the strict view of norm acquisition, viz. that it needs a large enough body of native speakers to take on its responsibility (Davies, 2003; but see Graddol, 1997; 2005). The WEs view refers to a belief in and a respect for multiple varieties of English. Each of these varieties takes and adapts some parent form of English into a stable dialect that is
97
not only "correct" in its home milieu, but may for many of its users be the only form of the spoken language they hear (Kachru, 1992). The core value of World Englishes is that "the English language now belongs to all those who use it" (Kachru, 1988, p. 1). In the WEs view, there is no single "right" English, no such thing as one "native speaker norm." The WEs view is strengthened by the empirical study of language acquisition, which shows that normal language development is "haphazard and largely below the consciousness of speakers" (Hudson, 1996, p. 32), so that norm acquisition occurs only where the dominant influence on speakers is of the "normal" form. The WEs position springs from a world view in which native speakers of a highly dominant language, such as English, have the responsibility to give serious attention to the harm (linguistic, cultural, social, and economic) its spread may do. However, many EFL learners have a real practical need for a test to provide international recognition of their English proficiency, for example for certification, university entry, employment, or immigration. The strong postcolonial view of the role and standards of English raises difficulties for the assessment of English language proficiency in the "expanding circle" of Englishes, because no one local standard can replace an international standard in an international test.

Although the authors of this report come to the debate on the role of IE and WEs from different perspectives, they agree that both IE and WEs have important roles, often quite separate ones. They further agree that it would make little sense to argue for the use of a local WE in an international test, and they share the view that in a local test-use situation a local WE standard is appropriate if that is what local stakeholders want. This research is concerned solely with internationally used English proficiency testing in high-stakes contexts, for example TOEFL, TOEIC, and IELTS (Criper & Davies, 1988; Spolsky, 1993; Clapham, 1996). Although there are several important issues to be debated, in particular whose norms are to be imposed in the test materials and specifications, and what the consequences are for test takers if the norm imposed by the test is not the variety (with its own norms) they have acquired in their own society, this research concentrates only on an empirical search for evidence that language performances will be scored differently by raters from different backgrounds.

Many other valid questions might also be asked; for example, Lowenberg (1993) carried out an analysis of the TOEIC test focusing on the possibility of bias in the test materials themselves. He concludes:

. . . the brief analysis presented in this paper is sufficient to call into question the validity of certain features of English posited as being globally normative in tests of English as an international language, such as TOEIC, and even more in the preparation of materials that have developed around these tests. Granted, only a relatively small proportion of the questions on the actual tests deal with these nativized features: most test items reflect the "common core" of norms which comprise Standard English in all non-native and native speaker varieties. But given the importance usually attributed to numerical scores in the assessment of language proficiency, only two or three items of questionable validity on a form could jeopardize the ranking of candidates in a competitive test administration. (p. 104)

Lowenberg challenges "the assumption held by many who design such English proficiency tests . . . that native speakers still should determine the norms for Standard
English around the world" (p. 104). Most recently he has followed up his earlier work with an analysis of newspaper style sheets, government documents and ESL textbooks in Malaysia, Singapore, Brunei, and the Philippines and found that these diverge from native speaker varieties at all levels, from the morphosyntactic and lexical to pragmatic and discoursal conventions (Lowenberg, 2002).

One way to avoid using the global norms to which Lowenberg objects is to investigate to what extent local norms are appropriate both locally and beyond the local, and then use this information in test development. Such an investigation is reported by Hill (1996) and Brown and Lumley (1998), both referring to the development of an English proficiency test for Indonesian teachers of English. Hill comments:

. . . the majority of English learners will use English to communicate with other non-native speakers within South-East Asia. For this reason it was decided the test should emphasize the ability to communicate effectively in English as it is used in the region, rather than relate proficiency to the norms of America, Britain or Australia . . . this approach also aims to recognize the Indonesian variety of English both as an appropriate model to be provided by teachers and as a valid target for learners. (1996, p. 323)

Brown and Lumley (1998, p. 94) describe the aims of this test:
• the judicious selection of tasks relevant to teachers of English in Indonesia;
• the selection of culturally appropriate content;
• an emphasis on assessing test takers in relation to local norms;
• the use of local raters, that is non-native speakers of English (whose proficiency was nevertheless of a high standard);
and they claim that these aims were all fulfilled. However, no bias research was carried out.

Prequel to this Study

In 2002 we carried out a research project in Hong Kong, where we were both based at the time. We began with the hypothesis that international English tests are biased: by that we mean that they systematically misrepresent the "true scores" of candidates by requiring facility in a variety of English to which whole groups of candidates have not been exposed. Bias, therefore, is not about difference as such but about unfair difference. The argument about bias on international English tests is that these tests represent the old colonial Standard English of the United Kingdom, the United States, etc., a kind of English that is not known, or only partly known, to many of those who have learned English as an additional language, in particular those living in one of the so-called New English societies that have adopted a local or locally emerging variety of English (societies such as Singapore, Malaysia, and India). In other words, the argument is between IE and WEs.

We intended to undertake an empirical study, but in the event this was not possible and in its place we held an invited seminar in Hong Kong, described in Davies, Hamp-Lyons, and Kemp (2003), with representatives from Singapore, China, India, and Malaysia. The purpose of the seminar was to compare local tests of English used in those four countries with international tests of English through close textual analysis. We
concluded that what is at issue in comparing international and local tests of English proficiency is which standard is under test. The question then becomes: does a WEs variety "own" (in the sense of accept) a standard of its own which it appeals to in a local test? If not, then the assumption is that speakers of this WE will be required to operate in the testing situation in the IE standard (assuming that society accepts a single IE norm). That is precisely the point made strongly in the Hong Kong seminar by Lukmani (India). But are such WEs speakers being discriminated against in being required to do this? That is an empirical question, and it forms part of the present study. The Hong Kong study concluded with three questions, and we continued to debate these issues back and forth between us: (1) how possible is it to distinguish between an error and a token of a new type? (2) if we could establish bias, how much would it really matter? and (3) does an international English test privilege those with a metropolitan Anglophone education? We agreed that inherent to the whole debate are questions of beliefs and judgements; therefore, in a further study it would be appropriate to collect judgemental data that might illuminate bias if it exists, and perhaps indicate where bias might, or might not, lie.

The Current Study

With funding from a Spaan Fellowship (University of Michigan English Language Institute) we have conducted an empirical investigation examining a range of essays written by writers from six different language backgrounds, drawn from the MELAB (Michigan English Language Assessment Battery) database. As the raw data for the study, we gathered ratings of these essays by (a) two native speakers of the writer's first language; (b) two pairs of raters from non-native speaker backgrounds other than the writer's first language; and (c) original score data from MELAB raters. By examining writing tests, specifically the judgements made of a range of writers' performances by different categories of raters, this work follows the lead of Hamp-Lyons (1986) and Hamp-Lyons and Zhang (2001), but looks at quantitative data rather than qualitative judgements about text characteristics.

Data and Results

The data in the present study consist of a set of 10 essays from test takers from each of the following language backgrounds: Arabic, Bahasa Indonesia/Malay, Japanese, Mandarin Chinese, Tamil, and Yoruba. All essays were written on the same or a very similar topic from the MELAB pool. Each essay received the same or closely similar scores from two MELAB raters who were native speakers of standard American English. We used the average of the two raters' scores as the dependent variable in this dataset, and within each language set we obtained a range of MELAB score levels, not including the very lowest levels. In MELAB scoring, raters who are native speakers of Standard (American) English use a 10-point scale with the following score labels: 53 (Low), 57, 63, 67, 73, 77, 83, 87, 93, 97 (High). When both readers give the same score, that is the score the essay receives. When the readers are one point apart (e.g., 57 and 63), the essay receives the average of the two scores (60). In our dataset, all essays fell into these two categories, so none were read more than twice. The assumption of this study is
that if there is, for any group, significant deviation from the MELAB scores, this could be an indication of bias and therefore worth investigating textually. Our definition of "deviation" was pre-set at p < 0.05. The essay sets were then rated by pairs of raters: each pair shared the same language background as one of the six sets of candidates; thus there were two native-speaking Japanese raters, two native-speaking Bahasa Indonesian/Malay raters, and so on. We hypothesized that if there were bias, it would be reflected in a significant difference between our native-speaking raters' ratings and the MELAB scores. We calculated correlations between the MELAB ratings for the Japanese writers' essays and the Japanese raters' ratings of their native group's essays, and repeated this analysis for the Bahasa set, and so on. We were also interested in a further comparison: whether or not there are any measurable differences across raters from different (IE and WEs) backgrounds that remain after internal inconsistencies in raters' responses are removed. If we found any such patterns in the statistical analyses, we looked closely at the essays to try to identify consistent reasons (such as background-related reasons, rater bias factors, etc.).

Our reasoning for selecting these language backgrounds was more cultural and social than linguistic. We considered establishing a language-distance scale (Davies & Elder, 1997), the assumption being that languages closer linguistically to English (German, for example, would be very close) were more likely to accept IE norms, while those closer culturally and socially because of a shared imperial and colonial past (for example India) were more likely to reject those norms. However, apart from the difficulty of establishing a language-distance scale, given the complexity of the variables involved and their interaction, we considered that language distance was less likely to be a critical factor than cultural "distance." In other words, the societies within which English is widely used are likely to be more influenced culturally, socially, and politically by IE societies and are more likely to reject an International English standard norm. (This is their response to what they perceive to be the hegemonic continuation of colonialism and now of globalization.) We recognize that this is very much an embryonic hypothesis, and so we expect, and welcome, discussion on this point. However, with that in mind and with all the caveats, we established the following tentative scale for +/- English (see Table 1):
Table 1. Hypothesized Scale of Language/Cultural "Distance"

+ English    No Clear Basis    - English
Tamil        Arabic            Chinese
Yoruba       Bahasa            Japanese
Our reasoning was as follows: Tamil (India and Sri Lanka) and Yoruba (Nigeria) are placed closer to English culturally and socially because of their former colonial status. Bahasa and Arabic are placed together in the middle ground on the basis that while the Bahasa essays were in Bahasa Malaysia, their raters were Indonesian users of Bahasa; and Egypt, with Egyptian Arabic raters, has a mixed history of accommodation with the anglophone West. None of these rater sets, we concluded, was free of potentially confounding influence. Finally, China (Mandarin Chinese) is the furthest away
traditionally from English influence, neither colonized nor, until recently, directly connected. Japanese is a difficult case: Japanese is linguistically far from English, and in many ways Japanese culture is inimical to Western ways; however, there are also spheres of very strong cultural influence from the United States on Japan, and both Japanese raters were attached to one of the English-medium Tokyo universities and may well have received part of their education in the United States. One of the apparently inescapable confounds in our attempts to research these questions is the tight interweaving of multiple and sometimes contradictory influences on the use of and attitudes toward English norms in each country.

Ratings

As explained above, MELAB average scores fall in a range between 53 and 97, a 19-point scale. For simplicity, we asked our untrained raters to use a simplified form of the TOEFL writing scale, a 6-point scale which, when averaged across two raters, has 9 points between 2 and 6. We therefore needed to make a judgmental adjustment between the scales in order to establish the cut points for "equivalence." (In retrospect, we would have had our set of essays rescored by MELAB raters using the same simple scale as our other raters.) We had three sets of hypotheses:

1. An IE hypothesis: that the scores for all subsets of essays would correlate with the MELAB raters' scores at the same or very similar levels (i.e., there would be no significant differences between any of the pairs of raters and the MELAB scores). To answer this question, we proceeded as follows. For each subset there were two raters: thus there were two Japanese raters, two Yoruba raters, and so on. The two ratings in each subset were summed; thus, if Candidate A in the Japanese subset had ratings of 6 and 5 from the Japanese raters, his/her rating for the purpose of correlation was taken to be 11. Product-moment correlations were then calculated between the summed ratings for each subset (excluding the Arabic set, where there was missing data), using only ratings by the two L1 raters for each subset, and their MELAB scores.

2. A weak WEs hypothesis: that some pairs of ratings would be less closely correlated with MELAB scores than others, and that this difference would not be explainable by interrater unreliability. According to this hypothesis, those at the +English end of the scale would be least likely to agree with the MELAB scores, on the grounds that those most dominated culturally or economically would be most likely to reject exonormative norms, and vice versa at the -English end of the scale.

3. A strong WEs hypothesis: that there would be greater differences between the scores of a pair of raters on their own language background essays and those of any other pair of raters on the same essays than between the same rater pair and the MELAB ratings (that is, that the cultural/linguistic distance is greater between two nonstandard varieties than between any one of them and a standard variety).

Findings

Before we could answer any of our questions, we needed to know how reliable the averaged scores from each pair of raters were. Recent estimates of the interrater r
(Pearson with the Spearman-Brown prophecy formula) of the MELAB averaged scores, from all essay ratings in 2004, was 0.84 (Johnson, 2005). It can be seen that the scores of our two Tamil raters were so divergent that there is no consistent relationship, and therefore any statements we might make based on them would be dubious at best (see Table 2). The Japanese and Chinese ratings are the most stable, so we can be most confident about our findings on hypotheses 2 and 3 for these languages; however, with so few data points we proceed only with caution.
Table 2. Interrater Reliability

Language    Reliability
Arabic      one rater did not return scored essays
Bahasa      .446*
Chinese     .733**
Japanese    .747**
Tamil       .270
Yoruba      .498*
* p < .01, ** p < .005
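The reliability and correlation figures discussed in this section can be illustrated with the short sketch below (Python with SciPy). The function and variable names are illustrative assumptions, and applying the Spearman-Brown correction to our own rater pairs is shown only as an example; the text above reports the Spearman-Brown adjustment only for the MELAB estimate.

```python
import numpy as np
from scipy import stats

def spearman_brown(r, k=2):
    """Spearman-Brown prophecy formula: reliability of the average of k
    parallel ratings, given the correlation r between single ratings."""
    return k * r / (1.0 + (k - 1) * r)

def rater_pair_analysis(rater1, rater2, melab):
    """rater1, rater2: ratings by the two same-L1 raters for one essay set;
    melab: the original MELAB scores for the same essays."""
    r1 = np.asarray(rater1, dtype=float)
    r2 = np.asarray(rater2, dtype=float)
    m = np.asarray(melab, dtype=float)

    inter_r, _ = stats.pearsonr(r1, r2)        # single-rating agreement
    pair_rel = spearman_brown(inter_r)         # reliability of the averaged pair
    summed = r1 + r2                           # summed (equivalently averaged) ratings
    r_vs_melab, p = stats.pearsonr(summed, m)  # hypothesis-1 style comparison
    return {"interrater_r": inter_r, "pair_reliability": pair_rel,
            "r_vs_MELAB": r_vs_melab, "p": p}
```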
Hypothesis 1 is an IE hypothesis: that the scores for all subsets of essays would correlate with the MELAB raters’ scores at the same or very similar levels. What do we find when we compare in-country raters rating their own L1 candidates and the MELAB scores for the same candidates? Simple correlations are shown in Table 3.
Table 3. Matched S/R In-Country Data vs. MELAB Ratings

+ English              No Clear Basis         - English
language    r          language    r          language    r
Tamil       .704*#     Bahasa      .839*      Chinese     .506
Yoruba      .307       Arabic      --         Japanese    .708*
* p < .05, # Tamil raters were not interreliable
Table 3 shows that the scores for the various subsets of essays do not correlate with the MELAB raters' scores at the same or very similar levels, and thus the IE hypothesis is not supported.

Hypothesis 2 is a weak WEs hypothesis: that some pairs of ratings would be less closely correlated with MELAB scores than others, and that this difference would not be explainable by interrater unreliability. We must begin by excluding the Tamil data from this analysis because of the unreliability of the Tamil ratings; in addition, one Arabic rater did not return the scores. This leaves us with four sets of observations, as shown in Table 4.
Table 4. MELAB by L1 Ratings on each Essay Set

Comparison                                            r
Japanese raters of Japanese essays vs. MELAB raters   .501 (p = .007)
Bahasa raters of Bahasa essays vs. MELAB raters       .771 (p = .009)
Chinese raters of Chinese essays vs. MELAB raters     .682 (p = .030)
Yoruba raters of Yoruba essays vs. MELAB raters       .589 (p = .073)
We see that Japanese and Bahasa are strongly related to the MELAB scores; these happen also to be the most interrater-reliable scores. Chinese is weakly but significantly related to the MELAB scores, while the Yoruba scores have no statistically significant relationship. The weak WEs hypothesis appears to be upheld. However, we must treat this result with skepticism because of the very small datasets and the many intervening variables we have already referred to.

Turning to hypothesis 3 (the strong WEs position), that the cultural/linguistic distance is greater between two nonstandard varieties than between any one of them and a standard variety: to explore this we needed at least some model of distance between the languages/cultures represented by our data sets. In setting up the matchings for rater-pair assignments to essay/language sets, Yoruba was paired with Tamil because of their shared colonial history. Arabic was hypothesized to be closer to Bahasa than to the other languages because of the Islamic/Koranic influence in both cases; Chinese and Japanese were seen as close because of the shared character set and the high frequency of words and concepts in common, although this pairing is questionable because the cultural distance between each of these languages and (North American) cultural values and knowledge differs considerably. In the third set of analyses, each language set is compared with the language it is closest to and one other, except in the case of Yoruba, where, because our view shifted during the study, we have ended up with it paired against both languages in the mid-point of the scale; with a little more time we will be able to obtain scores on the Yoruba essays from the Chinese rater pair, in order to complete the balance of the data.
Table 5. Rough Hypothesis for Linguistic/Cultural "Distance" of each Language from a Standard American English (IE) Norm

Language    Closer language    Farther language
Bahasa      Arabic             Tamil
Chinese     Japanese           Yoruba
Japanese    Chinese            Yoruba
Tamil       Yoruba             Japanese
Yoruba      Arabic             Bahasa

Language-by-Language Discussion

Bahasa

In Table 6 we see that Bahasa raters (although from a different geopolitical region than the writers) agree very closely with the Yoruba raters and the MELAB raters, but not with the Tamil raters. Because we have already noted the unreliability of the Tamil
ratings, we should disregard this result; the other two are too similar for the strong WEs hypothesis to be upheld in this case.
Table 6. Bahasa Rater Correlations

Comparison                                          r
Bahasa raters of Bahasa essays vs. MELAB raters     .771 (p = .009)
Bahasa raters of Bahasa essays vs. Tamil raters     .362 (p = .304)
Bahasa raters of Bahasa essays vs. African raters   .901 (p = .000)
Chinese
As seen in Table 7, all the correlations are significant for the Chinese essays: the strong WEs hypothesis cannot be upheld in this case.
Table 7. Chinese Rater Correlations

Comparison                                            r
Chinese raters of Chinese essays vs. MELAB raters     .682 (p = .031)
Chinese raters of Chinese essays vs. Japanese raters  .692 (p = .027)
Chinese raters of Chinese essays vs. African raters   .643 (p = .045)
Japanese
Excluding the unreliable Tamil comparison, we see that although the correlation between the L1 Japanese raters and the MELAB raters was significant, the correlation between the Japanese raters and the “third language” (Chinese) pair was not (Table 8). This finding appears to uphold the strong WEs hypothesis.
Table 8. Japanese Rater Correlations

Comparison                                             r
Japanese raters of Japanese essays vs. MELAB raters    .501 (p = .007)
Japanese raters of Japanese essays vs. Chinese raters  .180 (p = .140)
Japanese raters of Japanese essays vs. Tamil raters    .688 (p = .720)
Yoruba
There are no significant correlations in this set (Table 9), although the Yoruba raters and the MELAB raters come closest; we would be stretching things too far to claim evidence here for the strong WEs hypothesis.
Table 9. Yoruba Rater Correlations

Comparison                                           r
Yoruba raters of Yoruba essays vs. MELAB raters      .589 (p = .073)
Yoruba raters of Yoruba essays vs. Japanese raters   .060 (p = .027)
Yoruba raters of Yoruba essays vs. Tamil raters      .207 (p = .045)
We are left with only one data point that appears to support the strong WEs hypothesis; however, this occurs in the case of Japanese, which is one of the most ambiguous language/culture contexts to pin down in terms of “distance” from an IE (Standard American English) norm.

Discussion
The results we have reported are weak and conform only patchily, at best, to our hypotheses. Obviously, our data set is much too small to come to any conclusions. The intervening variables, obscuring our view of what is there, are many and incestuous: the sample; the uncertainty about candidates’ L1 (not all “Yoruba” candidates were in fact Yoruba L1 speakers); the lack of fit between the raters and how far they shared their “compatriots’” L1; the varying degrees of “naiveté” of raters concerning (a) the role of English in their own culture and (b) the theory of bias; the lack of training of raters, and the worry that if they were trained they would become ciphers of the IE we want them to problematize; our failure to use just one rating scale; and so on.

However, this pilot (or even pre-pilot) is, we suggest, worth extending. What we would hope to do is to limit our sets of essays to perhaps four L1s: two of the Tamil and Yoruba type, and two of the Chinese type. Each set should have an N of 50+, and there should be two sets of raters, four in each set, one set +NS, the other –NS (the –NS raters again sharing an L1 with their essay set). That would mean a single group of four +NS raters and, across the four essay sets, a group of 16 –NS raters. All raters should use the same scale. Half in each set should be trained. We should also be very explicit about the demographic profile that raters need.

Both issues we have confronted, WEs and bias, are fugitive. Nevertheless, their pursuit through the analysis of test data does afford the possibility of coming nearer to our quarries. Bias may, on the basis of our study, be “not proven,” but it cannot be dismissed.

Acknowledgements
We would like to thank our research assistant, Aishah binte Jantan, who worked with all our far-flung raters and handled all our data, and Jeff Johnson of the University of Michigan ELI, who identified the data sets and supported our research.

References
Bartsch, R. (1988). Norms of language: Theoretical and practical aspects. London: Longman.
Brown, A., & Lumley, T. (1998). Linguistic and cultural norms in language testing: A case study. Melbourne Papers in Language Testing, 7(1), 80–96.
Clapham, C. (1996). The development of IELTS: A study of the effect of background knowledge on reading comprehension. Cambridge, U.K.: University of Cambridge Local Examinations Syndicate and Cambridge University Press (Studies in Language Testing, Volume 4).
Criper, C., & Davies, A. (1988). ELTS validation project report (Research Report 1/1). London and Cambridge: The British Council and University of Cambridge Local Examinations Syndicate.
Davies, A. (1999). An introduction to applied linguistics: From practice to theory. Edinburgh, U.K.: Edinburgh University Press.
Davies, A. (2003). The native speaker: Myth and reality. Clevedon: Multilingual Matters.
Davies, A., & Elder, C. (1997). Language distance as a factor in the acquisition of literacy in English as a second language. In P. McKay (Ed.), The bilingual interface project (pp. 93–108). Canberra: The Commonwealth of Australia.
Davies, A., Hamp-Lyons, L., & Kemp, C. (2003). Whose norms? International proficiency tests in English. World Englishes, 22(4), 571–584.
Elder, C., & Davies, A. (1998). Performance in ESL examinations: Is there a language distance effect? Language and Education, 11, 1–17.
Graddol, D. (1997). The future of English. London: British Council.
Graddol, D. (2005, May). The future of English and its assessment. Paper presented at the Association of Language Testers in Europe Conference, Berlin.
Hamp-Lyons, L. (1986). Writing in a foreign language and rhetorical transfer: Influences on evaluators’ ratings. In P. Meara (Ed.), British Series in Applied Linguistics 1: Selected Papers from the 1985 Annual Meeting (pp. 72–84). London: CILT.
Hamp-Lyons, L., & Zhang, W-X. (2001). World Englishes: Issues in and from academic writing assessment. In J. Flowerdew & M. Peacock (Eds.), English for academic purposes: Research perspectives (pp. 101–116). Cambridge: Cambridge University Press.
Hill, K. (1996). Who should be the judge? The use of non-native speakers as raters on a test of English as an international language. Melbourne Papers in Language Testing, 5(2), 29–50.
Hudson, R. (1996). Sociolinguistics (2nd ed.). Cambridge, U.K.: Cambridge University Press.
Johnson, J. S. (2005). MELAB 2004: Descriptive statistics and reliability estimates (Research Report 2005-03). Ann Arbor, MI: English Language Institute, University of Michigan.
Kachru, B. B. (1988). Teaching world Englishes. ERIC/CLL News Bulletin, 12(1), 1, 3–4, 8.
Kachru, B. (Ed.). (1992). The other tongue: English across cultures. Urbana: University of Illinois Press.
Lowenberg, P. (1993). Issues of validity in tests of English as a world language: Whose standards? World Englishes, 12(1), 95–106.
Lowenberg, P. (2002, December). Southeast Asian norms: Implications for assessing proficiency in English as a global language. Paper presented at the Association Internationale de Linguistique Appliquée Conference, Singapore.
Nunan, D. (2003). The impact of English as a global language on educational policies and practices in the Asia-Pacific region. TESOL Quarterly, 37(4), 589–613.
Spolsky, B. (1993). Testing across cultures: An historical perspective. World Englishes, 12(1), 87–93.
i. Editor’s note: Not all MELAB essay raters are native speakers of English, but the raters of the essays used in this study were.