An exploratory study into the construct validity of a reading comprehension test: triangulation of data sources

Neil J. Anderson, Ohio University; Lyle Bachman, University of California, Los Angeles; Kyle Perkins, Southern Illinois University at Carbondale; and Andrew Cohen, Hebrew University of Jerusalem

Recent research in reading comprehension has focused on the processes of reading, while recent thinking in language testing has recognized the importance of gathering information on test-taking processes as part of construct validation. And while there is a growing body of research on test-taking strategies in language testing, as well as research into the relationship between item content and item performance, no research to date has attempted to examine the relationships among all three: test-taking strategies, item content and item performance. This study thus serves as a methodological exploration in the use of information from both think-aloud protocols and more commonly used types of information on test content and test performance in the investigation of construct validity.
I Introduction
Recent studies into reading have moved away from a focus on product to investigating the reading process in order to better define the construct of reading comprehension (Farr, Pritchard, and Smitten, 1990; Pritchard, 1990). But while models of reading have evolved, changing our thinking about how the printed word is understood, the tests that we use to measure that understanding have not changed significantly. It would thus appear that an examination of the construct validity of current reading tests, vis-à-vis current reading theories, is in order. Traditional approaches to construct validation would appear to be inadequate for this purpose, in that they largely ignore the processes that test-takers employ in taking tests, focusing on the content of the tests themselves and the products - item or test scores and the relationships among these - of whatever processes may be involved in test-taking. However, recent thinking in educational measurement (e.g., Duran, 1989; Farr, Pritchard and Smitten, 1990; Messick, 1989) and in language testing (Bachman, 1990; Grotjahn, 1986) has begun to recognize the necessity of including, in the investigation of construct validity, information about how test
takers go about processing test tasks, and of relating this information to information on test content and test performance. The purpose of this paper is to present the results of an exploratory study that examined three types of information in the investigation into the construct validity of a reading comprehension test. Specifically, this study was designed to examine the processes that second language learners use while taking a reading comprehension test and to relate that information to both the content of reading comprehension test items and to test-takers' performance on those items. This research was part of a larger study which examined the use of reading strategies by language learners in two reading contexts: taking reading comprehension tests and reading for academic purposes (Anderson,
1989). During the past 70 years, educators have used standardized reading tests to measure comprehension (Segel, 1986). A number of techniques, such as open-ended questions, cloze, true/false, sentence completion, summary and multiple-choice have been developed to assess reading comprehension (Cohen, 1980). The most common of these testing methods used in standardized reading tests is the multiple-choice format. This format has frequently been criticized because the correct answer can be reached in more than one way and can often be identified ’without actually understanding the text and without any judgemental activity in selecting the correct response’ (Klein-Braley, cited in Nevo, 1989). The process of reaching the correct answer on a reading comprehension test thus may not reflect the processes involved in actual reading contexts. Since reading comprehension tests were first developed, they have not varied significantly in their format. During this same period of time, researchers have proposed and tested a variety of models of how the printed word is understood. These studies have moved away from a focus on the product of reading to a focus on the reading process in order to better define the construct of reading comprehension. Gordon (1987) points out that investigations of the reading process frequently use reading comprehension tests as the research tool. Learning and success in reading are often equated with a high score on a test
(Fransson, 1984).
Discussions of test validation generally include some reference to construct validation, the process of determining the relationship between test performance and the construct, or ability, it is intended to measure. Hughes (1989: 27) indicates that 'construct validation is a research activity, the means by which theories are put to the test and are confirmed, modified, or abandoned. It is through construct validation that language testing can be put on a sounder, more scientific footing'. But while the importance of construct validation is now
widely recognized by language testers, its formulation as a unitary concept, encompassing the three 'traditional' types of validity (content, concurrent and predictive), is just beginning to find its way into the language testing literature (Bachman, 1990). The notion of construct validity as a unitary concept has been developed largely by Messick (1975, 1981, 1989), and is now widely accepted in the field of measurement (American Psychological Association, 1985). Consistent with this is the view of construct validation as the ongoing process of gathering different types of evidence in support of a given interpretation of a test score. Traditional types of evidence include information on content relevance, concurrent relatedness and predictive utility. More recently, psychometricians have recognized the importance of gathering information on test-taking processes as part of construct validation. Thus Messick, reviewing the research on the use of problem-solving strategies in tests, reports that 'different individuals performed the same task in different ways and ... the same individual might perform in a different manner across items or on different occasions' (Messick, 1989: 54). Duran (1989) cites cross-cultural research indicating that differences in cultural background affect the ways in which individuals perceive problem-solving situations. He suggests that these differences, as well as differences in problem-solving styles, may affect the performance of linguistic minorities on language tests. Douglas and Selinker (1985: 218) have pointed to discourse domains within interlanguage as a source of individual differences in test performance, arguing that 'the validity of a particular test as a test will be limited by the extent to which it engages discourse domains which learners have already created in their [interlanguage] use'. In light of considerations such as these, Messick offers a rather dim prospect for construct validation:

test scores may mean different things for different people [and] this being the case, the notion that a test score reflects a single uniform construct interpretation becomes illusory. Indeed, that a test's construct interpretation might need to vary from one type of person to another is a major conundrum in educational and psychological measurement. (1989: 55)
Verbal reports or think-aloud protocols have been used in many research studies as a method of getting at the mental processes that language learners use (Afflerbach and Johnston, 1984; Block, 1986; Cohen, 1986, 1987a, 1987b, 1987c, 1987d, in press; Cohen and Hosenfeld, 1981; Ericsson and Simon, 1984; Faerch and Kasper, 1987; Farr, Pritchard and Smitten, 1990; Garner, 1982, 1988; Nevo, 1989; Pritchard, 1990; Sarig, 1987). A think-aloud protocol is produced when a reader verbalizes his or her thought processes while
completing a given task. Strategies are deliberate, cognitive actions that can be accessed for a conscious report (Paris, Lipson and Wixson, 1983). With respect to the construct of reading comprehension it is known that 'direct assessment of the ... trait is impossible since it is a mental operation which is unobservable' (Gordon, 1987: 5). Using think-aloud protocols is a way of getting at the unobservable behaviour of reading comprehension. Cohen (1986, 1987a) and Alderson (1984) issued a call for more research in the area of second language acquisition that uses think-aloud protocols as a method of tapping the mental processes that L2 learners use. Alderson, critiquing forms of comprehension testing, indicated:

Such information (test data collected through multiple choice or cloze tests) provides no insight into how the reader has arrived at his interpretation, be it at a level of detail, main idea, inferred meaning, or evaluation judgement. ... The think-aloud technique is one such possibility. (1984: 21-22)
Recently, language testing researchers have begun to investigate test-taking processes, or strategies. Cohen, for example, has reported on L2 learner strategies employed in cloze and multiple-choice tests (Cohen, 1984) and in summarization tasks in foreign language reading (Cohen, forthcoming), while Alderson (no date) has investigated self-reported strategies in reading comprehension tests. Cohen has also reported on L2 learner strategies for dealing with teacher feedback on their compositions (1987c). Grotjahn (1986) has extended this line of research to include the logical task analysis of test items. Nevo’s (1989) research compared test-taking strategies used on multiple-choice tests in L1 and L2. Anderson’s (1989) research compared L2 readers’ strategies used during reading comprehension tests with those used in academic reading situations. Pritchard (1990) utilized think-aloud protocols to examine the strategies that readers use while reading culturally familiar and unfamiliar materials. His results indicate that cultural schemata influence the processing strategies readers used and the level of comprehension they achieved. Concern for content relevance and coverage (content validity) is certainly not new to language testing. Attempts to systematically investigate the relationship between test content and test performance, however, are relatively recent, and the findings are not entirely consistent. Alderson (1988, 1990) and colleagues (Alderson and Lukmani, 1989; Alderson, Henning and Lukmani, 1987) found virtually no relationship between what ’experts’ judged language test items to be measuring and characteristics (difficulty, discrimination) of items. Perkins and Linville (1987), on the other hand, found that word length, frequency, abstractness, and the semantic feature
'evaluation' accounted for the majority of the variance in performance on EFL vocabulary test items. Bachman, Davidson, Lynch, and Ryan (1989) found that 68% of the variance in item difficulties on the TOEFL reading comprehension subtest could be accounted for by test content facets related to grammar and to the academic and topical content of reading items, and that 60% of the variance in item discrimination indices could be accounted for by facets related to topic and cohesive devices. At the same time, the majority of their facets were not related to either item difficulty or discrimination.

In summary, recent research in reading comprehension has focused on the processes of reading, while recent thinking in language testing has recognized the importance of gathering information on test-taking processes as part of construct validation. And while there is a growing body of research on test-taking strategies in language testing, as well as research into the relationship between item content and item performance, no research to date has attempted to examine the relationships among all three: test-taking strategies, item content and item performance. This study thus serves as a methodological exploration in the use of information from think-aloud protocols and more commonly used types of information on test content and test performance in the investigation of construct validity.

II Method
1 Participants

The participants for this research were selected from the population of Spanish-speaking students enrolled in a university-level intensive English as a second language (ESL) programme in the southwestern United States. Of the 65 Spanish-speaking students enrolled in the programme, 28 completed all phases of the research. Eighteen of the participants were male and 10 were female. They came from various countries of Central and South America. The amount of time that the participants had studied in the United States ranged from nine weeks to nine months. Their ages ranged from 18 to 34 years. Of the 28 participants involved in this study, nine were classified as beginning level students, 10 as intermediate level and nine as advanced level, according to the placement test administered by the intensive ESL programme.

2 Materials
Material for the study consisted of the Descriptive Test of Language Skills - Reading Comprehension Test (DTLS), Forms A and B
(Educational Testing Service, 1977), which is a standardized reading comprehension test consisting of 15 reading passages varying in length from 44 to 135 words, each followed by two to four multiple-choice comprehension questions, for a total of 45 questions. The test takes 30 minutes to administer. The passages are written in a variety of styles about a variety of topics. The comprehension questions have been grouped into clusters according to the type of reading skill that is being measured: understanding main ideas, understanding direct statements, or drawing inferences about the material in each of the short passages. The authors indicate that the selection of the test questions was carefully 'designed to insure that questions that tested different kinds of skills within each area were appropriately represented in the test' (Educational Testing Service, 1985: 44). (See Appendix A for an example passage with examples of each question type.) Information available from the publisher indicates that the tests are designed for differentiating among lower-reading-proficiency students. The test forms were normed on student populations throughout the United States attending two- and four-year colleges and are similar to other standardized reading tests. The test is used at many colleges in the United States as a screening and placement test for native and non-native English-speaking students in remedial and ESL reading programmes (Block, 1986; Segel, 1986). The Kuder-Richardson Formula 20 reliability coefficient for Form A of the DTLS - Reading Comprehension section is .89 (Educational Testing Service, 1985: 43). Reliability information is not available for Form B of the test. The publisher also provides data from studies designed to examine the content and criterion validity of the DTLS Reading Comprehension test (Educational Testing Service, 1985: 44-45). Criterion validity was established through an examination of 'the relationship of DTLS scores to other measures of language skills: end-of-term grades in writing courses and scores on essays written especially for the study' (Educational Testing Service, 1985: 44).
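The Kuder-Richardson Formula 20 coefficient cited above is a standard internal-consistency reliability index. For reference, its textbook definition (not quoted from the ETS guide or from this article) is

$$\mathrm{KR20} \;=\; \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i\, q_i}{\sigma_X^{2}}\right)$$

where k is the number of items (45 on each DTLS form), p_i is the proportion of examinees answering item i correctly, q_i = 1 − p_i, and σ_X² is the variance of examinees' total scores.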
3 Procedure

The participants were randomly assigned to two groups, with one group taking Form A of the DTLS and the other taking Form B on the first day of testing. As the purpose of this first reading task was to assess participants' reading comprehension skills in a typical standardized test setting, the DTLS was administered under standard operational conditions. Following the administration of the first form of the DTLS, each participant was introduced to think-aloud protocol procedures and provided an opportunity to practice. At this time individual appointments were
also made with each student to complete the second administration of the DTLS, during which they would complete think-aloud protocols. The second administration of the test, with subjects who took Form A on the first administration taking Form B, and vice versa, took place approximately one month after the first administration. The purpose of this phase of data collection was to have the participants verbalize the strategies they utilized while reading and answering the comprehension questions during the standardized test. In order to preserve the integrity of the timing of the testing condition, which is integral to a reading comprehension test, and at the same time get the participants to verbalize the strategies they were using during the exam, the testing conditions were modified. Readers might employ different strategies if allowed to take the test without the constraint of time limitations. Participants were told they would have 30 minutes to complete as much of the test as possible and that after reading and answering the comprehension questions for each passage they were to say 'stop'. The exam time was then suspended and participants were asked to describe the reading and test-taking strategies they had used while reading the passage and answering the comprehension questions. Following the think-aloud protocol for that portion of the test, the time was restarted as they continued to read the next passage and answer its questions. Participants were to indicate when they had completed the next passage. The time was suspended a second time, allowing them to describe their reading and testing strategies. The test was administered in this way until a total elapsed testing time of 30 minutes had passed. Participants were allowed to produce think-aloud protocols in their L1 (Spanish), their L2 (English) or both.

4 Data Analysis
Several types of data analysis were performed, providing both qualitative and quantitative data. Each data analysis task reflects part of the triangulation of data sources to examine the construct validity of the reading comprehension test. The first task involved reviewing each of the think-aloud protocols and coding them for the use of reading and test-taking strategies. Pritchard’s (1990) inventory of
reading processing strategies was used as a starting point in classifying the reading strategy data and Nevo’s (1989) Multiple-Choice Strategy Checklist was used as a starting point for classifying the test-taking strategy data reported. Additional strategies that were reported by the L2 readers in this research that were not accounted for in either Pritchard’s or Nevo’s list were categorized and added to the list of
strategies. In addition to identifying additional strategies, the superordinate category labels in the Pritchard taxonomy were modified. The new strategy categories were adapted from Cohen (1989), which were in turn adapted from Sarig (1987). Table 1 provides the list of 47 strategies that were used for classifying the data in this study, grouped into five categories. After the think-aloud protocols were transcribed and coded, and the additional processing categories developed, the reliability of the assignment of strategies to the various categories was investigated. Two independent raters were given the list of 47 processing strategies with definitions and examples of each. Ten think-aloud protocols were randomly selected and the raters independently classified the reported strategies using the 47 categories. After each rater completed classifying the data in the ten think-aloud protocols, their classifications were compared with those of the researcher and the percentage of interrater agreement was calculated. The percentage of total agreement across all three raters, who rated a total of 479 strategies, was 74%. When the ratings of the most incongruent rater were eliminated, the agreement across the two remaining raters rose to 80%. These percentages reflect the number of times that the raters agreed on the exact categorization of each of the 479 strategies.
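The interrater agreement figures reported above are exact-match percentages: the proportion of coded strategy instances on which the raters assigned the same category. The following is a minimal sketch of that calculation (illustrative only; the rater names, category labels and data layout are invented, not taken from the study):

```python
from typing import Dict, List

def exact_agreement(codings: Dict[str, List[str]]) -> float:
    """Percentage of instances on which ALL raters assigned the same category.

    `codings` maps each rater to a list of category labels, one label per
    coded strategy instance, in a shared order (479 instances in this study).
    """
    raters = list(codings.values())
    n_instances = len(raters[0])
    assert all(len(r) == n_instances for r in raters), "raters must code the same instances"
    agreements = sum(
        1 for i in range(n_instances)
        if len({rater[i] for rater in raters}) == 1  # every rater chose the same label
    )
    return 100.0 * agreements / n_instances

# Hypothetical example: three raters coding five strategy instances
codings = {
    "researcher": ["34", "3", "19", "30", "23"],
    "rater_1":    ["34", "3", "19", "32", "23"],
    "rater_2":    ["34", "3", "19", "30", "15"],
}
print(f"exact agreement: {exact_agreement(codings):.1f}%")  # 60.0% for this toy data
```

Dropping the most incongruent rater and recomputing over the remaining pair corresponds to the move from 74% to 80% agreement described above.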
The second data analysis task involved the content analysis of each test item on both forms of the test from two perspectives: that of the test designer and that of Pearson and Johnson's (1978) taxonomy of relationships between texts and test items. Information available from the publisher categorizes each question into one of the following aspects of reading comprehension: (1) understanding main ideas (30 items on both forms A and B), (2) understanding direct statements (34 items), or (3) drawing inferences (26 items). Using Pearson and Johnson's (1978) taxonomy of question and answer relationships, items were classified as having one of the following relationships: (1) textually explicit, if both the question and the answer are derivable from the text and the relationship between the question and the answer is explicitly cued by the language of the text (22 items on both forms A and B); (2) textually implicit, if both the question and the answer are derivable from the text but there are no logical or grammatical cues tying the question to the answer and the answer given is plausible in light of the question (38 items); or (3) scriptally implicit, whenever a plausible non-textual response is given to a question derivable from the text (30 items).

Third, the data from the think-aloud protocols were submitted to chi-square analyses. First, the frequencies of reported strategies were compared across question types as determined by the test developers (main idea, direct statement, and inference); then the frequencies were compared across the Pearson and Johnson question and answer relationship categories (textually explicit, textually implicit, and scriptally implicit). These chi-square analyses provide data as to whether the strategies reported by the students differed significantly according to question type. Because some of the reported frequencies for a given strategy were low (< 10), only those strategies with reported frequencies of > 10 for any one of the item types (main idea, inference, direct statement, or textually explicit, textually implicit, scriptally implicit) were included in the chi-square analysis. Using this method of analysis resulted in examining in detail only 17 of the 47 reported strategies.

The next data analysis task consisted of examining test performance data. The comprehension questions from each administration of the standardized reading test were scored and the percentage score was recorded for each student. Item difficulty (p) and item discrimination (r_pbi, the point biserial correlation) were obtained. In order to examine the relationship between strategy use and item difficulty, as well as the relationship between strategy use and acceptable discrimination, item statistics were submitted to chi-square analyses. The item difficulty (p) values were categorized into three groups: easy items, p > .67; average items, .33 ≤ p ≤ .67; and difficult items, p < .33. Point biserial data were categorized into two groups: acceptable items, r_pbi > .25; unacceptable items, r_pbi < .25.

... (r_pbi > .25), with the remaining 28 items falling below the .25 cut-off point. A chi-square analysis was calculated to examine the relationship between frequency of strategy use and item discrimination. These results are given in Table 6, indicating that there is a significant relationship between strategy use and item discrimination (χ2 = 37.7624, df = 16, p < .0016). Once again, because the overall chi-square statistic indicated a significant relationship, individual chi-square coefficients were calculated for all 17 strategies, three of which showed significance. A final chi-square was calculated to evaluate the relationship between the classification of test item types according to the test developers (main idea, direct statement, and inference) and item difficulty (easy, acceptable, and difficult). The chi-square statistic was 4.42029, df = 4, p = NS, indicating that there is not a significant relationship between test item type and level of item difficulty.

Table 5 Reported strategy frequencies and chi-square statistics for level of item difficulty (p = level of item difficulty; k = number of test items in each category; χ2 = 57.7659, df = 32, p < .0035)

Table 6 Reported strategy frequencies for level of item discrimination (r_pbi = point biserial correlation; k = number of test items in each category; χ2 = 37.7624, df = 16, p < .0016)
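The item statistics and contingency-table tests described above can be illustrated with a short script. The sketch below uses invented data (the response matrix and the strategy-by-difficulty frequency table are not the study's actual data); it computes item difficulty and point biserial discrimination, assigns the bands used in this study, and runs a chi-square test of independence.

```python
import numpy as np
from scipy.stats import chi2_contingency, pointbiserialr

# Hypothetical 0/1 response matrix: rows = examinees, columns = items.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
])
total_scores = responses.sum(axis=1)

# Item difficulty p = proportion answering correctly; discrimination r_pbi =
# point biserial correlation between the item score and the total score.
p_values = responses.mean(axis=0)
r_pbi = [pointbiserialr(responses[:, j], total_scores)[0]
         for j in range(responses.shape[1])]

def difficulty_band(p):
    # Bands used in the study: easy p > .67, average .33 <= p <= .67, difficult p < .33
    return "easy" if p > .67 else "difficult" if p < .33 else "average"

for j, (p_j, r_j) in enumerate(zip(p_values, r_pbi), start=1):
    disc = "acceptable" if r_j > .25 else "unacceptable"
    print(f"item {j}: p = {p_j:.2f} ({difficulty_band(p_j)}), r_pbi = {r_j:.2f} ({disc})")

# Chi-square test of independence on an invented strategy-by-difficulty frequency
# table (rows = strategies, columns = easy / average / difficult items).
observed = np.array([
    [11, 54, 15],
    [ 1, 33, 20],
    [22, 40, 12],
])
chi2, p_val, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, df = {dof}, p = {p_val:.4f}")
```

In the study itself, strategies with fewer than 10 reported occurrences for every item type were screened out before such tests were run, leaving the 17 strategies analysed in detail.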
IV Discussion
The discussion will centre on the information that is gained by using a triangulation approach to examining the construct validation of this reading comprehension test. Recall that this approach brings together several sources of information; in this study these were: data from the participants' retrospective think-aloud protocols of their reading and test-taking strategies, data from a content analysis of the
reading comprehension passages and questions, and test performance data. By examining the data from these three sources we are better able to understand the interactions among learner strategies, test content and test performance.

Six of the 17 strategies listed in Table 2 provide insights into the strategies that the test takers used during this reading comprehension test. The chi-square analyses indicate that these subjects are using these strategies differently, depending on the type of question that is being asked. The second most frequently reported strategy (213 occurrences), trying to match the stem with the text (strategy 34), was reported more frequently (96 occurrences) for inference type questions than for the other question types (direct statement, 73 occurrences; main idea, 44 occurrences). Stating failure to understand a portion of the text (strategy 3, 146 total occurrences) occurred more frequently on test items that are intended to measure inference (61 occurrences) and less frequently on the other two item types (direct statement, 51 occurrences; main idea, 34 occurrences). Strategy 19, paraphrasing, occurred a total of 104 times, with more occurrences of this strategy reported for direct statement items (43 occurrences) than for inference and main idea question types (30 and 31 occurrences respectively). The strategy of guessing (strategy 30) occurred more frequently on the inference question type (27 occurrences) than on the other question types (direct statement, 16 occurrences; main idea, 10 occurrences). There were fewer occurrences of the strategy of rereading (strategy 23) on items that are designed to test direct statements of text (eight occurrences), while there were more occurrences of this strategy on items designed to test identification of main idea (22 occurrences) and the ability to make inferences (15 occurrences). Finally, there were fewer references to time allocation (strategy 37) on items designed to test inference (five occurrences) than on items designed to test main idea and direct statement (13 and 14 occurrences respectively).

The results of the chi-square analysis also show that the subjects tended to use some strategies fairly consistently for the different types of items. The readers reported using six of 17 strategies consistently across the three question types of main idea, inference, and direct statement. This indicates that there is no relationship between these strategies and the purpose of the test question as determined by the test developers. There is not a statistically significant difference between item type and strategy use for the remaining 11 strategies. With the exception of one strategy (35, to be discussed below) we would not expect to see the reader/test taker using these strategies differently based on question type. For example, test item type should not influence making reference to lexical items that impede comprehension (strategy 8),
responding affectively to text content (strategy 11), skimming (strategy 14), scanning (strategy 15), extrapolating information from the text (strategy 21), reacting to the author's style or the text's surface structure (strategy 25), selecting an answer through elimination (strategy 32), selecting an answer through deductive reasoning (strategy 33), selecting an answer based on understanding the text (strategy 36), or expressing uncertainty at the correctness of an answer (strategy 43). The one strategy that seems somewhat puzzling is strategy 35, selecting an answer because it is stated in the text. The readers/test takers reported using this strategy as frequently on direct statement questions as on main idea and inference type questions. We would expect the use of this strategy for direct statement questions. But for main idea and inference type questions, the information will not be stated explicitly in the text, and thus the readers should not be using this strategy. Perhaps they are trying to find the information but are not able to find it.

When the content of the items is examined through a different paradigm (e.g., the Pearson and Johnson question and answer relationships), there does not seem to be a significant relationship between the types of strategies that the readers/test takers used and the classification of test items. There does appear to be a relationship between these two systems of classification of test questions (as shown by the chi-square results of Table 4), yet the frequency of strategies that readers reported using to answer the questions shows that there is not a significant relationship between strategy use and question type when viewed through the Pearson and Johnson categories. A discussion of where the differences occur between these two content analysis methods is in order. In evaluating the question types across these two frameworks (see Table 4), the expected outcomes were generally supported: direct statement items tended to be textually explicit (16 of 26 items); inference items tended to be textually implicit or scriptally implicit (31 of 34 items); and items testing main ideas, as expected, showed substantial agreement with the Pearson and Johnson categories of textually implicit and scriptally implicit items (27 of 30 items). The results of three comparisons - textually implicit/direct statement (nine items), scriptally implicit/direct statement (one item), and textually explicit/main idea (three items) - were not expected. If an item was a direct statement of material presented in the text, then the item should not be textually or scriptally implicit. The Pearson and Johnson categories were designed to determine the relationship between a reading comprehension question and the reading passage: is the answer explicitly or implicitly stated in the text, or is the answer scriptally based? An artifact is that the Pearson and Johnson
categories are independent of the question categories developed by the test designers. For example, the answer to 'What is the main idea of the passage?' could be textually explicit, textually implicit, or scriptally implicit. For these reasons the results of the chi-square analyses - significant differences in the frequency of strategy use on the question categories developed by the test designers but not on the Pearson and Johnson question and answer relationships - are not all that surprising.

The test performance data, combined with the think-aloud protocol data, provide additional insights into what these second language readers are doing during this reading comprehension test. For four of the 17 strategies listed, significantly fewer occurrences of the strategy are reported on the ten items that are classified as easy. These strategies include: guessing (strategy 30); selecting an answer not because it was thought to be correct, but because the others did not seem reasonable, seemed similar, or were not understandable (strategy 32); matching the stem and/or alternatives to a previous portion of the text (strategy 34); and selecting a response based on understanding the material read (strategy 36). For each of these strategies it is easy to see why there would be fewer occurrences on test items that are easy. Less guessing, less need to select a response through elimination, less need to match the test question stem with the text, and less need to state that the response was selected based on understanding the text each seem appropriate on test items that are easy. For the remaining five strategies that show a statistically significant relationship on the individual chi-square statistic, the strategy occurs more frequently on items that have an acceptable level of difficulty. These include responding affectively to the text content (strategy 11, 35 occurrences), skimming (strategy 14, 27 occurrences), paraphrasing (strategy 19, 61 occurrences), selecting an answer because it is stated in the text (strategy 35, 29 occurrences), and making reference to time allocation (strategy 37, 15 occurrences). For these nine strategies that show a statistically significant relationship on the chi-square analysis, the difficult items do not seem to elicit a high frequency of strategies. This is attributed to the fact that not all of the students completed the exam and many of the difficult items occurred at the end of the exam; thus fewer students reported on what they did to answer these questions.

The test performance data on item discrimination, in conjunction with the think-aloud protocol data, continue to provide a better view of what the readers in this study are doing during this reading comprehension test. For three of the 17 reported strategies, significantly fewer occurrences were reported for items classified as unacceptable according to the point biserial correlations: stating
failure to understand (strategy 3), scanning reading material for a specific word or phrase (strategy 15), and selecting a response based on understanding the material read (strategy 36).

Keeping in mind the value of the triangulation approach to the evaluation of the construct validity of this reading comprehension test, it is important to note that there was not a significant relationship between the test item types as determined by the test developers and the level of item difficulty. One item type was not easier, more acceptable or more difficult than the other two item types.

The greatest insight into the concept of construct validation of reading comprehension is gained as we now combine the information from all three sources. This insight is particularly valuable for a discussion of five strategies: stating failure to understand (strategy 3), paraphrasing (strategy 19), guessing (strategy 30), matching the stem with a previous portion of the text (strategy 34), and making reference to time allocations (strategy 37).

Stating failure to understand occurred more frequently on test items that were directed at getting the reader to make an inference (61 occurrences). This strategy was used fewer times on easy items (11 occurrences) and more times (83 occurrences) on items that discriminated well among those readers who scored high overall on the test. The thrust of this strategy lies in the reader's ability to monitor reading comprehension. On an easy item a reader is not likely to have difficulty understanding, and thus has no need to state failure to understand. Likewise, on items that discriminate well between those scoring high on the test and those scoring low, fewer occurrences of stating failure to understand would result.

Paraphrasing (strategy 19) occurred more frequently on items directed at getting the reader to identify the direct statement of the passage. The use of this strategy occurred more times on items classified as acceptable in terms of discrimination. The ability to paraphrase appears to be a strategy that the readers employ when answering test items of acceptable difficulty level.

Guessing (strategy 30) is reported more times on inference items (27 occurrences), fewer times on easy items (one occurrence), and occurs about as often on acceptable and rejected items (33 and 20 occurrences respectively). Perhaps readers are doing more guessing on multiple-choice test questions than we think they are.

Matching the stem with a previous portion of the text (strategy 34) is reported fewer times on items directed at identifying the main idea, reported fewer times on easy items, and reported fewer times on acceptable items. Yet there are many occurrences (96, see Table 2) of this strategy reported for items that are designed to test a reader's ability to make inferences. This seems to indicate that the readers are
trying to match the stem with a previous portion of the text on items that are designed not to be answered directly from the text but from the reader’s ability to make inferences. Also, because this strategy is used so frequently (140 reported occurrences, see Table 6) on items that fall below the .25 cut-off for discrimination, readers scoring well overall on the test are applying this strategy on items that are being missed.
Finally, making reference to time allocations (strategy 37) is reported fewer times on inference questions and more times on items that are classified as acceptable. The fact that these test takers make reference to the amount of time allocated for completing test questions suggests that we should reconsider how much time test takers are given to answer test questions.
V Implications for further research
Continued research needs to be directed at the use of more than one data source in the evaluation of reading comprehension tests. Through the use of a triangulation approach to test evaluation, test developers can better determine whether the test passages and test items are performing as they are intended to perform. Of concern to the second language classroom teacher is how readers should be taught to take standardized tests so that their scores will more appropriately reflect their language abilities. The list of strategies in this study does not represent a pedagogical curriculum for the classroom. The classroom teacher needs an approach for identifying what readers are actually doing and for determining whether they are successful in their application of reading strategies. Additional research into the teaching of reading and test-taking strategies is also needed.
VI Conclusion
This pilot investigation has approached the use of various data sources in the process of examining the construct validation of a reading comprehension test. Perhaps the greatest insight gained from this investigation is that more than one source of data needs to be used in determining the success of reading comprehension test items. By combining sources of data such as those examined in this study (i.e., data from readers' retrospective think-aloud protocols, test content evaluation, as well as the traditional test performance statistics), greater insights are gained into the reading comprehension process as well as the test-taking process. This information is valuable for test developers in evaluating test items, as well as for classroom teachers
concerned with preparing second language learners to use the language more successfully.
VII References
Afflerbach, P. and Johnston, P. 1984: On the use of verbal reports in reading research. Journal of Reading Behavior 16, 307-22.
Alderson, J.C. no date: A pilot study: getting students to talk about taking a reading test. Mimeograph.
Alderson, J.C. 1984: Reading in a foreign language: a reading problem or a language problem? In Alderson, J.C. and Urquhart, A.H., editors, Reading in a foreign language. New York: Longman.
Alderson, J.C. 1988: Testing reading comprehension skills. Paper presented at the Sixth Colloquium on Research in Reading in a Second Language, TESOL, Chicago, March 1988.
Alderson, J.C. 1990: Judgements in language testing. Paper presented at the Twelfth Annual Language Testing Research Colloquium, San Francisco, CA, March 1990.
Alderson, J.C. and Lukmani, Y. 1989: Cognition and reading: cognitive levels as embodied in test questions. Reading in a Foreign Language 5, 253-70.
Alderson, J.C., Henning, G. and Lukmani, Y. 1987: Levels of understanding in reading comprehension tests. Paper presented at the Ninth Annual Language Testing Research Colloquium, Miami, FL, April 1987.
American Psychological Association 1985: Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Anderson, N.J. 1989: Reading comprehension tests versus academic reading: what are second language readers doing? Unpublished doctoral dissertation, The University of Texas at Austin, Austin, Texas.
Bachman, L.F. 1990: Fundamental considerations in language testing. New York: Oxford University Press.
Bachman, L.F., Davidson, F., Lynch, B. and Ryan, K. 1989: Content analysis and statistical modeling of EFL proficiency tests. Paper presented at the 11th Annual Language Testing Research Colloquium, San Antonio, Texas, March 1989.
Block, E. 1986: The comprehension strategies of second language readers. TESOL Quarterly 20, 463-94.
Cohen, A.D. 1980: Testing language ability in the classroom. Rowley, MA: Newbury House Publishers.
Cohen, A.D. 1984: On taking language tests: what the students report. Language Testing 1, 70-81.
Cohen, A.D. 1986: Mentalistic measures in reading strategy research: some recent findings. English for Specific Purposes 5, 131-45.
Cohen, A.D. 1987a: Recent uses for mentalistic data in reading strategy research. Documentacao de Estudos em Linguistica Teorica e Aplicada (DELTA) 3, 57-84.
Cohen, A.D. 1987b: Research on cognitive processing in reading in Brazil. Documentacao de Estudos em Linguistica Teorica e Aplicada (DELTA) 3, 215-35.
Cohen, A.D. 1987c: Student processing of feedback on their compositions. In Wenden, A.W. and Rubin, J., editors, Learner strategies in language learning. Englewood Cliffs, NJ: Prentice-Hall International.
Cohen, A.D. 1987d: Using verbal reports in research on language learning. In Faerch, C. and Kasper, G., editors, Introspection in second language research. Philadelphia, PA: Multilingual Matters.
Cohen, A.D. 1989: Gathering self-observational and self-revelational data on reading strategies. Paper presented at the meeting of Teachers of English to Speakers of Other Languages, San Antonio, Texas.
Cohen, A.D. in press: English testing in Brazil: problems in using summary tasks. In Hill, C. and Parry, K., editors, The test at the gate: ethnographic perspectives on the assessment of English literacy. Cambridge: Cambridge University Press.
Cohen, A.D. forthcoming: The role of instructions in testing summarizing ability. In Douglas, D. and Chapelle, C., editors, A new decade of language testing. Ann Arbor, MI: University of Michigan Press.
Cohen, A.D. and Hosenfeld, C. 1981: Some uses of mentalistic data in second language research. Language Learning 31, 285-313.
College Entrance Examination Board 1978: Student guide to the reading comprehension test: descriptive tests of language skills of the College Board. Princeton, NJ: Educational Testing Service.
Douglas, D. and Selinker, L. 1985: Principles for language tests within the 'discourse domains' theory of interlanguage: research, test construction and interpretation. Language Testing 2, 205-26.
Duran, R.P. 1989: Testing of linguistic minorities. In Linn, R.L., editor, Educational measurement (3rd edition). New York: Macmillan Publishing Company.
Educational Testing Service 1977: Descriptive tests of language skills of the College Board: reading comprehension. Princeton, NJ: Educational Testing Service.
Educational Testing Service 1985: Guide to the use of the descriptive tests of language skills. Princeton, NJ: Educational Testing Service.
Ericsson, K.A. and Simon, H.A. 1984: Protocol analysis: verbal reports as data. Cambridge, MA: The MIT Press.
Faerch, C. and Kasper, G. 1987: Introspection in second language research. Philadelphia, PA: Multilingual Matters.
Farr, R., Pritchard, R. and Smitten, B. 1990: A description of what happens when an examinee takes a multiple-choice reading comprehension test. Journal of Educational Measurement 27, 209-26.
Fransson, A. 1984: Cramming or understanding? Effects of intrinsic and extrinsic motivation on approach to learning and test performance. In Alderson, J.C. and Urquhart, A.H., editors, Reading in a foreign language. New York: Longman.
Garner, R. 1982: Verbal-report data on reading strategies. Journal of Reading Behavior 14, 159-67.
Garner, R. 1988: Verbal-report data on cognitive and metacognitive strategies. In Weinstein, C.E., Goetz, E.T. and Alexander, P.A., editors, Learning and study strategies: issues in assessment, instruction, and evaluation. San Diego, CA: Academic Press, Inc.
Gordon, C.M. 1987: The effect of testing method on achievement in reading comprehension tests in English as a foreign language. Unpublished MA thesis, Tel-Aviv University, Tel-Aviv, Israel.
Grotjahn, R. 1986: Test validation and cognitive psychology: some methodological considerations. Language Testing 3, 159-85.
Hughes, A. 1989: Testing for language teachers. New York: Cambridge University Press.
Messick, S.A. 1975: The standard problem: meaning and values in measurement and evaluation. American Psychologist 30, 955-66.
Messick, S.A. 1981: Constructs and their vicissitudes in educational and psychological measurement. Psychological Bulletin 89, 575-88.
Messick, S.A. 1989: Validity. In Linn, R.L., editor, Educational measurement (3rd edition). New York: American Council on Education/Macmillan Publishing Company.
Nevo, N. 1989: Test-taking strategies on a multiple-choice test of reading comprehension. Language Testing 6, 199-215.
Paris, S.G., Lipson, M.Y. and Wixson, K.K. 1983: Becoming a strategic reader. Contemporary Educational Psychology 8, 293-316.
Pearson, P.D. and Johnson, D.D. 1978: Teaching reading comprehension. New York: Holt, Rinehart and Winston.
Perkins, K. and Linville, S.E. 1987: A construct definition study of a standardized ESL vocabulary test. Language Testing 4, 125-41.
Pritchard, R. 1990: The effects of cultural schemata on reading processing strategies. Reading Research Quarterly 25, 273-95.
Sarig, G. 1987: High-level reading in the first and in the foreign language: some comparative process data. In Devine, J., Carrell, P.L. and Eskey, D.E., editors, Research in reading in English as a second language. Washington, DC: Teachers of English to Speakers of Other Languages.
Segel, K.W. 1986: Does a standardized reading comprehension test predict textbook prose reading proficiency of a linguistically heterogeneous college population? Unpublished PhD dissertation, The University of Texas at Austin.
Tuckman, B.W. 1978: Conducting educational research (2nd edition). New York: Harcourt Brace Jovanovich, Inc.
Appendix A

The following is an example passage and an example of each question type (understanding main ideas, understanding direct statements, drawing inferences) that illustrate the tasks that the reader must perform (College Entrance Examination Board, 1978).
Time: 30 minutes

Directions: Each passage below is followed by questions based on its content. Answer all questions following a passage on the basis of what is stated or implied in that passage.
Sample Passage

During the '50s, each TV season offered 39 weeks of new shows, and 13 weeks of repeats. Slowly, the ratio has reversed. The ultimate goal may be a one-week season, 50 weeks of repeats, and one week off for good behaviour.

1. The main point the writer is making is that
*(A) television shows are being repeated more often than ever
(B) shows must be repeated to allow time to prepare new shows
(C) repeated shows are used to gain good ideas for new shows
(D) repeating shows cuts down costs

2. When did the change in television that the passage describes take place?
(A) During the past year
(B) Only very recently
*(C) Over a period of time
(D) Several years ago

3. What does the writer most probably think of the situation in television that he or she is telling us about?
(A) It is better than it was before.
(B) It cannot be helped.
(C) It may soon improve.
*(D) It is becoming ridiculous.
Appendix B

Item difficulty and point biserial correlations