Examining the Yes/No vocabulary test: some methodological issues in theory and practice
Renaud Beeckmans, June Eyckmans, Vera Janssens, Michel Dufranne and Hans Van de Velde
Institut des Langues Vivantes et de Phonétique, Université Libre de Bruxelles
This article evaluates the characteristics of the Yes/No test as a measure of receptive vocabulary size in a second language (L2). The evaluation was conducted both on theoretical grounds and on the basis of a large corpus of data collected from French-speaking learners of Dutch. The study focuses on the internal qualities of the format in comparison with other, more classical test formats. The central issue of determining a meaningful test score is addressed by providing a theoretical framework distinguishing discrete from continuous models. Correction formulae based on the discrete approach are shown to behave differently when applied to the Yes/No test than when applied to Multiple Choice (MC) or True/False formats. Correction formulae based on the continuous approach take the response bias into account, but certain underlying assumptions need to be validated. It is shown that both correction schemes display several shortcomings and that most of the reliability estimates for the Yes/No test presented in the literature are too high. Finally, several future research options are proposed in order to arrive at a straightforward but reliable and valid instrument for measuring receptive vocabulary size.
I Introduction

As a result of the renewed interest in vocabulary acquisition and research (Meara, 1996), the demand for valid and accurate testing devices to measure learners' vocabulary knowledge has increased. This article aims to tackle the scoring problems that arise when using the Yes/No format as a measure of receptive vocabulary. With the construct receptive vocabulary, we refer to the learners' ability to recognize target words and to understand their meaning. The reliability of the Yes/No format will be re-assessed both by considering its theoretical grounds and by examining experimental data with regard to factors affecting validity.

Address for correspondence: June Eyckmans, Institut de Langues Vivantes et de Phonétique, CP 110, Université Libre de Bruxelles, Avenue FD Roosevelt 50, 1050 Bruxelles, Belgium; email: jeyckman@ulb.ac.be

Language Testing 2001 18 (3) 235–274
After describing the Yes/No format and its origins (Section II), the main pros and cons of this test format are listed (Section III). In Section IV a study is presented in which a Yes/No test is used as the vocabulary section of a placement test, aimed at estimating how many high-frequency words of Dutch are known by French-speaking university students, in order to be able to place them in appropriate groups in their language programme. The data obtained with the Yes/No format are carefully analysed in comparison with data available for comparable populations on more classical grammar tests. The major issue of correction formulae is addressed by giving an overview of the implications of using discrete models vs. continuous models. Experimental data confirm that different models lead to dramatically different test scores. Transforming the raw data into final test scores also leads to a marked drop in reliability. Section V discusses the implications of the different theoretical approaches in trying to take the response bias into account, which appears to be the central issue in interpreting the data from Yes/No tests.

II The Yes/No vocabulary test

The Yes/No vocabulary test is a simple test format that intends to measure learners' receptive vocabulary size by presenting them with a sample of words in the target language covering certain frequency levels and asking them to indicate the words they know the meaning of.

1 Historical development

The Yes/No format is derived from a simple format known as the 'checklist', which presents the learners with a set of words and instructs them to mark the words of which they know the meaning. This format was originally used in first language research (Sims, 1929; Tilley, 1936; Zimmerman et al., 1977). Unfortunately, learners' self-report of whether they know a word or not appeared to be a poor guide to their knowledge of vocabulary (Read, 1997a; Nation, 1990). Therefore, Anderson and Freebody (1983) decided to add pseudowords1 to the list in order to take into consideration the possibility that certain learners might overestimate their knowledge. Claiming knowledge of the pseudowords leads to adjusting the score downwards to provide a better estimate of the knowledge of the real words.
1 We prefer the term ‘pseudowords’ to ‘non-words’ (Read, 1997a) or ‘imaginary words’ (Meara and Buxton, 1987) since these words obey the phonotactic and morphological rules for word formation in the given language. Therefore the term ‘pseudo’ appears the most appropriate. The formation of these pseudowords is discussed in Part 5 of Section IV.
Meara and Buxton (1987) applied this adjusted Yes/No format to second language (L2) learners in a first attempt to establish whether this test design was workable. They developed a Yes/No test with 60 real words and 40 pseudowords. Students were asked to indicate if they knew 'the meaning of the word'. Meara and Jones (1988, 1990) developed a computerized checklist, the Eurocentres Vocabulary Size Test (Meara and Jones, 1990), which incorporates real words and pseudowords from various frequency levels. An estimate of the individual's vocabulary size is made up to a ceiling level of 10 000 words. The same basic methodology was used in the EFL Vocabulary Test (Meara, 1992), in which some changes to the scoring mechanism were introduced. The formula proposed by Meara to calculate a representative test score was called Δm.

2 Scoring design of the Yes/No test

The introduction of pseudowords in the test format has implications for the calculation of the test score. As there are two different kinds of item to which the learner is exposed and two possible responses, four resulting combinations are possible for each item (see Figure 1):
• hit: ticking a real word;
• false alarm: ticking a pseudoword;
• miss: not ticking a real word;
• correct rejection: not ticking a pseudoword.
This terminology finds its origin in Signal Detection Theory (SDT) (for a description of the theory as it can be applied in the case of the Yes/No test, see Part 6 of Section IV), which provides a theoretical framework for describing subjects' decision behaviour in a detection task (Green and Swets, 1966).
Figure 1 The item-response matrix of the Yes/No test
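To make this terminology concrete, the short sketch below tallies the four outcome categories and the resulting hit and false-alarm rates for a single test-taker. The code and the toy data are our own illustration, not material from the original study.

```python
# Tally the four response categories of a Yes/No test for one test-taker.
# `items` maps each stimulus to True (real word) or False (pseudoword);
# `ticked` is the set of stimuli the test-taker claimed to know.

def yes_no_outcomes(items, ticked):
    hits = sum(1 for w, real in items.items() if real and w in ticked)
    false_alarms = sum(1 for w, real in items.items() if not real and w in ticked)
    misses = sum(1 for w, real in items.items() if real and w not in ticked)
    correct_rejections = sum(1 for w, real in items.items() if not real and w not in ticked)
    return {
        "hits": hits,
        "false_alarms": false_alarms,
        "misses": misses,
        "correct_rejections": correct_rejections,
        "hit_rate": hits / (hits + misses),                                    # real words ticked
        "false_alarm_rate": false_alarms / (false_alarms + correct_rejections),  # pseudowords ticked
    }

# Toy example: two real Dutch words and two of the pseudowords cited later in the article.
items = {"huis": True, "fiets": True, "pretachtig": False, "tommerman": False}
print(yes_no_outcomes(items, ticked={"huis", "pretachtig"}))
```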
The most straightforward way of rendering a global test result would simply be to consider the rate of correct responses (see the correct vs. false responses in Figure 1). However, among the numerous scoring methods that have been proposed in the past, this option was never adopted. Yet the false-alarm rate plays an important role when calculating a test score, since the pseudowords were introduced in the test design in order to prevent an overestimation of the subject's knowledge. All formulae encountered in the literature adhere to the same principle: among subjects with the same hit rate, those with higher false-alarm rates end up with a lower test score. Before highlighting the complex problem of scoring a Yes/No test, which is the central issue of this article, some of the general properties of the Yes/No test design are addressed in the following section.
III Pros and cons of the Yes/No test

Because of its many merits, in recent years the Yes/No technique has been used for research purposes in second language acquisition (SLA) (Huibregtse and Admiraal, 1999; Hermans, 2000) and as a placement test. The test is easy to construct, administer and correct, and the format permits the testing of a large number of words in a short span of time. Recently, the Yes/No test has been incorporated into the European Dialang project, a computerized design for assessing language proficiency in 14 European languages (http://www.dialang.org). Within this learner-oriented framework, the testee can arrive at an estimate of his or her receptive vocabulary size in a particular target language. However, recent data raise doubts concerning the effectiveness of the Yes/No vocabulary test as a measure of receptive vocabulary size. Some of the shortcomings listed below have already been pointed out by other researchers:
• The Yes/No format itself is not clearly defined. The name suggests a format with an explicit distinction in choosing Yes or No, as was the case in Meara (1992). However, some studies – e.g., the first article on the use of the Yes/No vocabulary test in SLA by Meara and Buxton (1987) – used formats where the subjects had to tick the words they claimed to know, which does not allow
the identification of possible omitted responses.2 This could cause confusion when interpreting the test results. Meara's intention, however (personal communication), was that the Yes/No test should be a forced-choice test, where the possibility of non-responses is explicitly ruled out.
• The task with which the learner is confronted is not a test strictly speaking. Its style is situated between a conventional language test (i.e., characterized by verifiable responses) and self-assessment. A conventional test elicits answers to particular language tasks which are defined a priori and which are corrected accordingly. Self-assessment, however, is concerned with how learners judge their own ability in a particular skill (Oscarson, 1997). The status of correct/false responses clearly differs between both situations. The fact that the Yes/No test cannot be seen as either one or the other causes an ambiguity that taints the interpretation of the outcome of the test.
• The Yes/No format does not permit the testing of multiple meanings of a given word (Abels, 1994).
• There are no clear guidelines for the construction of pseudowords. There is a general consensus that the pseudowords should respect the phonotactic and morphological rules of the target language, but the extent to which they can differ from existing words remains unclear. The commonly used method (Abels, 1994) of changing more than one letter in a word in order to prevent test-takers misreading the pseudoword for the real word is not foolproof, since changing two or three letters in a real word could create a pseudoword that differs by only one letter from another real word.
• The test format may be problematic for subjects suffering from dyslexia, even slight forms of dyslexia.
• The proportion of words and pseudowords in the test varies from one study to another. Meara and Buxton (1987) and Abels (1994) used 60 words and 40 pseudowords, Meara (1992) used 40 words and 20 pseudowords per frequency range, and Hacquebord (1999) used 60 words and 30 pseudowords. A different proportion of words vs. pseudowords, while using the same formula, will alter the test results.
• The required length of the test in order to attain a representative estimate of the vocabulary size within a certain frequency range has perhaps been underestimated in the past. Meara (personal
2 Lord (1980) distinguishes ‘omitted responses’ (i.e., items that the subject read and decided not to answer) from ‘not-reached responses’. We do not make this distinction and use the term ‘omitted responses’ in both cases.
communication) suggests on the basis of his early work that 60 real words is too small a sample to be workable and currently recommends 180 words vs. 120 pseudowords. On the basis of the most recent data that Meara has obtained from the Dialang project, a ratio of 100 words vs. 50 pseudowords seems to be a good compromise.
• Little attention has been paid to the test instruction and its implications for the learner's choices.3 Several authors have pointed out that there are several levels to 'knowing a word' (Richards, 1976; Nation, 1990; Read, 1993). Even if one assumes that the Yes/No test taps into a kind of fundamental knowledge of a word, this does not rule out the possibility of complex interaction between different test instructions and several levels of knowing a word.
• The presentation of isolated words may reinforce a simplistic view of what 'knowing a word' entails. Contextualized words provide a much richer environment and may enhance the learner's awareness of the usage of these words (Read, 1997b).
• There seems to be a particular problem in administering the test in situations where there is a strong lexical resemblance between the target language and the learner's mother tongue. Meara (1996) was confronted with this cognate effect in administering an English Yes/No vocabulary test to speakers of French.
• The test does not perform well with low-level learners, who respond unpredictably to the pseudowords. Certain learners obtain very low scores as a result of their overwillingness to claim knowledge of the pseudowords (Meara, 1996).
• A problem shows up with the formulae used to calculate the test scores. The several formulae that have been proposed so far have been adapted either from the standard correction for guessing formula or from another scientific domain (i.e., SDT). Although the general principle of reducing the test score according to the size of the false-alarm rate remains the same in both cases, the precise way in which this reduction is executed (in other words, the way in which the response bias effects are dealt with) varies greatly from one approach to another. Consequently, different formulae
3 The instruction of the Yes/No test in Meara (1987) reads ‘Tick the words you know the meaning of’. In the EFL Vocabulary Test (Meara, 1992) this has changed into ‘For each word: if you know what it means, write Y (for Yes) in the box; if you don’t know what it means, or if you aren’t sure, write N (for No) in the box’. In the vocabulary tests of the piloting tool of the DIALANG project, the following instruction is given: ‘Below is a list of “words”, some of which really exist and some of which are invented. Press the “Yes” or “No” button next to each word’. Meara (personal communication) deliberately kept the instruction simple and vague. He regrets the clarification in the instruction for participants on paper: ‘Decide if the words are real words in the language you are being tested on by answering “yes” or “no”’.
applied to the same data may lead to very different results. The question remains which formula will yield the most meaningful test score.

In short, a review of the literature and our initial dealings with the Yes/No vocabulary test have led us to conclude the following. First, although the Yes/No vocabulary test has obvious attractions for vocabulary assessment in SLA and for school and classroom use, there are numerous design and analysis issues which need to be addressed if this type of test is to be considered a valid measure of L2 vocabulary knowledge. Few of the drawbacks listed above have been investigated, especially from a measurement perspective. Secondly, and essentially, the problem of establishing an adequate scoring method is more than just one relevant issue among others: it constitutes a prerequisite in order to be able to address many of the aforementioned properties.

The aim of this study is to give some insight into the central problem of scoring a Yes/No test. More precisely, this will be done by:
• providing a theoretical framework based on the distinction between discrete vs. continuous models;
• addressing the specifications of the Yes/No format in comparison with more classical test formats within this theoretical framework;
• gathering empirical evidence about the specific characteristics of our student population that emerge from more classical grammar tests and evaluating their implications for the different correction schemes used for the Yes/No test;
• weighing the consequences of the different correction formulae on the test reliability in relation to the response bias problem.

IV The study

A Yes/No vocabulary test was created to be used as part of a placement test for our French-speaking learners of Dutch, to complement a grammar test. This study is motivated by the fact that the vocabulary test results and their interpretation appeared to be more problematic than those of the grammar counterpart. Some unusual trends caught our eye with reference to the results of the Yes/No test:
• many subjects displayed a high rate of false alarms in their responses;
• these high false-alarm rates were not restricted to weak students;
• there appeared to be a disturbing inverse relationship between the ability to identify words and the ability to reject pseudowords, whereas the logic of the format expects students with a high score
on the identification of words to reject most or all of the pseudowords.

These findings led us to examine carefully the impact of different correction formulae on the results of the Yes/No test. Other test formats (MC and True/False tests) which have been used for many years, and the empirical data they rendered for similar student populations, were used to shed light on the relationship between the correction formulae and the test formats. Correction formulae used for MC and True/False tests are not necessarily applicable to scores of a Yes/No test. Moreover, several characteristics of the target population appear to have direct implications for the different correction schemes.

1 Participants

Our participants are Belgian French-speaking university students of Economics and Business Administration following compulsory Dutch-language courses as part of their curricula. They all share a history of learning Dutch as a compulsory L2 in primary and/or secondary school, but the number of course hours they followed and the levels they obtained vary greatly.4 In the first year of their university studies no languages come into play. Their levels range from weak to advanced, and a placement test is required to place them into homogeneous groups for their Dutch course in their second year of university.

2 Aims of the placement test

In order to be able to assign students to the appropriate classes with a minimum of administrative effort, an efficient and accurate placement procedure is needed. At first this placement test consisted of a grammar test, which used to be a True/False test. Since 1999, the True/False format has been replaced by a four-alternatives MC format
4 This is partly due to the complex Belgian language policy concerning language education in different parts of the country. Due to the latest changes in legislation (Décret portant sur l'organisation de l'enseignement maternel et primaire ordinaire et modifiant la réglementation de l'enseignement, 13 July 1998), schools that are situated in the southern part of Belgium (la région wallonne) are released from the obligation to organize Dutch courses. The local school authorities can decide to give priority to courses in English or German as an L2 instead of Dutch. In Brussels, however, Dutch remains the compulsory L2 course. Since the Université Libre de Bruxelles attracts students from all parts of the country, it will not be long before we are confronted with students who never attended Dutch courses and are in fact absolute beginners.
and a vocabulary test was added (see Table 1). The Yes/No vocabulary test was added not so much to define what a participant's vocabulary size is in absolute terms, but to check whether a subject knows the core vocabulary of Dutch (3700 words; see Dieltjens et al., 1995, 1997). Adequate knowledge of high-frequency words of Dutch is a prerequisite in order to deal with the course's reading materials. The results of these tests also serve as diagnostic feedback for the learners, who are advised to fill the gaps in their grammar and vocabulary knowledge of Dutch in order to meet the minimum standards when entering the language programme the following year. Table 1 summarizes the available data which are used to illustrate the discussion below.

3 Construction of the True/False grammar test

The curriculum of the first year aims to consolidate Dutch core grammar, which is organized into a number of grammatical units. Bearing these units in mind, a large corpus of sentence items (grammatically correct or not) was created. In a second stage, several item analyses (Beeckmans and De Valck, 1993) served as a guide for selecting a series of 100 items in order to attain a sample of sentences representative of the defined grammatical categories and to assure a good level of reliability. A Cronbach's alpha of about .90 was considered a threshold level.

4 Construction of the MC grammar test

The same grammatical content was maintained for creating a large number of four-alternatives MC items that were extensively used within the framework of the CALL (Computer Assisted Language Learning) facilities which form part of the Dutch language course curriculum. On the basis of the difficulty index automatically recorded, 78 items were selected in order to obtain a suitable MC grammar test as part of the placement test. The same reliability standard applied here.

5 Construction of the Yes/No vocabulary test

Two parallel versions (I and II) of the Yes/No vocabulary test were created, each one consisting of 60 words and 40 pseudowords. These followed the ratio of the original Yes/No test (Meara and Buxton, 1987). All words, including those transformed into pseudowords, were taken from Woorden in Context (Dieltjens et al., 1995, 1997),
a standard work which contains 3700 Dutch words selected on the basis of frequency and utility. Each sample (I and II) contained:
• 25 words from the 1000-word level;
• 25 words from the 1000- up to the 2000-word level;
• 50 words from the 2000- up to the 3700-word level.
All words were randomly selected and therefore contained verbs, nouns and adjectives, as well as conjunctions, prepositions and numerals. The pseudowords were created employing the same word alteration principles as used by Anderson and Freebody (1983) and described in Abels (1994) and Van De Walle (1999) (the latter are two studies dealing with Yes/No tests in Dutch). It should be noted that, in order to preserve the universal properties of the format, every language teacher (native speaker or not) should be able to create a Yes/No test in the target language by following a few simple rules. Occasionally, applying one word-formation procedure can result in a pseudoword that could also have been obtained by using an alternative procedure.
• The first procedure consists in changing the affixes of an existing word: 22 pseudowords (i.e., the number of words in the sample which permitted this kind of change) were created like this. Example: prettig ('fun') is turned into pretachtig.
• The second procedure centres around the substitution of one or two graphemes while respecting the phonotactic and morphological rules for word formation in Dutch: the remaining 58 pseudowords were created like this. Example: timmerman ('carpenter') is turned into tommerman.
In order to control for sequence effects and to eliminate the possibility of cheating, three forms (A, B, C), differing in item order only, were constituted for each sample. The following assignment was given in the subjects' first language: Indiquez à l'aide d'une croix les mots que vous connaissez. Certains mots repris dans la liste n'existent pas en néerlandais! (Tick the words you know. Certain words in the list do not exist in Dutch!). The time limit was 10 minutes, which was sufficiently long to respond to all items. This paper-and-pencil test was administered to 488 subjects.
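For illustration only, the sketch below mimics the test assembly just described: real words are sampled from the three frequency bands (25/25/50), pseudowords are added, and three forms differing only in item order are produced. The word lists and function names are placeholders of our own; the actual Woorden in Context material is not reproduced here.

```python
import random

def build_yes_no_forms(band1, band2, band3, pseudowords, n_forms=3, seed=0):
    """Assemble one sample of the Yes/No test and return `n_forms` orderings of it.

    band1, band2, band3: candidate real words from the three frequency levels;
    pseudowords: the 40 pseudowords created for this sample.
    """
    rng = random.Random(seed)
    real_words = (rng.sample(band1, 25)      # 25 words up to the 1000-word level
                  + rng.sample(band2, 25)    # 25 words from the 1000- to 2000-word level
                  + rng.sample(band3, 50))   # 50 words from the 2000- to 3700-word level
    items = [(w, True) for w in real_words] + [(w, False) for w in pseudowords]
    forms = []
    for _ in range(n_forms):
        order = items[:]
        rng.shuffle(order)                   # forms A, B, C differ in item order only
        forms.append(order)
    return forms
```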
6 Correction formulae

A detailed comparative review of the different correction formulae proposed for transforming raw Yes/No scores has been made by Huibregtse and Admiraal (1999). An original index ISDT, based on Signal Detection Theory (SDT), is proposed. When compared to other correction formulae, ISDT is shown by the authors to be the only one to meet the three following criteria in a satisfactory way:
• taking into account different types of correct and incorrect responses;
• taking into account the correction for guessing;
• neutralizing the individual response style (Nunnally and Bernstein, 1994).
Huibregtse and Admiraal's study was limited to a theoretical approach without any reference to data. The use of the different formulae led to corrected scores which, at least theoretically, could lead to dramatic differences. It was shown that Meara's Δm and Huibregtse and Admiraal's ISDT are linked in a monotonic but not linear relation. When compared with the correction for guessing formula,5 however, the relation is no longer monotonic. This results in large differences in subjects' rank orders.

Here, a different approach, based on specific test results, is proposed. Our study focuses on a comparison6 between discrete vs. continuous models which could be applied to our data. On the basis of our data, practical implications of both theoretical models are compared. A more general terminology is therefore used here, referring to 'participants', for example, instead of 'learners' and 'responses' instead of 'answers'.

5 We use the term 'correction for guessing' in the same sense as Huibregtse and Admiraal. Later on in this article, this terminological issue will be discussed in detail.
6 At this point we would like to clarify that the following discussion exceeds the frame of a specific domain; vocabulary testing serves merely as an illustration of what is central to this discussion.

a Discrete models

MC format: The correction for guessing formula applied to the raw scores of the multiple-choice (MC) grammar test is widely used in the field of language testing. As explained below, it consists in a correction for blind guessing (cfbg), which is not the case with other formulae that bear the ambiguous 'correction for guessing' label. The aim of this correction is to take into account the fact that subjects have a good chance of obtaining the correct response by guessing, in which case the credit awarded fails to reflect the subjects' real knowledge. The final score therefore ultimately results in an overestimation of what is intended to be measured. The theoretical model behind the transformation from raw scores (number of correct responses) into corrected scores (number of items really known by the participants) rests on two all-or-nothing hypotheses:
• Hypothesis 1: The participant either knows or does not know the answer. There is nothing in between. (This is therefore called an all-or-nothing or discrete model.)
• Hypothesis 2: If the participant knows the answer, his or her choice will evidently be correct. If the participant does not know the answer, he or she will either refrain from answering or resort to a blind guess. In this case, the participant has a chance of 1/k of hitting the correct answer, k being the total number of choices.
These two assumptions allow a corrected score to be inferred unequivocally from the observable data, which may be interpreted as the number of known items. This is illustrated for the particular case of a four-alternative MC in Figure 2. The observable data collected for one participant can be distributed into three separate classes: 1) the correct responses (the number of items within this class equals the raw score); 2) the incorrect responses; and 3) the items that remain unanswered by the subject. In applying the two all-or-nothing assumptions, the data are divided into two different classes: the items which are known by the participant and those which are not. The first class corresponds to the corrected score that we are looking for. The second class can be subdivided into two new sub-categories: the items for which the participant makes a choice strictly at random (i.e., blind guess) and those that the participant left unanswered (i.e., omitted response). Finally,
Figure 2 Inferring the corrected score from the observable data in the case of a four alternatives MC test using the cfbg (correction for blind guessing) formula Note: According to this model, when the subject decides to answer an unknown item, the probability of getting a lucky guess depends only on the number of alternatives
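The cfbg computation illustrated in Figure 2 can be written down in a few lines. The sketch below is our own minimal illustration of the formula discussed in the text (raw score minus 1/(k − 1) of the incorrect responses), together with the equivalent per-item scoring; the function names are ours.

```python
# Correction for blind guessing (cfbg): under the all-or-nothing model, every
# incorrect response is an unlucky guess, and lucky guesses amount to
# 1/(k - 1) of the unlucky ones, so they are subtracted from the raw score.

def cfbg_score(n_correct, n_incorrect, k):
    """Estimated number of items really known (omitted items simply do not count)."""
    return n_correct - n_incorrect / (k - 1)

def cfbg_item_score(is_correct, is_omitted, k):
    """Equivalent per-item scoring: +1 correct, -1/(k-1) incorrect, 0 omitted.

    Summing these item scores gives cfbg_score, which is what allows Cronbach's
    alpha to be computed on the corrected scores."""
    if is_omitted:
        return 0.0
    return 1.0 if is_correct else -1.0 / (k - 1)

# Four-alternative MC (k = 4): 40 correct, 30 incorrect, 8 omitted out of 78 items.
print(cfbg_score(40, 30, k=4))   # 30.0 items estimated as known
```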
the first of these sub-categories may again be subdivided into lucky guesses that result in an observable correct response and unlucky guesses that lead to an observable incorrect response. Figure 2 illustrates that the number of lucky guesses equals 1/3 (1/(k − 1) in general, with k representing the number of alternatives) of the number of unlucky guesses. Since the number of unlucky guesses equals the observable number of incorrect items, the corrected score can be computed by simply subtracting 1/3 of the incorrect responses from the raw score. Another way of calculating the corrected score consists of crediting one point for each correct response and subtracting 1/3 of a point for each false response. This formulation seems most popular among teachers and students. It allows for correct computation of the test's reliability by means of Cronbach's alpha, considering that each individual item score ranges from −1/3 to 1 (a necessary condition for applying the alpha formula is that the total score is the sum of the item scores). This formulation, however, is not as explicit with regard to the underlying assumptions as the previous one. It leads one to think that it will only render the scoring more severe when in fact there is more to it.

The use of the cfbg formula leads to qualitatively different results depending on the test format and the test conditions. If, with a computerized version for example, a response to each item is required (forced decision task), the omitted response category is automatically ruled out and the formula is then reduced to a simple linear transformation of the raw score.7 The rank order of the testees therefore remains unchanged. Whether the test reliability is computed from the raw data or from the corrected scores has no consequence. The only implication of the transformation is that it provides a different scale in absolute terms, which may be of interest only in a criterion-referenced approach. It should be noted that the weaker the participant, the more this formula reduces his or her score. On the other hand, as far as the classical situation of a paper-and-pencil test is concerned, the presence of a considerable number of unanswered items for several testees makes the use of the formula more imperative, and it will result in noticeable differences in the testees' rank order. The larger the individual differences in responding
7 Generally, the corrected score is a function of both the number of correct responses and the number of omitted responses. Two testees with the same number of correct responses may end up with different corrected scores depending on their respective number of omitted responses. Clearly this will influence the testees’ rank order. In the case that the omitted response class is non-existent, the corrected score becomes a function solely of the number of correct responses (the raw score).
or not responding to the unknown items, the more the testees' rank order will be distorted. A poor correlation between these individual differences and the proficiency level will also increase the discrepancy in rank orders between raw and corrected scores. In other words, taking into account those items which were not answered by the testee is at the heart of the transformation. As can be seen from the results of the grammar MC, our population shows large between-testee variation in answering behaviour. Table 2 shows the distribution of the omitted response category among the testees. About a quarter of the students did not respond to all the items, and this to varying degrees.

Table 2 Frequency distribution of the number of omitted items obtained for the three orders (A, B, C) of the Grammar MC

Number of        Number of subjects                      Subject percentage
omitted items    A (n = 166)  B (n = 162)  C (n = 160)     A      B      C
[0]                  129          117          128        77.7   72.2   80.0
[1, 2]                 8           17           11         4.8   10.5    6.9
[3, 10]               10            8            8         6.0    4.9    5.0
[11, 20]               6            4            4         3.6    2.5    2.5
[21, 30]               6            8            3         3.6    4.9    1.9
[31, 40]               6            5            4         3.6    3.1    2.5
[41, 50]               0            0            1         0.0    0.0    0.6
[51, 78]               1            3            1         0.6    1.9    0.6

Because of the statistical variability added by the transformation from raw to corrected scores, the risk of a large decrease in the test's reliability cannot be excluded. Comparison between Cronbach's alpha calculated with raw scores (.89) vs. corrected scores (.88) shows, however, that this decrease is insignificant for the MC. Detailed results (see Table 1) confirm this for each of the three forms (A, B, C). As there is no decrease in reliability, corrected scores obtained with the cfbg are to be considered the most appropriate for ranking students who do not answer all items while maintaining a sufficient overall measurement accuracy.

True/False format: When the cfbg is applied to the results of a True/False test, all the claims made above remain relevant. The True/False format may be considered as a particular case of a MC with two alternatives. However, it should be pointed out that the probability of a blind guess reaches .50. It enlarges the correction factor and it adds a greater statistical variability. Consider, for example, two learners of the same proficiency level (20 known items out of 100): the first one refrains from answering the 80 unknown items, the other
one answers all unknown items at random. In the case of a four-alternative MC, the difference between both subjects' raw scores would be 20 vs. 40 (20 + 80/4). In the case of a True/False format, the difference would reach 20 vs. 60 (20 + 80/2). However, the variability added becomes larger: the error in estimating 80/2 will be twice that of estimating 80/4. This example illustrates that it is even more important to be aware of possible distortions in the case of a True/False format, but also that care must be taken over a possible lack of test reliability when scores are corrected for guessing.

As was the case for the grammar MC, data from the grammar True/False showed that several subjects did not provide a response to every item. Table 3 summarizes the data for three forms of the test that differed only in the order in which the 100 items were presented. The overall number of subjects that skipped some items represents a higher proportion of the population (32.5%) than was the case for the MC format (23.4%). Basing our results solely on the raw scores would clearly penalize those students. On the other hand, correcting for guessing leads to a systematic decrease of the test reliability, as shown in Table 1 (from .94 to .83 for 1997 and from .90 to .83 for 1998). Contrary to the situation of the MC, it is not clear whether the cfbg formula should be applied or not. Empirical data (although informally obtained) on the poor predictive validity of the True/False test convinced us to replace it by the four-alternatives MC format. The number of students that teachers reported to be misplaced on the basis of the True/False placement test appeared to be larger than with the MC.

Table 3 Frequency distribution of the number of omitted items obtained for the three orders (A, B, C) of the Grammar True/False

Number of        Number of subjects                      Subject percentage
omitted items    A (n = 254)  B (n = 254)  C (n = 264)     A      B      C
[0]                  180          170          171        70.9   66.9   64.8
[1, 2]                27           33           41        10.6   13.0   15.5
[3, 10]               16           10           15         6.3    3.9    5.7
[11, 20]               9           15           12         3.5    5.9    4.5
[21, 30]               3           10            7         1.2    3.9    2.7
[31, 40]               7            3            4         2.8    1.2    1.5
[41, 50]               2            4            1         0.8    1.6    0.4
[51, 100]             10            9           13         3.9    3.5    4.9

A second point of interest with the True/False format concerns the relationship between the performances of subjects on the true vs. false items. A first question is whether or not the subjects exhibit a difference in performance between both kinds of questions. Therefore the correlation between both scores was examined in comparison with the theoretical value expected under the hypothesis of no difference in behaviour between true vs. false items. The Spearman–Brown formula provides a means of calculating the reliability of half a test, alpha_h, from the entire test's reliability, alpha_e. The formula in this case is simply:

alpha_h = alpha_e / (2 − alpha_e)

Assuming there is no difference in what is measured by the two parts, alpha_h has been proved to equal the correlation between the scores on both half-tests (Nunnally and Bernstein, 1994). An experimental verification can also be carried out by directly comparing the obtained correlation with the average of a set of correlations between scores obtained by randomly splitting the test into half-parts. If the assumption holds (i.e., there is no difference), the correlation should not differ substantially from the theoretical value computed with the formula, and neither should it differ from the average of real correlations computed with random splits. This is clearly not the case, as illustrated by the values of Table 4: subjects' performances differ when confronted with true vs. false items. Cronbach's alpha was calculated from the raw and corrected scores for the three respective orders (A, B, C). The mean reliability for the entire tests was .92 for the raw scores and .83 for the corrected scores.
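The verification procedure described in this paragraph can be restated as a short computation. The sketch below is an illustrative reimplementation under our own assumptions (a subjects-by-items matrix of item scores called `scores`); it is not the authors' code.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a subjects x items matrix of item scores."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

def predicted_half_reliability(alpha_full):
    """Spearman-Brown prediction for half a test: alpha_e / (2 - alpha_e)."""
    return alpha_full / (2 - alpha_full)

def mean_random_split_correlation(scores, n_splits=50, seed=0):
    """Mean (and sd) of correlations between scores on randomly drawn half-tests."""
    rng = np.random.default_rng(seed)
    n_items = scores.shape[1]
    rs = []
    for _ in range(n_splits):
        perm = rng.permutation(n_items)
        first, second = perm[: n_items // 2], perm[n_items // 2:]
        rs.append(np.corrcoef(scores[:, first].sum(axis=1),
                              scores[:, second].sum(axis=1))[0, 1])
    return float(np.mean(rs)), float(np.std(rs, ddof=1))
```

Using the raw True/False figures reported in Table 4 below, an entire-test alpha of .921 predicts a half-test correlation of .921 / (2 − .921) ≈ .85, which is indeed close to the random split-half means; the much lower correlations (around .60) between the true-item and false-item halves are what signals that the two item types do not behave alike.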
Table 4 Correlation between the scores on the True vs. False items of the True/False grammar test compared with theoretical (*) and estimated (**) half-test reliability

Grammar            Scores         Test reliability    Half-test reliability                Correlation
True/False          /100          Cronbach's alpha    Spearman–     Split-half n = 50      True/False
                                                      Brown (*)     mean (sd) (**)
Order A (n = 254)  Raw                 .921              .853        .859 (.015)              .635
                   Corr. (cfbg)        .822              .697        .704 (.025)              .357
Order B (n = 254)  Raw                 .924              .859        .862 (.013)              .595
                   Corr. (cfbg)        .844              .729        .737 (.023)              .307
Order C (n = 264)  Raw                 .925              .861        .861 (.013)              .577
                   Corr. (cfbg)        .817              .691        .693 (.026)              .204

Notes: Raw scores are the number of correct responses; corrected scores are calculated with the cfbg formula. If there were no difference between what is measured by True vs. False items, the correlation should equal the half-test reliability.
The use of the Spearman–Brown formula and the 50 random split-half correlations yielded quasi-identical values: .86 for the raw scores and .71 for the corrected scores. The sd of the 50 correlations was also very low (.014 and .025 respectively). Correlations between the two test halves composed, respectively, of exclusively true items and exclusively false items appeared to be substantially lower, i.e., .60 (raw scores) and .29 (corrected scores) (three orders together). This difference can be accounted for in several ways:
• First, it is possible that it only reflects a difference in discriminability between the two item categories. On average, false items appear to perform8 better than true items (Table 5), and this effect is reinforced by using corrected scores. It is not surprising that false items provide more accurate information, since only one grammatical issue is addressed, whereas the true items may involve more grammatical features.
• Secondly, the observed difference might be due to a bias in testees' responses. Unfortunately, in the literature the term 'response bias' is used to designate different phenomena. Here we use it in its ordinary sense, i.e., a tendency for a given subject to provide more/fewer responses of one type (true or yes) than of the other (false or no). As has already been pointed out by Huibregtse and Admiraal (1999), the correction for guessing does not help to eliminate a possible response bias. This is clearly confirmed by the strong decrease of the correlation between true and false items when corrected scores are used. This decrease is the opposite of what one would predict if the cfbg formula could handle response bias.
Table 5 Comparison of the means of contribution to the test reliability of True vs. False items

Grammar            Scores              Contribution to the reliability
True/False          /100          True items mean (sd)    False items mean (sd)
Order A (n = 254)  Raw                .213 (.203)              .298 (.202)
                   Corr. (cfbg)       .060 (.162)              .193 (.195)
Order B (n = 254)  Raw                .242 (.180)              .264 (.217)
                   Corr. (cfbg)       .099 (.171)              .189 (.212)
Order C (n = 264)  Raw                .289 (.186)              .261 (.202)
                   Corr. (cfbg)       .088 (.165)              .171 (.204)
8 The item discriminability index consists of a linear transformation of its contribution to the overall test reliability (Beeckmans and De Valck, 1993).
• Thirdly, both categories of items may measure substantially different skills, in which case the construct validity may be questioned.
Discussing these possible explanations in the case of the present grammar True/False results would be beyond the scope of this study. Nevertheless, the correlation of .60 (raw scores) between true items and false items remains substantial. It should be emphasized that such a correlation, although significantly lower than it should be in the absence of any difference between the two classes of items, still shows the test to be reliable.

Yes/No vocabulary test: Apart from the pitfalls of the cfbg formula already identified, which result essentially from the lack of realism of the all-or-nothing assumptions, its use in the case of the Yes/No vocabulary test raises further specific questions:
1) The proportion of real words vs. pseudowords varies from one published study to another. In all cases, the real words are more frequent, and this complicates the assumptions related to random guessing. If the subject has the feeling that there are more real words than pseudowords, or if he or she systematically decides to give one response when he or she does not know, the hypothesis of a probability of .50 becomes inadequate. On the other hand, constructing the test with an equal proportion of real words vs. pseudowords would render the format less economic, because fewer words could be tested in the same time span.
2) In its original form (where the subject is asked to tick the words), the distinction between false response and omitted response is not possible for the word items: if a subject has not ticked a word, this could either mean that he or she does not know the word or that he or she decided not to answer this item. The fact that it is impossible to distinguish between both alternatives undermines the central principle underlying the cfbg formula. Remember that, among the items which are not correct, the boundary between false responses and omitted responses is crucial. Moving the boundary towards more omitted responses will increase the corrected score. Conversely, moving the boundary towards fewer omitted items will decrease the corrected score. This variation, which is controlled in the case of the True/False test, cannot be controlled in the original Yes/No format.
3) The most important drawback of the classical correction for guessing concerns the nature of the task involved in the Yes/No test in comparison with the True/False test. In the latter case, the presence of a possible bias towards one or the other of the two responses can be considered as being part of the task. In the True/False grammar test, for example, a subject who tends to use only simple structures
and tries very hard not to make mistakes would exhibit a bias in judging many items to be incorrect. One could argue that this bias is part of the task and relevant with reference to the competence that is measured. On the other hand, in the Yes/No vocabulary test, the subject's task is closer to self-assessment than to a real language task. The bias can therefore only be attributed to factors which are beyond the competence of the subject. In a study with a similar student population, Janssens (1999) showed that the students demonstrate a clear inability to estimate their language proficiency accurately as far as vocabulary is concerned. The experiment was set up to check whether the students were able to use contextual clues to infer the meaning of words they did not know. First, the subjects were presented with a list of target words and were asked to give the French translation (a). Secondly, the subjects received a short text containing the target words and were asked to underline the words they did not know (b). Finally, they were given the text plus the target words and were asked to translate the words once again (c). Comparing (b) and (c) provides a means of evaluating students' self-assessment. Most students (69%) had a tendency to overestimate their vocabulary knowledge, and there were large individual differences in their self-evaluation, not solely due to their differences in language competence. A further study would be needed to determine whether a similar potential bias existed in the present data. This issue gains importance when we consider the high number of false alarms in this experiment, in contrast with other studies.

A procedure similar to the one used with the True/False data was applied to the results on the vocabulary Yes/No (see Table 6), but with two differences. First, the distinction between raw and corrected scores became irrelevant. Because no information is available concerning possible omitted responses, the cfbg is reduced to a linear transformation which affects neither the correlations nor the estimated reliabilities in any way. Secondly, because of the unequal proportion of words vs. pseudowords, in addition to splitting the whole test into halves, it was also split into two uneven parts composed of 60 and 40 items. This was done in order to allow for a meaningful comparison between the correlation obtained by the random split and the one obtained by structuring the test into 60 words vs. 40 pseudowords.

As was the case with the True/False data, both procedures for inferring correlations between two complementary parts of the entire test (Spearman–Brown formula and random splitting) yielded very similar results. On average, r = .70 was obtained with form I and r = .74 with form II; these values are comparable in size to those obtained with the corrected scores on the True/False grammar test (Table 4), but lower than those obtained with the raw scores on the same test. The standard deviations of the 50 correlations obtained by random splitting were also very low.
Table 6 Correlation between scores on words vs. pseudowords of the Yes/No vocabulary test compared with theoretical half-test reliability (*), estimated half-test reliability (**) and estimated part-test reliability (***)

Vocabulary Yes/No        Test reliability    Half-test reliability                Part-test (60–40) reliability   Correlation
Vers  Order              Cronbach's alpha    Spearman–     Split-half n = 50      Split-part n = 50               words/
                                             Brown (*)     mean (sd) (**)         mean (sd) (***)                 pseudowords
I     A (n = 78)              .818             .692         .723 (.052)            .726 (.047)                      –.353
      B (n = 79)              .771             .627         .664 (.052)            .635 (.054)                      –.414
      C (n = 78)              .848             .736         .761 (.030)            .757 (.040)                      –.409
      Total (n = 235)         .820             .695         .722 (.032)            .714 (.036)                      –.373
II    A (n = 89)              .826             .704         .736 (.033)            .720 (.037)                      –.287
      B (n = 82)              .877             .781         .797 (.032)            .787 (.033)                      –.089
      C (n = 82)              .841             .726         .746 (.041)            .745 (.037)                      –.397
      Total (n = 253)         .850             .739         .759 (.023)            .751 (.023)                      –.264

Notes: Scores are out of 100 items (words and pseudowords); raw scores are the number of correct responses and corrected scores are calculated with the cfbg formula. Because the cfbg is a linear transformation here, raw and corrected scores yield identical reliability and correlation estimates, so a single value is given per row. If there were no difference between what is measured by word vs. pseudoword items, the correlation should equal the half-test reliability.
No difference was obtained between the results for both splits (half-test split vs. part-test split), and these reliabilities were very close (.02 difference on average) to the corresponding reliabilities computed with the Spearman–Brown formula.

The most revealing result concerns the negative correlations that were systematically obtained between partial scores on word items vs. pseudoword items. Among the three hypotheses we listed above as possible explanations for moderate decreases in correlation in the case of a True/False test, only the assumption of a bias (the second explanation) can reasonably account for the systematic negative correlation found here. A difference in discriminability between the two item categories (the first explanation) or the fact that the two item classes measure substantially different skills (the third explanation) could result in a decrease in the correlation, but it could not render it negative. The existence of a bias would, however, automatically lead towards a negative correlation, because the bias has the characteristic that it works in opposite directions at the same time. An individual bias towards 'Yes' responses will produce an increase in the partial score for the words together with a decrease in the partial score for the pseudowords, and vice versa. Since the correlation is, in fact, negative, there is a substantial possibility that the test will, in our case, primarily measure the response bias itself. Applying the cfbg would blur the results even more.

4) And what about the 'other' correction for guessing formula? So far, we have considered the correction for guessing by following the logic of a technique (cfbg) which has been developed in the specific domain of testing. We started with the MC, moved on to the True/False, and the Yes/No was thus considered as a particular case of these classical tests. Meara's approach to the Yes/No vocabulary test, however, has been somewhat different. His initial goal was to obtain a sensible measure of the proportion of words a subject knows, and the pseudowords were added solely with the aim of correcting the obtained proportion of hits. Rather than considering the whole set of data (i.e., a raw score consisting of the number of correct responses to both words and pseudowords), the raw score of interest is limited to the number of hits (60 items and not 100), so that the correct rejections are not included. This score is then corrected by a formula which is (unfortunately) also called correction for guessing (cfg). This formula resembles the previously discussed cfbg formula to some extent, but it differs in some other important respects. Figure 3 illustrates the principle of the cfg and may usefully be compared with Figure 2 (cfbg).
Figure 3 Inferring the corrected score from the observable data in the case of the vocabulary Yes/No test using the cfg (correction for guessing) formula Notes: According to this model, when the subject is confronted with items he or she does not know, the subject will guess one of the two alternatives in a certain proportion which is specific for this subject and independent of the nature of the unknown item (word or pseudoword). In this example, this proportion is 1 (is a word) to 3 (is not a word) for subject A and 3 to 1 for subject B. Both subjects know the same number of words but differ in their raw scores in accordance with their specific response biases.
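One common closed-form statement of the cfg, consistent with the model sketched in Figure 3, recovers the proportion of known words as (hit rate − false-alarm rate) / (1 − false-alarm rate). The code below is our own illustration of that formulation, not a formula quoted from the article, and the numbers simply replay the two subjects of Figure 3 with expected response counts.

```python
# cfg under the Figure 3 model: unknown items (unknown words and all pseudowords)
# are ticked with a subject-specific probability g; the false-alarm rate estimates g,
# and the proportion of known words is recovered as (h - f) / (1 - f).

def cfg_known_proportion(hits, n_words, false_alarms, n_pseudowords):
    h = hits / n_words                    # hit rate
    f = false_alarms / n_pseudowords      # false-alarm rate (estimate of the bias g)
    return (h - f) / (1 - f)

# Both subjects are assumed to know 30 of the 60 words; subject A ticks 1/4 of the
# unknown items, subject B ticks 3/4 of them (expected counts, as in Figure 3).
subject_a = cfg_known_proportion(hits=30 + 30 * 0.25, n_words=60,
                                 false_alarms=40 * 0.25, n_pseudowords=40)
subject_b = cfg_known_proportion(hits=30 + 30 * 0.75, n_words=60,
                                 false_alarms=40 * 0.75, n_pseudowords=40)
print(subject_a, subject_b)   # both come out at 0.5 despite very different raw hit counts
```

In other words, under this model two subjects with the same knowledge but different response biases receive the same corrected score, which is the sense in which the cfg takes the response bias into account.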
To simplify the discussion, we do not consider any further the possibility of the omitted response category, since this is irrelevant in the case of the classical Yes/No. What is common to both the cfbg and cfg formulae is the first all-or-nothing assumption, stating that the subject either knows or does not know the answer, with nothing in between. In the case of the cfg model, however, the possibility of knowing the answer is limited to the category of words: knowing that a pseudoword is not a word is ruled out by the model, and this possibility is therefore ignored. It follows that the set of items can be subdivided into two categories: the words known by the subject and the remainder of the
items, that is, both the words the subject does not know and the entire set of pseudowords. The second assumption also remains the same in its first part: 'when the subject knows the answer, his or her choice is evidently correct', but it is again restricted to words, since 'knowing the answer' is now limited to 'knowing the word'. The major difference lies in the method of estimating what the data will be if a subject is guessing. It is important to remember that in both models, cfbg and cfg, when the subject is guessing, nothing about any feature of the item which could be relevant to the measured competence can come into play. In the previous case (cfbg), whatever the subject's strategy when guessing (always responding true, responding alternately true and false, etc.), the usual methodological precautions in designing the test format ensure that .50 is an unbiased estimate of the probability of getting the correct response. In other words, individual response bias will not contaminate the results: the data of Figure 2 do not distinguish between a subject who may have systematically responded 'True' and a subject who may have systematically responded 'False' as far as the unknown items are concerned. By contrast, with the Yes/No format the cfg model will lead to different raw scores for two subjects who know the same number of words but who display different decision behaviour in responding to the unknown items. In the examples of Figure 3, subject A exhibits a response bias which leads to a rate of 1/4 'word' responses to the unknown items, while subject B exhibits a rate of 3/4 'word' responses and, as can be seen, the two subjects' raw scores are very different.

In conclusion, it appears that the cfg model does take the individual response bias into account. However, the way in which this bias is evaluated depends largely on the all-or-nothing assumption underlying the model. A comparison with the continuous models described below makes the theoretical drawbacks of the cfg more apparent. The presence of large biases in our student population was already evidenced by the negative correlation obtained between the subjects' performances on the words and the pseudowords (Table 6). Applying the cfg formula should therefore logically lead to a decrease in reliability. The results presented in Table 7 confirm this prediction. Both estimation procedures, half and part splits, led to comparable results. On average, the reliability goes from .91 for the raw score (based on a total of 60 words) down to .85 for the corrected (cfg) score. The very high reliability with raw scores is obviously artefactual, because the number of correct words also measures the bias itself.
Table 7  Effect of the correction for guessing (cfg) on the Yes/No test reliability when the score is limited to the 60 words (Vocabulary Yes/No, scores /60)

Vers.  Group             Score        Test rel.   Half-test reliability                 Part-test (60-40) rel.
order                                 (alpha)     Spearman-Brown   split-half m (sd)    split-part m (sd)
I      A (n = 78)        Raw          .910        .835             .841 (.027)          .837 (.024)
                         Corr. (cfg)  .842        -                .727 (.055)          .738 (.046)
       B (n = 79)        Raw          .884        .792             .802 (.029)          .794 (.036)
                         Corr. (cfg)  .819        -                .694 (.048)          .663 (.064)
       C (n = 78)        Raw          .920        .852             .858 (.021)          .852 (.026)
                         Corr. (cfg)  .867        -                .765 (.034)          .758 (.042)
       Total (n = 235)   Raw          .906        .828             .835 (.017)          .859 (.021)
                         Corr. (cfg)  .843        -                .728 (.033)          .721 (.039)
II     A (n = 89)        Raw          .909        .833             .841 (.022)          .827 (.024)
                         Corr. (cfg)  .845        -                .732 (.040)          .722 (.044)
       B (n = 82)        Raw          .914        .842             .849 (.023)          .841 (.023)
                         Corr. (cfg)  .875        -                .777 (.041)          .760 (.039)
       C (n = 82)        Raw          .907        .830             .830 (.030)          .827 (.034)
                         Corr. (cfg)  .848        -                .736 (.048)          .732 (.045)
       Total (n = 253)   Raw          .909        .833             .837 (.016)          .829 (.021)
                         Corr. (cfg)  .855        -                .747 (.029)          .736 (.029)

Notes: The raw score is the number of hits, i.e., the number of correct words. Split-half and split-part values are means (standard deviations) over n = 50 splits; the Spearman–Brown column gives the half-test reliability predicted from Cronbach's alpha. The corrected score is computed with the cfg formula; in this case Cronbach's alpha can only be estimated from the split-half reliability by means of the Spearman–Brown formula.
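The estimation route mentioned in the notes to Table 7 can be made concrete in a few lines: split the word items in two, correlate the two half-scores across testees and step the correlation up with the Spearman–Brown formula. The sketch below uses simulated responses (the data, sample size and single random split are hypothetical); the table itself reports means and standard deviations over repeated splits (n = 50).

    # Split-half reliability stepped up with Spearman-Brown (hypothetical data).
    import numpy as np

    def spearman_brown(r_half):
        """Full-length reliability estimated from the correlation between two half-tests."""
        return 2 * r_half / (1 + r_half)

    rng = np.random.default_rng(0)
    scores = rng.integers(0, 2, size=(235, 60))   # 235 testees x 60 word items (simulated)

    items = rng.permutation(60)                   # one random split into two halves of 30
    half_a = scores[:, items[:30]].sum(axis=1)
    half_b = scores[:, items[30:]].sum(axis=1)

    r_half = np.corrcoef(half_a, half_b)[0, 1]
    print(spearman_brown(r_half))                 # estimate of the full-test reliability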
In the case of corrected scores, the situation is less clear because of the impossibility of dealing with possible omitted responses, which were shown to be frequent with the same population in the case of the MC and the True/False. It is therefore possible that the reliability of .85 remains overestimated.

In conclusion, controlling the response bias for the Yes/No format, in order to be able to eliminate it from the raw data, appears to be a central issue. Correction methods based on the discrete model deal with this problem only in an indirect way. Response bias, on the other hand, is at the heart of the theory based on the continuous model (i.e., SDT); this is discussed in the next section.

b Continuous models: Continuous models can deal with the issue mentioned above because their theoretical foundations clearly distinguish the sensitivity, which is the relevant variable, from the response bias, which has to be identified and ruled out. The theory was initially formulated in the field of signal detection (SDT; e.g., Tanner and Swets, 1954; Green and Swets, 1966), where the task consists of
detecting a signal which is present or absent, and most experimental evidence for the model has been established in this field. The model is an alternative to threshold theory.9 Threshold theory, which also rests on all-or-nothing assumptions, is the counterpart of the correction for guessing formula within detection theory. Opposed to these discrete models is the d' or ROC (Receiver Operating Characteristic) curve model. The advantages of this continuous model have been widely confirmed, first in the field of signal detection and later in various other domains: experimental psychology, medicine, meteorology, etc. (Swets et al., 2000). Because it is not yet widely used in the domain of language testing, we provide a detailed description of the method as it can be applied to the Yes/No vocabulary test (see Figure 4).

Contrary to the discrete approach, SDT posits a continuum ranging from 'being sure of the presence of a signal/word' to 'being sure of the absence of the signal/pseudoword'; the middle of the continuum corresponds to maximal doubt. This continuum represents a latent dimension for a particular subject, as represented in Figure 4a. In the case of a confidence rating of the responses, the dimension as well as the distribution of the items on this dimension can be observed. Conversely, in the case of a Yes/No format, this dimension cannot be observed and has to be inferred by the model: the task forces the subject to dichotomize information which the model assumes to be continuous. When an item is presented to a testee, the item falls somewhere on the testee's latent confidence rating scale. If the subject's proficiency is not zero, a pseudoword will tend to fall on the left-hand side ('is not a word') of the continuum and a word will tend to fall on the right-hand side ('is a word'). The whole set of items thus splits into two distributions, one for the pseudowords located on the left and one for the words located on the right. The shape of these distributions is crucial for the application of the model, since different shapes will lead to differences in the testees' rank orders. This issue is reconsidered in detail below; for now, let us assume the distributions to be Gaussian curves with equal variances. The distance separating the two distributions (i.e., their lack of overlap) for one testee is called the sensitivity (d') for this testee, and it is directly linked to the testee's competence. The core hypothesis is that the distributions, and thus the competence that one wants to measure, are strictly independent of the decision-making process.
9 It should be noted that Δm as used by Meara, although based on SDT, also rests on all-or-nothing assumptions. We do not examine this formula in detail since this has already been done in a complete and convincing way by Huibregtse and Admiraal (1999).
Figure 4 Results of several testees for the Yes/No test as modelled by SDT
Notes: d': sensitivity; Cr.: criterion.
The decision-making process itself depends on the position of a criterion (Cr.) placed somewhere along the latent dimension. Every item falling to the left of Cr. results in an 'is not a word' decision; every item falling to the right of Cr. results in an 'is a word' decision. The more the criterion is situated towards the pole 'is a word', the more the testee tends to answer 'is not a word' when he or she hesitates, and vice versa. Figures 4a to 4g illustrate the way in which the model formalizes the distributions for different testees for whom the percentages of correctly detected words (hits in SDT terms) are the same, but who vary in their percentage of 'is a word' responses to pseudowords (false alarms in SDT terms). Testee 4a is special in the sense that his or her decision criterion Cr. lies at the intersection of the two distributions. In this particular case the miss rate equals the false-alarm rate, which means that the percentage of mistakes in both directions (not ticking a word vs. ticking a pseudoword) is the same. The decision criterion used by this testee is neutral: one could say that there is no response bias in this case. Testees 4b, 4c and 4d have a growing percentage of false alarms. The effect of these increasing false alarms is a shift of the decision criterion (testee 4d, for example, has a preference for the answer 'is a word' in case of doubt) as well as – and this is the relevant information – a decline in sensitivity (testee 4d is almost unable to distinguish between words and pseudowords). Conversely, a decline in the number of false alarms (testees 4e, 4f and 4g) is found with testees who are both more cautious and more competent. The comparison between the d' of the two extreme testees, 4d and 4g, illustrates the importance of the false-alarm factor in distinguishing testees of low and high proficiency.

Another representation of the same data is provided by the ROC curves. The axes correspond to the areas under the two distributions to the right of Cr.: the grey area (y-axis) corresponds to the proportion of 'is a word' answers among the presented words (hits), and the dotted area (x-axis) to the proportion of 'is a word' answers among the presented pseudowords (false alarms). The positive diagonal corresponds to the extreme case of a testee of zero competence; the negative diagonal corresponds to a testee whose response bias is neutral, like testee 4a. The curves computed from both areas and passing through each of the points represent all testees of the same competence (same d') who differ in their Cr. position. The symmetry of the curves reflects the equality of variance of the two distributions, and their general shape depends on the theoretical distribution, which in this case is assumed to be Gaussian.

Practically speaking, two measures can be regarded as corrected scores: either the d' itself, which reflects the discriminability between words and pseudowords, or the percentage of correct responses among the words, computed as what this percentage would have been if the criterion Cr. had been neutral. Both indices are linked by a monotonic and almost linear relation, except for testees who perform very well. The advantage of the latter option is that it can be interpreted as a percentage of known words, which is the primary aim of the test. It also provides an operational definition of what is meant by 'knowing a word': the testee is considered to make the same number of mistakes in both directions (false-alarm rate = 1 – hit rate). A last transformation is needed, however, in order to obtain the convenient range of variation (0 to 100%) from the original range of variation (50 to 100%). (When a testee's competence is zero, the two distributions overlap and d' = 0; Cr. is then situated at the mean of both distributions and the area to the right of Cr. is still 50%.) The resulting corrected score corresponds to the number of hits corrected for bias (Hcfb).
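Under the equal-variance Gaussian assumptions just stated, these quantities can be computed directly from a testee's hit and false-alarm rates. The sketch below is one possible reading of the indices described above, not the authors' code; the function name, the boundary adjustment and the criterion parameterization are illustrative assumptions.

    # d', criterion position and an Hcfb-style corrected score under the
    # equal-variance Gaussian model (illustrative sketch).
    from scipy.stats import norm

    def sdt_indices(hit_rate, fa_rate, eps=1e-3):
        """Return (d', criterion, Hcfb) for one testee.

        hit_rate: proportion of words answered 'is a word'
        fa_rate : proportion of pseudowords answered 'is a word'
        """
        # Rates of exactly 0 or 1 give infinite z-values; the small adjustment
        # below is a common convention, not taken from the article.
        h = min(max(hit_rate, eps), 1 - eps)
        f = min(max(fa_rate, eps), 1 - eps)
        d_prime = norm.ppf(h) - norm.ppf(f)              # sensitivity
        criterion = -0.5 * (norm.ppf(h) + norm.ppf(f))   # 0 = neutral bias (Cr. at the intersection)
        # Hit rate the same d' would yield with a neutral criterion
        # (miss rate = false-alarm rate), rescaled from the 50-100% range to 0-100%.
        hcfb = 2 * (norm.cdf(d_prime / 2) - 0.5)
        return d_prime, criterion, hcfb

    # Example: 80% hits, 20% false alarms -> d' of about 1.68, a neutral criterion
    # and an Hcfb of about .60.
    print(sdt_indices(0.80, 0.20))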
The two theoretical assumptions which need to be discussed are the Gaussian hypothesis and the equality of variances.

• Normal distributions for words and pseudowords: This question may be addressed on both theoretical and empirical grounds. Theoretically, the normality assumption rests on the central limit theorem, which states that a variable will exhibit a normal distribution if it is the sum of a large number of independent variables; it is rather speculative to assume this to be the case for both the word and the pseudoword distributions. On practical grounds, normality has been widely ascertained from empirical data, but this depends principally on the particular application domain and, as far as we know, no relevant data are currently available in language testing. In an attempt to avoid this problem, Huibregtse and Admiraal (1999) proposed using another index (Isdt) based on the geometrical properties of the ROC curves (Hodos, 1970). However, this measure, albeit a non-parametric one, is not truly distribution free: MacMillan and Creelman (1996) pointed out that such measures rest on the implicit assumption that the underlying distributions are equal-variance logistic functions. Both curve families, normal and logistic, have very similar shapes, so that the difference between using Hcfb or Isdt to correct the raw score should turn out to be fairly small. Nevertheless, experimental evidence about the shape of the distributions is still needed, and this important question cannot be settled simply by deciding to use Isdt instead of Hcfb.

• Equal variance distributions: The assumption of equal variances is more important for the model to be valid with our data than is
the normality assumption.10 Figure 5 shows the effect on the modelling of testee 4a (the neutral subject of Figure 4) of increasing the variance of the word distribution. Three progressively different values of the standard deviation ratio between the two distributions are considered. In the first case, a ratio of .90, there are few consequences for the model. The representation in the ROC space shows that there is no large difference between the two theoretical curves, i.e., the curve corresponding to testee 4a and the curve corresponding to the same testee under the assumption of moderate variance inequality (a ratio of .90). The separability of the sensitivity (d') and the bias (Cr. position) remains interpretable, as does the operational definition of 'knowing a word'. When the standard deviation ratio is .60, the ROC curves are substantially different, which leads us to question the usefulness of the equal-variance model. The criterion position corresponding to a neutral response bias becomes problematic: as shown in the second panel of Figure 5, the small areas under the two distributions delimited by Cr. are no longer equal in size, which undermines the principle of equating the false-alarm and miss rates. In the last case, very large differences in the spread of words vs. pseudowords (a ratio of .30) are displayed. Clearly, all the appealing assumptions underlying the equal-variance SDT model break down: the position of Cr. in the case of a neutral bias, if left at the middle of the d' interval, no longer corresponds to the intersection of the distributions. In such an extreme case the distributions have two intersection points (see ×1 and ×2 in the third panel of Figure 5), and there are more words than pseudowords at the very left of the confidence scale ('is not a word'), which does not make sense. The ROC curve is also very different: it is widely asymmetrical and goes beyond the zero-competence line represented by the positive diagonal. In conclusion, if there is evidence that the distribution variances differ moderately, this should be taken into account when computing the corrected score. If the variances appear to be extremely different, SDT should be ruled out. In intermediate situations, the d' value should also be reconsidered.

As already mentioned, the underlying distributions could be observed directly on the basis of confidence judgements (see, for example, Green and Moses, 1966). We do not have such data at our disposal.

10 This is generally not the case. When SDT is applied to signal detection (the domain in which the theory was developed), the signal is generally the same at each trial, so that large variations between the noise distribution, on the one hand, and the signal-plus-noise distribution, on the other hand, are unlikely to occur.
Figure 5 Three examples of the unequal-variance Gaussian case for testee 4a (the neutral subject of Figure 4) and the implications for the ROC curves
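The ROC curves compared in Figure 5 can be generated by sweeping the criterion along the latent scale for a given sensitivity and standard deviation ratio. The sketch below is a hypothetical illustration (the parameter values are ours, not the article's): the pseudoword distribution is fixed at N(0, 1) and the word distribution at N(d', sd), so a pseudoword/word sd ratio of .60 corresponds to sd = 1/.60.

    # ROC curves for the equal- and unequal-variance Gaussian cases (illustrative).
    import numpy as np
    from scipy.stats import norm

    def roc_curve(d_prime, sd_words=1.0, n_points=201):
        """Hit and false-alarm rates obtained by sweeping the criterion Cr."""
        criteria = np.linspace(-4.0, d_prime + 4.0 * sd_words, n_points)
        fa = 1 - norm.cdf(criteria)                           # pseudowords right of Cr.
        hits = 1 - norm.cdf((criteria - d_prime) / sd_words)  # words right of Cr.
        return fa, hits

    for ratio in (1.0, 0.90, 0.60, 0.30):                     # pseudoword sd / word sd
        fa, hits = roc_curve(d_prime=1.5, sd_words=1.0 / ratio)
        auc = np.trapz(hits[::-1], fa[::-1])                  # approximate area under the curve
        print(f"sd ratio {ratio:.2f}: area under ROC = {auc:.3f}")
        # Plotting hits against fa (e.g., with matplotlib) reproduces the growing
        # asymmetry of the curves as the ratio moves away from 1.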
Figure 6 Fitting Gaussian curves to the obtained distributions (words and pseudowords) of the rates of word responses accumulated over the 488 testees
Note: the ratio pseudoword sd / word sd is .90.
Nevertheless, a first indirect control can be applied to our data in the following way. If the standard deviation difference is assumed to be constant across testees, the distributions of the rates of word responses for the two item categories should show a similar discrepancy when aggregated over the population. Figure 6 shows the result of this control for the Yes/No data. For example, the third bar (light grey) indicates that almost 20% of the testees responded 'is a word' to between 10% and 15% of the pseudowords. When Gaussian curves are fitted to the obtained frequencies of testees who produced different rates of word responses to each item category, the word distribution appears to be slightly more spread than the pseudoword distribution. An sd ratio of .90 was obtained, which corresponds to the bold curve in the ROC representation of Figure 5, suggesting that these data are not too far from the initial hypothesis of equal variances. We emphasize that this control cannot be considered a proof, but a reverse result (i.e., an sd ratio very far from unity) would have been sufficient to discard the SDT model for these data.

c Comparison between discrete and continuous models: A comparison between the effects on raw scores of applying the cfg, on the one hand, and the two transformations based on the continuous model, on the other hand, reveals differences which may be very large, especially when the hit rate is high. Figure 7 illustrates the effect of the three formulae for two different hit rates, .70 and .90. When the false-alarm rate increases, both corrections based on the continuous model, Hcfb and Isdt, reduce the score by a comparable amount, and much more so than the discrete model does. Such large differences are not only theoretically possible; they do in fact occur in our data.
Figure 7 Differences in applying the different correction formulae for two different hit rates
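The qualitative pattern of Figure 7 can be reproduced numerically for the discrete correction (cfg) and the continuous Hcfb index; Isdt is left out of the sketch because its formula is not reproduced in this article. The values below are an illustration computed under the equal-variance Gaussian assumption, not the article's own figures.

    # cfg vs. Hcfb for fixed hit rates and growing false-alarm rates (illustrative).
    from scipy.stats import norm

    def cfg(h, f):
        """Discrete correction: proportion of words known under the guessing model."""
        return (h - f) / (1 - f)

    def hcfb(h, f, eps=1e-3):
        """Continuous correction: hit rate at a neutral criterion, rescaled to 0-1."""
        h = min(max(h, eps), 1 - eps)
        f = min(max(f, eps), 1 - eps)
        d_prime = norm.ppf(h) - norm.ppf(f)
        return 2 * (norm.cdf(d_prime / 2) - 0.5)

    for h in (0.70, 0.90):
        for f in (0.10, 0.30, 0.50):
            print(f"hits {h:.2f}, false alarms {f:.2f}: cfg = {cfg(h, f):.2f}, Hcfb = {hcfb(h, f):.2f}")
    # As the false-alarm rate grows, Hcfb falls much faster than cfg,
    # which is the pattern discussed in the text.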
These differences are the direct consequence of the way in which the two models take the response bias into account. Given their size, it was decided to examine the effect of the two continuous correction formulae on the test reliability. The method for estimating Cronbach's alpha was exactly the same as that used for the cfg. The results are shown in Table 8, which reproduces the previous values for the raw score (out of 60 words) and the cfg correction to allow for a complete comparison. Clearly, the decrease in reliability is systematically larger (and also large in absolute terms) with the continuous model than with the cfg, which shows intermediate values. On average, the reliabilities are .91 (raw scores), .85 (cfg), .80 (Isdt) and .77 (Hcfb). It would be premature to conclude from these figures that the cfg does a better job of correcting the score than the two other correction formulae. It is possible that part of the bias remains in the cfg scores (remember the large differences shown in Figure 7 for high hit rates), which would result in an overestimation of the test reliability. Also, the impossibility of identifying potential omitted responses may have come into play.

V Discussion

The present study raises a number of questions, both on general methodological grounds and on the nature of the empirical results obtained with the Yes/No vocabulary test. From a methodological point of view, a distinction between the correction for guessing formula used with classical MC tests (cfbg) and the apparently similar formula used with the Yes/No test (cfg) has been shown to be indispensable.
Table 8  Effect of the three different corrections on the Yes/No vocabulary test reliability when the score is limited to the 60 words (Vocabulary Yes/No, scores /60)

Vers.  Group             Score         Test rel.   Half-test reliability                 Part-test (60-40) rel.
order                                  (alpha)     Spearman-Brown   split-half m (sd)    split-part m (sd)
I      A (n = 78)        Raw           .910        .835             .841 (.027)          .837 (.024)
                         Corr. (cfg)   .842        -                .727 (.055)          .738 (.046)
                         Corr. (Isdt)  .780        -                .639 (.065)          .652 (.054)
                         Corr. (Hcfb)  .748        -                .597 (.065)          .738 (.046)
       B (n = 79)        Raw           .884        .792             .802 (.029)          .794 (.036)
                         Corr. (cfg)   .819        -                .694 (.048)          .663 (.064)
                         Corr. (Isdt)  .751        -                .601 (.063)          .568 (.060)
                         Corr. (Hcfb)  .739        -                .586 (.067)          .663 (.064)
       C (n = 78)        Raw           .920        .852             .858 (.021)          .852 (.026)
                         Corr. (cfg)   .867        -                .765 (.034)          .758 (.042)
                         Corr. (Isdt)  .802        -                .669 (.036)          .666 (.049)
                         Corr. (Hcfb)  .766        -                .621 (.037)          .758 (.042)
       Total (n = 235)   Raw           .906        .828             .835 (.017)          .859 (.021)
                         Corr. (cfg)   .843        -                .728 (.033)          .721 (.039)
                         Corr. (Isdt)  .782        -                .642 (.041)          .636 (.042)
                         Corr. (Hcfb)  .754        -                .605 (.041)          .721 (.039)
II     A (n = 89)        Raw           .909        .833             .841 (.022)          .827 (.024)
                         Corr. (cfg)   .845        -                .732 (.040)          .722 (.044)
                         Corr. (Isdt)  .787        -                .649 (.042)          .635 (.047)
                         Corr. (Hcfb)  .737        -                .583 (.046)          .560 (.055)
       B (n = 82)        Raw           .914        .842             .849 (.023)          .841 (.023)
                         Corr. (cfg)   .875        -                .777 (.041)          .760 (.039)
                         Corr. (Isdt)  .851        -                .740 (.049)          .736 (.050)
                         Corr. (Hcfb)  .842        -                .727 (.048)          .722 (.051)
       C (n = 82)        Raw           .907        .830             .830 (.030)          .827 (.034)
                         Corr. (cfg)   .848        -                .736 (.048)          .732 (.045)
                         Corr. (Isdt)  .805        -                .673 (.052)          .680 (.046)
                         Corr. (Hcfb)  .765        -                .619 (.057)          .631 (.055)
       Total (n = 253)   Raw           .909        .833             .837 (.016)          .829 (.021)
                         Corr. (cfg)   .855        -                .747 (.029)          .736 (.029)
                         Corr. (Isdt)  .817        -                .690 (.032)          .686 (.030)
                         Corr. (Hcfb)  .787        -                .649 (.032)          .643 (.032)

Notes: The raw score is the number of hits, i.e., the number of correct words. Split-half and split-part values are means (standard deviations) over n = 50 splits; the Spearman–Brown column gives the half-test reliability predicted from Cronbach's alpha. With the three corrected scores, Cronbach's alpha is estimated from the split-half reliability by means of the Spearman–Brown formula.
In the first case (cfbg), the correction takes into account true random guessing, which can be estimated on the basis of the test characteristics and thus independently of the testee's decision behaviour. As a consequence, the decrease in reliability observed after correction depends exclusively on the amount of omitted responses, a factor which can easily be ruled out or controlled.
In the second case (cfg), the correction provides a control for the testee's response bias rather than for blind guessing. This bias is estimated on the basis of a discrete model in which the testee's decision rule appears to be far from realistic: the possibility of doubt, as well as the possibility of being genuinely wrong when judging a pseudoword, are occurrences the model cannot deal with. Moreover, the potential confusion between 'is not a word' responses and omitted responses when using the classical format (Meara and Buxton, 1987) renders the formula less effective. This casts suspicion on the apparently more reliable results that were obtained when using the cfg correction.

Turning to the continuous model, SDT provides a theoretical framework which seems appealing for estimating the response bias in order to eliminate it. Genuine mistakes about words or pseudowords, as well as uncertainty in the response, are clearly taken into account by the continuous model. It also provides two spaces of representation – the internal continuous confidence rating scale and the ROC curves – which are, from an illustrative point of view, very explicit. However, for the model to be useful, theoretical assumptions have to be posited, and the hypothesis of variance equality appears to be the most difficult to maintain. The data obtained here seem to confirm this assumption, but the possibility that a more proficient population would lead to a narrower word distribution cannot be excluded (see Eyckmans et al., forthcoming). Yet, as far as native speakers are concerned, it is obvious that the variance equality hypothesis is not tenable. When setting up the material for a new version of the test, native speakers never doubt the existence of the words when working at high frequency levels; the distribution of the words on a native speaker's confidence rating scale should therefore have a variance close to 0. On the other hand, among the pseudowords obtained by applying the transformation rules, it sometimes happens that a constructed pseudoword does exist but is unknown to the native speaker because it is a very rare word. In fact, all pseudowords have to be checked to make sure they do not exist. If native speakers were asked to rate their confidence in the likelihood of a pseudoword's existence, some variance could therefore be expected. A theoretical model derived from SDT – varying in both the d' and the word variance – could be postulated in order to describe different proficiency levels, including that of a native speaker. Such a sophisticated model would have to be tested by direct measurements on a confidence rating scale. Another serious issue which could also be addressed by investigating confidence rating scales is whether the difference in variance between the two distributions is stable across subjects. If this difference were to be unstable between subjects of the same proficiency level, the general approach of SDT would be totally inadequate.
A last issue which could call the continuous approach into question concerns the implicit assumption of what could be called the 'homogeneity' of the task. The SDT description implies that the testee applies one and the same decision process when answering the items; the only difference between one item and another is its position on a supposedly unidimensional confidence scale. It is possible, however, that when the testee believes the item to be a known word, he or she adopts a specific strategy which is quite different from the one adopted when he or she believes the item not to be a known word. It should be emphasized that it is the testee's belief that is under consideration here, which is different from presuming different behaviour with words vs. pseudowords, because that objective distinction is not available to the testee. If two different strategies come into play, neither of the models (discrete or continuous) would be justified, because the pseudowords would not constitute an adequate control for the response bias. In other words, the bias in the case of items believed to be known words could differ from the bias in the case of items believed not to be known words, so that inferring the bias for words from the bias for pseudowords would be spurious.

A second methodological aspect to consider is the influence on the Yes/No test reliability of applying corrections to the raw scores, whatever the ultimate choice between the two models, discrete or continuous. This problem cannot be ruled out by simply obliging the testees to answer each item, as was the case with the cfbg formula for correcting the True/False test. To deal with this issue, it is worthwhile reconsidering a major conceptual foundation of psychometric theory, namely the distinction between validity and reliability. It should be noted that reliability and validity are used here in the narrow and precise sense they have within classical test theory, so that confusion with the wider concept of validation should be avoided. From a measurement perspective, reliability has, of course, an influence on validity: a weak reliability will never lead to a very high validity. However, the opposite is not true. A very high reliability does not imply a high validity, because reliability is primarily an indication of the accuracy of what is measured, independently of the extent to which the test is in fact measuring what it is supposed to measure. This principle has to be kept in mind especially when biases can intervene in what is measured, like the response bias under discussion. Two historical facts may explain why this question has sometimes been underestimated. The first is that reliability measures such as Cronbach's alpha are presented as measures of internal consistency, i.e., of all items measuring essentially the same thing. It is obvious that the higher the internal consistency among items, the more reliable the test will be, other
things remaining equal. But it is also true that increasing the number of items will increase the test reliability, and it would nevertheless be rather unfair to state that Cronbach's alpha is a measure of the number of items. Reliability coefficients like Cronbach's alpha, KR20 and KR21 should be considered primarily as measures of the accuracy of what is measured by the test (with the emphasis on whatever it measures), internal consistency being one of the factors which increase the obtained accuracy. It then becomes clear that trying to increase the reliability by improving only the internal consistency may lead to a measure which becomes more reliable but less valid. The second historical fact has to do with the problem of test unidimensionality. Although there is no current consensus about this important question, most of the proposed methods are based on a factor-analytic approach. If some of the items measure essentially one thing and other items another thing which correlates poorly with the first, then the usual methods will capture the two dimensions involved in the test. However, if each item measures two strictly independent things in roughly similar amounts, then any procedure based on factor analysis will (unduly) indicate test unidimensionality. Moreover, if the accuracy of one or the other measured factor is high, then the resulting reliability might be high as well. The results obtained here suggest this to be the case when the response bias remains involved in the measurement.

The following examples illustrate what has to be considered questionable on the basis of these arguments. The reliability of .91 computed with KR21 obtained by Meara and Buxton (1987) might be overestimated if the response bias was not properly ruled out. In a study by Shillaw (1996), in which it is argued that the non-words are unnecessary in the Yes/No format and that they detract from the measurement quality (Read, 2000), the reliabilities show the same trend as obtained here, i.e., a marked increase in reliability when considering the scores on words only. However, our conclusion is definitely different, since the rise in reliability has to be interpreted as the consequence of the accurate measurement of the bias and not as a guarantee of a more accurate measurement of vocabulary knowledge. Another argument in favour of this interpretation is the much higher reliability obtained when considering the criterion value (estimated alpha = .89), as compared to the corresponding corrected scores. One way of confirming the artefactual nature of the reliability obtained when the bias remains embedded in the measurement would be to compare the Yes/No assessment with another vocabulary knowledge assessment in which the response bias cannot intervene (see Eyckmans et al., forthcoming).
VI Conclusions

The data presented in this study permit us to assert that the Yes/No format in its current form does not meet the required standards in terms of reliability. It suffers from a bias which cannot be handled by any of the correction methods while maintaining a sufficiently accurate measurement. The continuous model seems the most appealing one for dealing with this bias, but its theoretical assumptions have to be further validated. Whatever the ultimate choice between these models – discrete vs. continuous – it should be emphasized that the presence of a bias in the population may lead to spuriously high test reliabilities. Empirical evidence concerning the format's validity is required in order to resolve this dilemma. In addition, we should not lose sight of the possible influence of other factors (linguistic, sociocultural, meta-cognitive, etc.) on the size of the bias, and of the role of the test instructions in this respect. Further improvement of the test content is also feasible: some methodological guidelines (which may be language specific) for a more motivated choice of words, together with clear alteration principles for the pseudowords, should be formulated. Systematic investigation of the aforementioned theoretical issues, together with empirical research, will show whether the Yes/No vocabulary test is an accurate measure of receptive vocabulary size.

Acknowledgements

First of all, we would like to thank all the subjects who participated in this study. A heartfelt thank you also goes to our colleagues of the Dutch and English departments for their assistance in administering and correcting the tests and for reading the manuscript. We are indebted to Frank Boers, Philippe Anckaert, Ineke Huibregtse and Gordon Ramsey for the time they invested in reading and commenting on this study and for checking the quality of our English. Finally, we would like to express our gratitude to Paul Meara and the anonymous reviewers of Language Testing for their thoughtful reading and insightful remarks. We feel that our article has surely benefited from their recommendations.

VII References

Abels, M. 1994: Ken ik dit woord? Unpublished doctoral dissertation, Catholic University of Nijmegen. Available from the author.
Anderson, R.C. and Freebody, P. 1983: Reading comprehension and the assessment and acquisition of word knowledge. In Hutson, B.A., editor, Advances in reading/language research. Greenwich: JAI Press.
Beeckmans, R. and de Valck, I. 1993: L'analyse des items sur base d'un indice unique de contribution à la fidélité globale: description de la méthode et étude exploratoire des conditions d'application à des données fortement bruitées. Rapport d'activités de l'Institut de Phonétique de Bruxelles 29, 51–81.
Dieltjens, L., Vanparys, J., Baten, L., Claes, M.-T., Alkema, P. and Lodewick, J. 1997: Woorden in Context Deel 1. Brussels: De Boeck.
—— 1995: Woorden in Context Deel 2. Brussels: De Boeck.
Eyckmans, J., Beeckmans, R., Janssens, V., Dufranne, M. and Van De Velde, H. forthcoming: An inquiry into the validity of the Yes/No vocabulary test: can concurrent validation help to establish a sensible correction method? Unpublished manuscript, submitted for publication. Available from the author.
Green, D.M. and Moses, F.L. 1966: On the equivalence of two recognition measures of short-term memory. Psychological Bulletin 66, 228–34.
Green, D.M. and Swets, J.A. 1966: Signal detection theory and psychophysics. New York: John Wiley.
Hacquebord, H. 1999: Lees- en luisterbegrip van studieteksten bij Nederlandse en anderstalige leerlingen en studenten. In Huls, E. and Weltens, B., editors, Artikelen van de Derde Sociolinguïstische Conferentie. Delft: Eburon.
Hermans, D. 2000: Word production in a foreign language. Unpublished dissertation, University of Nijmegen. Available from the author.
Hodos, W. 1970: Nonparametric index of response bias for use in detection and recognition experiments. Psychological Bulletin 74, 351–54.
Huibregtse, I. and Admiraal, W. 1999: De score op een ja/nee woordenschattoets: correctie voor raden en persoonlijke antwoordstijl. Tijdschrift voor Onderwijsresearch 24, 110–24.
Janssens, V. 1999: Over 'slapen' en 'snurken' en de hulp van de context hierbij. ANBF-nieuwsbrief 4, 29–45.
Lord, F.M. 1980: Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
MacMillan, N.A. and Creelman, C.D. 1996: Triangles in ROC space: history and theory of 'nonparametric' measures of sensitivity and response bias. Psychonomic Bulletin and Review 3, 164–70.
Meara, P. 1992: EFL vocabulary test. Swansea, UK: Centre for Applied Language Studies.
—— 1996: The dimensions of lexical competence. In Brown, G., Malmkjær, K. and Williams, J., editors, Performance and competence in second language acquisition. Cambridge: Cambridge University Press, 35–53.
Meara, P. and Buxton, B. 1987: An alternative to multiple choice vocabulary tests. Language Testing 4, 142–51.
Meara, P. and Jones, G. 1988: Vocabulary size as a placement indicator. In Grunwell, P., editor, British Studies in Applied Linguistics, Volume 3: Applied linguistics in society. London: Centre for Information on Language Teaching and Research.
—— 1990: Eurocentres vocabulary size test, version 10KA. Zurich: Eurocentres Learning Service.
Nation, P. 1990: Teaching and learning vocabulary. New York: Newbury House.
Nunnally, J.C. and Bernstein, I.H. 1994: Psychometric theory. New York: McGraw-Hill.
Oscarson, M. 1997: Self-assessment of foreign and second language proficiency. In Clapham, C. and Corson, D., editors, Encyclopedia of Language and Education, Volume 7: Language Testing and Assessment. Dordrecht: Kluwer Academic Publishers.
Read, J. 1993: The development of a new measure of L2 vocabulary knowledge. Language Testing 10, 355–71.
—— 1997a: Vocabulary and testing. In Schmitt, N. and McCarthy, M., editors, Vocabulary: description, acquisition and pedagogy. Cambridge: Cambridge University Press, 303–20.
—— 1997b: Assessing vocabulary in a second language. In Clapham, C. and Corson, D., editors, Encyclopedia of Language and Education, Volume 7: Language Testing and Assessment. Dordrecht: Kluwer.
—— 2000: Assessing vocabulary. Cambridge: Cambridge University Press.
Richards, J.C. 1976: The role of vocabulary teaching. TESOL Quarterly 10, 77–89.
Shillaw, J. 1996: The application of Rasch modelling to yes/no vocabulary tests. Vocabulary Acquisition Research Group discussion document js96a, available at www.swan.ac.uk/cals/vlibrary/js96a.htm
Sims, V.M. 1929: The reliability and validity of four types of vocabulary test. Journal of Educational Research 20, 91–96.
Swets, J.A., Dawes, R.M. and Monahan, J. 2000: Psychological science can improve diagnostic decisions. Psychological Science in the Public Interest (supplement to Psychological Science) 1(1), 1–26.
Tanner, W.P., Jr. and Swets, J.A. 1954: A decision-making theory of visual detection. Psychological Review 61, 401–409.
Tilley, H.C. 1936: A technique for determining the relative difficulty of word meanings among elementary school children. Journal of Experimental Education 5, 61–64.
Van De Walle, P. 1999: Onderzoek naar de omvang van de receptieve en productieve kennis van de basiswoordenschat van zesdeklassers uit het A.S.O. in het Brussels Hoofdstedelijk Gewest. Mémoire en vue de l'obtention du titre de Licencié en Langues et Littératures Germaniques, Université Libre de Bruxelles. Unpublished manuscript. Available from the author.
Zimmerman, J., Broder, P.K., Shaughnessy, J.J. and Underwood, B.J. 1977: A recognition test of vocabulary using signal-detection measures and some correlates of word and nonword recognition. Intelligence 1, 5–13.