OPTIMISING SECOND LANGUAGE VOCABULARY LEARNING FROM FLASHCARDS
BY
TATSUYA NAKATA
A thesis submitted to the Victoria University of Wellington in fulfilment of the requirements for the degree of Doctor of Philosophy in Applied Linguistics
Victoria University of Wellington (2013)
ABSTRACT
The purpose of this thesis is to investigate how we can optimise vocabulary learning from flashcards (word cards) in a second or foreign language (L2). Flashcards are a set of cards where the L2 word is written on one side and its meaning, usually in the form of a first language (L1) translation, L2 synonym, or L2 definition, is written on the other. Despite the increasing recognition that flashcard learning is effective, efficient, and useful, our understanding of the optimal way to learn from flashcards is still limited. In order to investigate how we can optimise L2 vocabulary learning from flashcards, this thesis examined the effects of the following factors on flashcard learning: block size, retrieval formats, absolute spacing, relative spacing, retrieval frequency, and feedback timing.
Study 1 examined the effects of block size on flashcard learning. Block size refers to the number of words to be learnt at once. Existing studies on block size are limited in that block size and spacing were confounded. Study 1 set out to investigate the effects of block size in a more rigorous manner than existing studies by manipulating spacing as well as block size. The results showed that although a large block size is more effective than a small one when spacing is confounded, there is no difference between the two when they have equivalent spacing. The findings imply that introducing a large amount of spacing between encounters may be more important than using a particular block size. Study 1 also showed that superior performance during learning may not necessarily lead to better posttest performance.
Study 2 investigated the effects of retrieval formats on flashcard learning. Retrieval format refers to the format in which vocabulary is practised in flashcard learning. Retrieval practice can be categorised into four types: receptive recall, productive recall, receptive recognition, and productive recognition. Study 2 showed that the use of a productive recall format may be particularly effective for the acquisition of knowledge of orthography although it may decrease learning phase performance. For the acquisition of form-meaning connections, recognition formats were found to be more desirable than recall.
Study 3 examined the effects of absolute and relative spacing on flashcard learning. Absolute spacing refers to the total amount of spacing that separates all repetitions of a given item. Relative spacing refers to how study opportunities are distributed relative to one another. Examples of relative spacing schedules include equal and expanding spacing. Study 3 found no significant difference between equal and expanding spacing in their posttest scores, suggesting that relative spacing may have little effect on learning. The main effect of absolute spacing, however, was significant. Massed learning, which led to the best learning phase performance, turned out to be the least effective on the posttests.
Study 4 investigated the effects of retrieval frequency and feedback timing on flashcard learning. Retrieval frequency refers to the number of retrieval attempts in flashcard learning. The timing of feedback is concerned with when to provide feedback for retrieval. Feedback has been categorised into two types: immediate and delayed. The results suggested that it may be most desirable to practise retrieval five times. The advantage of repeated retrieval persisted 4 weeks after the treatment. Contrary to the predictions of the delay-retention effect, delaying feedback did not significantly increase learning.
Taken as a whole, the present thesis suggests that practising retrieval in a difficult and effortful condition (e.g., increasing spacing between encounters and using a demanding format such as productive recall) may enhance learning. The thesis also showed that learning phase performance may not necessarily be a good index of long-term retention. Pedagogically, the findings indicate that it may be useful to raise awareness that making mistakes during learning is not necessarily a sign of ineffective learning. The results from the four studies in this thesis are useful because they may allow us to make recommendations regarding the optimal way to learn from flashcards.
ACKNOWLEDGEMENTS
My special acknowledgment goes to my supervisor, Stuart Webb, who has continuously offered me invaluable advice and warm encouragement throughout this research project. Special thanks are due to Yosuke Sasao, Mike Rodgers, Tatsuhiko Matsushita, Shota Mukai, Frank Boers, Irina Elgort, Dalice Sim, Kirsten Reid, Sky Marsen, and other staff and colleagues at Victoria University of Wellington for their invaluable advice. I was extremely honoured to have Paul Nation, Rod Ellis, and Jan Hulstijn as thesis examiners.
I am also indebted to Hideo Oka, Hideo Suzuki, Kumiko Torikai, Masamichi Mochizuki, Tetsuchi Kajiro, Kazuya Saito, and Naoki Sakata for their support. I would also like to extend my special thanks to Atsushi Mizumoto, Tomohiro Tsuchiya, Tatyana Protsenko, and Myq Larson for their cooperation with data collection. I am also very grateful to those who participated in this research.
This research was supported by Faculty Research Grants and a Victoria PhD Scholarship from Victoria University of Wellington and a Student Exchange Support Program (Long-Term Study Abroad) Scholarship from the Japan Student Services Organization. Parts of Chapter 1 have been published under the title ‘Computer-assisted second language vocabulary learning in a paired-associate paradigm: A critical investigation of flashcard software’ in Computer Assisted Language Learning, 24, 17-38. Earlier versions of Chapter 2 and Chapter 4 were presented at the First Auckland Postgraduate Conference in Linguistics and Applied Linguistics, Auckland, 2011, and the 22nd Annual Conference of the European Second Language Association, Poznań, Poland, 2012, respectively. I am very grateful to the anonymous reviewers of the article and the audiences at the conferences for their invaluable comments and stimulating discussions. This research was approved by the Victoria University of Wellington Human Ethics Committee on 26 April 2010.
Lastly, I would like to thank my parents, Saburo and Noriko, for their love and support.
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
Chapter 1. INTRODUCTION
  1.1 Factors Affecting Flashcard Learning
  1.2 Organisation of the Thesis
Chapter 2. STUDY 1: BLOCK SIZE
  2.1 Review of Literature
    2.1.1 Empirical studies on block size
    2.1.2 Limitations of previous research
  2.2 Experiment 1A
    2.2.1 Purpose
    2.2.2 Pilot studies
    2.2.3 Participants
    2.2.4 Experimental design
    2.2.5 Procedure
    2.2.6 Target and filler words
    2.2.7 Treatment
    2.2.8 Spacing in the three groups
    2.2.9 Pretest
    2.2.10 Dependent measures
    2.2.11 Scoring
    2.2.12 Results
    2.2.13 Discussion
  2.3 Experiment 1B
    2.3.1 Purpose
    2.3.2 Participants
    2.3.3 Experimental design
    2.3.4 Method
    2.3.5 Scoring
    2.3.6 Results
    2.3.7 Discussion
Chapter 3. STUDY 2: RETRIEVAL FORMAT
  3.1 Review of Literature
    3.1.1 Effects of receptive and productive retrieval
    3.1.2 Effects of recall and recognition
  3.2 The Present Study
  3.3 Pilot Studies
  3.4 Method
    3.4.1 Participants
    3.4.2 Experimental design
    3.4.3 Procedure
    3.4.4 Treatment
    3.4.5 Target words
    3.4.6 Dependent measures
    3.4.7 Scoring
  3.5 Results
  3.6 Discussion
  3.7 Limitations
Chapter 4. STUDY 3: ABSOLUTE AND RELATIVE SPACING
  4.1 Effects of Absolute Spacing
    4.1.1 The spacing effect and the lag effect
    4.1.2 Empirical evidence: Relationship among the ISI, RI, and retention
  4.2 Effects of Relative Spacing
    4.2.1 Theoretical background: Effects of equal and expanding spacing
    4.2.2 Empirical evidence: Effects of equal and expanding spacing
    4.2.3 Effects of equal and expanding spacing on L2 vocabulary learning
  4.3 The Present Study
  4.4 Pilot Studies
  4.5 Method
    4.5.1 Participants
    4.5.2 Experimental design
    4.5.3 Procedure
    4.5.4 Absolute and relative spacing schedules
    4.5.5 Target and filler words
    4.5.6 Treatment
    4.5.7 Pretest
    4.5.8 Dependent measures
    4.5.9 Questionnaire
  4.6 Results
    4.6.1 Effects of absolute spacing
    4.6.2 Effects of relative spacing
  4.7 Discussion
    4.7.1 Effects of absolute spacing
    4.7.2 Effects of relative spacing
  4.8 Limitations
Chapter 5. STUDY 4: RETRIEVAL FREQUENCY AND FEEDBACK TIMING
  5.1 Effects of Retrieval Frequency
  5.2 Effects of Feedback Timing
  5.3 The Present Study
  5.4 Pilot Studies
  5.5 Method
    5.5.1 Participants
    5.5.2 Experimental design
    5.5.3 Target and filler words
    5.5.4 Procedure
    5.5.5 Treatment
    5.5.6 Pretest
    5.5.7 Dependent measures
    5.5.8 Questionnaire
  5.6 Results
    5.6.1 Effects of retrieval frequency
    5.6.2 Effects of feedback timing
  5.7 Discussion
  5.8 Limitations
Chapter 6. GENERAL DISCUSSION AND CONCLUSIONS
  6.1 Review of the Findings
    6.1.1 Study 1
    6.1.2 Study 2
    6.1.3 Study 3
    6.1.4 Study 4
  6.2 Overall Discussion
  6.3 Pedagogical Implications
  6.4 Limitations and Further Research
  6.5 Conclusion
Appendix A: Example of the Pretest (Studies 1, 3, and 4)
Appendix B: Example of the Posttest (Studies 1, 3, and 4)
Appendix C: Retrieval Cues Used in the Productive Pretest (Experiment 1B)
Appendix D: Swahili-English Word Pairs Used in Study 2
Appendix E: Example of the Posttest (Study 2)
Appendix F: Retrieval Cues Used in the Productive Pretest (Study 3)
Appendix G: Retrieval Cues Used in the Productive Pretest (Study 4)
References
LIST OF ILLUSTRATIONS
Figure 1. Design of Kornell’s (2009) Experiment 1.
Figure 2. Item order and spacing in the 108A group (Crothers & Suppes, 1967, Experiment 9).
Figure 3. Examples of the receptive (top) and productive (bottom) recall formats. English translations are provided on the right.
Figure 4. Feedback for a correct response (top) and an incorrect response (bottom). English translations are provided on the right.
Figure 5. Item order in Experiment 1A.
Figure 6. Item order and spacing (ISI) in the BS 10 group.
Figure 7. Item order and spacing (ISI) in the BS 4 group.
Figure 8. Item order in Experiment 1B.
Figure 9. Design of Study 2.
Figure 10. Retrieval formats in the four conditions.
Figure 11. Examples of the four kinds of retrieval formats.
Figure 12. Feedback for a correct response (left) and an incorrect response (right) in the self-paced group.
Figure 13. Examples of equal, expanding, and contracting spacing.
Figure 14. Inverted U-shaped relationship between the ISI and retention in a hypothetical experiment (adapted from Cepeda et al., 2009, p. 242).
Figure 15. Design of Study 3.
Figure 16. Target items by item set. Underlined words are verbs while others are nouns.
Figure 17. Feedback for a partially correct response. English translations are provided on the right.
Figure 18. Number of correct responses on the productive posttest. Brackets enclose +1 SE.
Figure 19. Design of Butler et al.’s (2007) Experiment 2.
Figure 20. Design of Study 4.
Figure 21. Target items by item set. Underlined words are verbs while others are nouns.
Figure 22. Treatment in Study 4.
LIST OF ABBREVIATIONS
BS    BLOCK SIZE
DRE   DELAY-RETENTION EFFECT
GTC   TECHNICAL COLLEGE IN GIFU
ISI   INTER-STIMULUS INTERVAL
L1    FIRST LANGUAGE
L2    SECOND LANGUAGE
M     MEAN
OU    UNIVERSITY IN OSAKA
RI    RETENTION INTERVAL
SD    STANDARD DEVIATION
SE    STANDARD ERROR
SLA   SECOND LANGUAGE ACQUISITION
VST   VOCABULARY SIZE TEST
Chapter 1. INTRODUCTION
The purpose of this thesis is to investigate how we can optimise vocabulary learning from flashcards (word cards) in a second or foreign language (L2). Flashcards are a set of cards where the L2 word is written on one side and its meaning, usually in the form of a first language (L1) translation, L2 synonym, or L2 definition, is written on the other (e.g., Mondria & Mondria-de Vries, 1994; Nation & Webb, 2011, p. 11; Nation, 2001, p. 296). Flashcard learning is a type of paired-associate learning, where learners are asked to form connections between pairs of items (e.g., Steinel, Hulstijn, & Steinel, 2007; Thorndike, 1908).
The present thesis is motivated by several pedagogical and practical concerns. First, although paired-associate learning, including flashcard learning, tends to be dismissed as a relic of the old-fashioned behaviourist learning model, empirical studies demonstrate that it is effective, efficient, and useful. Vocabulary learnt in a paired-associate format is resistant to decay (Fitzpatrick, Al-Qarni, & Meara, 2008; Thorndike, 1908) and can be retained over several years (Bahrick, Bahrick, Bahrick, & Bahrick, 1993; Bahrick & Phelps, 1987). Studies have also shown that in a paired-associate learning task, large numbers of words can be learnt in a very short time (e.g., Fitzpatrick et al., 2008; Nation, 1980; Thorndike, 1908; W. B. Webb, 1962). Although contextual vocabulary learning is often considered superior to paired-associate learning (Folse, 2004, pp. 35–45; Hulstijn, 2001; Krashen, 1989; Laufer, 2003), research shows that paired-associate learning may be as effective as or more effective than vocabulary learning from context (Griffin, 1992; Laufer & Shmueli, 1997; Pickering, 1982; Prince, 1996; Rodriguez & Sadoski, 2000; Seibert, 1930; S. A. Webb, 2007a; however, see Chun, Choi, & Kim, 2012, for an exception). Recent studies have also suggested that flashcard learning may transfer to normal language use and is a useful learning activity (Elgort, 2011; S. A. Webb, 2002, 2009a). Although flashcard learning may be only the initial step in mastering new vocabulary and should be complemented with meaning-focused activities such as extensive reading or listening, most researchers seem to agree that it should have a place in L2 vocabulary instruction (e.g., Hulstijn, 2001; Hunt & Beglar, 2005; Laufer, 2003, 2005; Nation & Webb, 2011, pp. 29–32; Nation, 2001, pp. 296–303, 2011; Schmitt, 2007, 2008). Given the effectiveness, efficiency, and usefulness of flashcard learning, it may be valuable to examine how we can optimise vocabulary learning from flashcards.
Second, both paired-associate learning and flashcard learning seem to be common learning strategies. Hartwig and Dunlosky (2011), Karpicke, Butler, and Roediger (2009), and Wissman, Rawson, and Pyc (2012), for instance, surveyed American college students and found that 40% to 67.6% of them study with flashcards. Similarly, Schmitt (1997) surveyed 600 Japanese learners of English and reported that 51% of junior high school students, 29% of high school students, 12% of university students, and 10% of adults use flashcards for learning English vocabulary. Schmitt (1997) also found that list learning, another type of paired-associate learning, was used by 67%, 67%, 50%, and 33% of Japanese junior high school students, high school students, university students, and adults, respectively. (A list here refers to a sheet of paper where L2 words are printed along with their meanings, parts of speech, etc. Whereas an individual flashcard usually contains only one lexical item, a single word list often includes more than one lexical item.) Numerous computer-based flashcard programs are also available (Godwin-Jones, 2008, 2010; Nakata, 2011, 2012), and some of them have been used very widely. For instance, Quizlet, a flashcard website, has had more than 80 million visitors in the past year (Quizlet LLC, 2013). vTrain, a flashcard program, has been used by more than 50 universities and hundreds of schools worldwide (Rädle, 2008). In Yawata City in Kyoto, Japan, all the public junior high schools have incorporated into their English curriculum a flashcard program for Nintendo DS, a portable game player (Tamaki, 2007). Given the widespread use of flashcard and paired-associate learning, it seems useful to examine the optimal way to learn from flashcards.
Third, despite the increasing recognition that flashcard learning is effective, efficient, useful, and common, our understanding of the optimal way to learn from flashcards is still limited. For instance, previous studies have failed to identify the optimal block size, retrieval format, absolute spacing, relative spacing, retrieval frequency, and feedback timing in flashcard learning (see 1.1 for the definitions). By empirically examining the optimal way to learn from flashcards, the present thesis should contribute to improved performance using flashcards.
Fourth, research shows that many learners are not very proficient at flashcard learning. For instance, learners are often unaware that spacing increases learning (Hartwig & Dunlosky, 2011; Kornell, 2009; Son & Simon, 2012; Wissman et al., 2012), or that retrieval practice, where learners are asked to recall the L2 word form or its meaning from memory, leads to better long-term retention than mere presentation (Hagemeier & Mason, 2011; Hartwig & Dunlosky, 2011; Karpicke, Butler, et al., 2009; Karpicke, 2009; Kornell & Bjork, 2008). Learners also tend to overestimate their memory abilities and stop studying before lexical items are actually learnt (Karpicke, 2009; Kornell & Bjork, 2008). This thesis may improve learners’ ability to learn from flashcards by offering guidelines for successful flashcard learning.
Lastly, flashcard learning is considered superior to learning from a word list because flashcards may offer at least two benefits lacking in lists. First, retrieval practice may be implemented more easily with flashcards than word lists because in the former, the L2 word and its meaning are presented on different sides. Lists, by contrast, normally expose learners to both the L2 word and its meaning at the same time, making them a less desirable material than flashcards (Kornell & Bjork, 2008; Mondria & Mondria-de Vries, 1994; Nakata, 2008; Nation, 2001, p. 297). Second, flashcards offer more flexibility in the ordering of items than lists. This may enable the learner to review difficult or unknown items more frequently than easy or known items and avoid serial learning, where the position of the word in the list offers inappropriate help in remembering (Mondria & Mondria-de Vries, 1994; Nakata, 2008; Nation, 2001, pp. 306–307). Considering that flashcard learning may be a more effective form of paired-associate learning than list learning, it seems useful to examine the optimal way to learn from flashcards.
1.1 Factors Affecting Flashcard Learning
Previous studies suggest that the effectiveness of paired-associate learning, including flashcard learning, may be affected by factors such as block size, retrieval formats, absolute spacing, relative spacing, retrieval frequency, feedback timing, direction of learning (receptive or productive), retrieval practice, context, interference, L1 translations, mnemonics, and pictures (see below). Among the above, this thesis examined the effects of the following six factors: block size, retrieval formats, absolute spacing, relative spacing, retrieval frequency, and feedback timing (see below for the justification for bringing these six factors into focus).
Block size refers to the number of words to be learnt at once (Crothers & Suppes, 1967, p. 142; Hulstijn, 2001). For instance, if 20 target words are repeated in five blocks of four items (e.g., Items 1, 2, 3, 4; 1, 2, 3, 4; 1, 2, 3, 4; 1, 2, 3, 4; … 17, 18, 19, 20; 17, 18, 19, 20), the block size is four. If 20 target words are repeated in a block of 20 items (e.g., Items 1, 2, 3, 4, 5, 6, 7, 8, … 17, 18, 19, 20; 1, 2, 3, 4, 5, 6, 7, 8, … 17, 18, 19, 20), the block size is 20.
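The two orderings described above can be made concrete with a short sketch. The Python code below is illustrative only and is not taken from the materials used in this thesis: the function names `blocked_order` and `lags` and the number of repetitions per block are assumptions introduced for the example. It builds the item order for a given block size and counts how many other items intervene between successive encounters of one item.

```python
def blocked_order(n_items, block_size, repetitions):
    """Return the presentation order as a list of 1-based item numbers.

    Items are split into consecutive blocks of `block_size`; each block is
    repeated `repetitions` times before the next block begins.
    (`repetitions` is a hypothetical parameter chosen for illustration.)
    """
    order = []
    for start in range(1, n_items + 1, block_size):
        block = list(range(start, min(start + block_size, n_items + 1)))
        for _ in range(repetitions):
            order.extend(block)
    return order


def lags(order, item):
    """Number of other items between successive encounters of `item`."""
    positions = [i for i, x in enumerate(order) if x == item]
    return [later - earlier - 1 for earlier, later in zip(positions, positions[1:])]


small = blocked_order(n_items=20, block_size=4, repetitions=3)
large = blocked_order(n_items=20, block_size=20, repetitions=3)
print(small[:8])        # first two passes through the first block of four
print(lags(small, 1))   # intervening items for item 1, block size 4
print(lags(large, 1))   # intervening items for item 1, block size 20
```

Under these assumptions, item 1 is separated by three intervening items when the block size is four but by 19 when the block size is 20, which shows how a change in block size also changes spacing unless spacing is controlled separately.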
Retrieval format refers to the format in which vocabulary is practised in flashcard learning. Retrieval practice can be categorised into four types: receptive recall, productive recall, receptive recognition, and productive recognition (Laufer, Elder, Hill, & Congdon, 2004; Laufer & Goldstein, 2004). In receptive recall, learners are asked to produce the meaning of target words while in productive recall, they produce the target word form corresponding to the meaning provided. Receptive recognition requires learners to choose, rather than to produce, the correct meaning of target words from a number of options, whereas productive recognition requires learners to choose the target word form corresponding to the meaning provided.
Absolute spacing refers to the total amount of spacing that separates all repetitions of a given item (Karpicke & Bauernschmidt, 2011). For instance, if a given item is encountered four times, and each encounter is separated by 2 minutes, absolute spacing is 6 minutes (2 minutes x 3). Relative spacing refers to how study opportunities are distributed relative to one another (Karpicke & Bauernschmidt, 2011). Examples of relative spacing schedules include equal and expanding spacing (e.g., Karpicke & Roediger, 2007a; Landauer & Bjork, 1978; Logan & Balota, 2008; Storm, Bjork, & Storm, 2010). In equal spacing, the intervals between encounters of a given item are held constant (e.g., 2, 2, and 2 minutes). In expanding spacing (also known as expanding or expanded rehearsal; Ellis, 1995), the intervals between encounters are gradually increased (e.g., 1, 2, and 3 minutes; see 4.2 for details).
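The distinction between absolute and relative spacing can be expressed as simple arithmetic over the list of intervals between encounters. The Python sketch below is a minimal illustration under that assumption; the function names `absolute_spacing` and `relative_spacing` are hypothetical and are not drawn from the cited studies. Intervals are given in minutes, as in the examples above.

```python
def absolute_spacing(intervals):
    """Total spacing separating all repetitions of an item (sum of intervals)."""
    return sum(intervals)


def relative_spacing(intervals):
    """Label a schedule by how its intervals change from one to the next.

    'equal' if all intervals are the same, 'expanding' if each interval is
    longer than the previous one, 'contracting' if each is shorter, and
    'other' for anything else.
    """
    pairs = list(zip(intervals, intervals[1:]))
    if all(a == b for a, b in pairs):
        return "equal"
    if all(a < b for a, b in pairs):
        return "expanding"
    if all(a > b for a, b in pairs):
        return "contracting"
    return "other"


equal = [2, 2, 2]      # four encounters, 2 minutes apart
expanding = [1, 2, 3]  # four encounters, gradually increasing intervals

for schedule in (equal, expanding):
    print(schedule, absolute_spacing(schedule), relative_spacing(schedule))
```

Both example schedules have an absolute spacing of 6 minutes but differ in relative spacing, which is why the two factors can be manipulated independently of each other.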
Retrieval frequency refers to the number of retrieval attempts in flashcard learning. For instance, if learners practise retrieval five times, the retrieval frequency is five. The timing of feedback is concerned with when to provide feedback for retrieval. Feedback has been categorised into two types: immediate and delayed (e.g., Butler, Karpicke, & Roediger, 2007; Kulik & Kulik, 1988; Metcalfe, Kornell, & Finn, 2009). The former is typically given immediately after each response, whereas the latter is provided after a number of items or a period of time.
The present thesis investigated the effects of the above six factors for two reasons. First, an extensive body of literature has examined factors such as the direction of learning, retrieval practice, context, interference, and L1 translations and produced consistent results, leaving little room for further research. For instance, previous studies have consistently demonstrated that (a) receptive retrieval promotes large gains in receptive vocabulary knowledge but smaller gains in productive knowledge, whereas productive retrieval leads to relatively large gains in receptive knowledge as well as large gains in productive knowledge (Griffin & Harley, 1996; Mondria & Wiersma, 2004; Steinel et al., 2007; Stoddard, 1929; Waring, 1997a; S. A. Webb, 2009a, 2009b; see 3.1.1), (b) retrieval practice leads to better long-term retention than mere presentation (Barcroft, 2007; Cull, 2000; Karpicke & Roediger, 2007b, 2008; Karpicke, 2009; Landauer & Bjork, 1978; McNamara & Healy, 1995; Metcalfe & Kornell, 2007; Royer, 1973), (c) context may have little or no effect on L2 paired-associate learning (Griffin, 1992; Laufer & Shmueli, 1997; Pickering, 1982; Prince, 1996; Rodriguez & Sadoski, 2000; Seibert, 1930; S. A. Webb, 2007a), (d) learning semantically related words simultaneously causes interference and inhibits learning of unfamiliar vocabulary (e.g., Erten & Tekin, 2008; Finkbeiner & Nicol, 2003; Tinkham, 1993, 1997; Waring, 1997b), and (e) use of L1 translations facilitates vocabulary learning (Lado, Baldwin, & Lobo, 1967; Laufer & Shmueli, 1997; Mishima, 1967). Prior studies, however, have failed to identify the optimal block size, retrieval format, absolute spacing, relative spacing, retrieval frequency, and feedback timing, and further research on these six factors is warranted (see Chapters 2-5 for details). Second, studies on the above six factors will be of great pedagogical value. Use of mnemonics (e.g., Ellis & Beaton, 1993; Hulstijn, 1997; Lado et al., 1967; Levin, McCormick, Miller, Berry, & Pressley, 1982; McDaniel & Pressley, 1984; Rodriguez & Sadoski, 2000; Wang & Thomas, 1995) or pictures (e.g., Carpenter & Olson, 2012; Deno, 1968; Kopstein & Roshal, 1954; Lado et al., 1967; Webber, 1978) may be effective, but not all words are easily learnt using these methods. In contrast, block size, retrieval formats, absolute spacing, relative spacing, retrieval frequency, and feedback timing may affect the learning of any type of word, and studies on these factors are expected to have a wider range of application than those on others. For the above two reasons, the effects of the above six factors were examined in this thesis.
1.2 Organisation of the Thesis
The present thesis consists of four studies, each of which investigated the effects of different factors: block size (Study 1), retrieval formats (Study 2), absolute spacing (Study 3), relative spacing (Study 3), retrieval frequency (Study 4), and feedback timing (Study 4). Because different issues are addressed in each study, each study has a separate literature review (Griffin, 1992; S. A. Webb, 2002).
Study 1 (Chapter 2) examined the effects of block size on L2 vocabulary learning. Previous studies have shown that a large block size yields superior learning to a small one (Brown, 1924; Crothers & Suppes, 1967; Kornell, 2009; McGeoch, 1931; Seibert, 1932). Existing studies, however, are limited in that block size and spacing were confounded. More specifically, in previous studies, a large block size always had longer spacing than a small block size. Study 1 set out to investigate the effects of block size on L2 vocabulary learning in a more rigorous manner than existing studies by manipulating spacing as well as block size. The results of Study 1 may allow us to determine what block size should be used in order to optimise L2 vocabulary learning from flashcards.
Study 2 (Chapter 3) investigated the effects of retrieval formats on flashcard learning. The following four conditions were compared: recognition, recall, combined, and highest difficulty. In the recognition condition, target items were practised in receptive and productive recognition formats, whereas the recall condition consisted of receptive and productive recall. In the combined condition, target items were studied in receptive recognition, productive recognition, receptive recall, and productive recall. In the highest difficulty condition, target items were learnt only in the most demanding format, productive recall. By comparing the effectiveness of the above four conditions, Study 2 may enable us to identify the optimal retrieval format for flashcard learning.
Study 3 (Chapter 4) examined the effects of absolute and relative spacing on L2 vocabulary learning. Participants were assigned to one of the four absolute spacing groups: massed, short, medium, and long spacing. In the massed group, target items were repeated without any spacing. The short, medium, and long spacing groups used absolute spacing of roughly 3, 6, and 18 minutes, respectively. Relative spacing was manipulated within participants, and equal and expanding spacing were compared. Findings of Study 3 may allow us to determine what kinds of absolute and relative spacing schedules should be used in order to optimise flashcard learning.
Study 4 (Chapter 5) investigated the effects of retrieval frequency and feedback timing on flashcard learning. Participants were assigned to one of the four retrieval frequency levels: one, three, five, and seven. The timing of feedback was manipulated within participants, and immediate and delayed feedback were compared. In the former, feedback was provided immediately after each retrieval attempt. In the latter, feedback was withheld until all target items were practised. Findings of Study 4 may be useful because they may enable us to identify the optimal retrieval frequency and feedback timing for flashcard learning.
The final chapter (Chapter 6) summarises the findings of the four studies in this thesis and discusses the optimal way to learn from flashcards. The chapter also presents the limitations of the present thesis and discusses directions for further research.
Chapter 2. STUDY 1: BLOCK SIZE
Study 1 examined the effects of block size (hereafter, BS) on L2 vocabulary learning. BS refers to the number of words to be learnt at once (Crothers & Suppes, 1967, p. 142; Hulstijn, 2001). For instance, suppose we have 20 words to study. Would it be more effective to learn all 20 words at one time or would it be more effective to divide them into smaller decks? The retrieval practice effect (Baddeley, 1997, p. 112; Ellis, 1995) and the list-length effect (Gillund & Shiffrin, 1984; Van Bussel, 1994) suggest that a small BS is more effective. Researchers, teachers, learners, and materials developers also seem to believe that a small BS may increase learning more than a large one (see 2.1).
Contrary to the view that a small BS may facilitate learning, empirical studies have shown that a large BS may be more effective (Brown, 1924; Crothers & Suppes, 1967, Experiments 8 & 9; Kornell, 2009, Experiments 1-3; McGeoch, 1931, Experiments 1 & 3; Seibert, 1932). Existing studies, however, are limited in that BS and spacing were confounded. More specifically, in previous studies, a large BS always had longer spacing than a small BS. The confounding of BS and spacing is problematic because larger spacing generally leads to better long-term retention than shorter spacing (lag effect; see 4.1.1). In other words, the results of the earlier studies may be at least partly attributed to spacing rather than BS per se.
Study 1 set out to investigate the effects of BS on L2 vocabulary learning in a more rigorous manner than existing studies by manipulating spacing as well as BS. Study 1 consisted of two experiments: Experiments 1A and 1B. Experiment 1A compared the effects of BSs of four, 10, and 20 words. Unlike previous studies, the three BSs were matched in spacing. Experiment 1B attempted to examine the relative importance of BS and spacing on flashcard learning by comparing the following three treatments: a BS of 20 words, a BS of four words with shorter spacing than the BS of 20 words, and a BS of four words with spacing equivalent to that of the BS of 20 words. The results of this study may allow us to determine what BS should be used in order to optimise L2 vocabulary learning from flashcards.
2.1 Review of Literature
The retrieval practice effect and the list-length effect suggest that a small BS is more effective than a large BS. The retrieval practice effect (Baddeley, 1997, p. 112; Ellis, 1995) refers to the phenomenon where a successful retrieval from memory yields superior long-term retention to an unsuccessful retrieval (see Finley, Benjamin, Hays, Bjork, & Kornell, 2011; Karpicke & Bauernschmidt, 2011; Karpicke & Roediger, 2010; Logan & Balota, 2008; Modigliani, 1976; Storm et al., 2010, for a similar discussion). A small BS may lead to a higher level of retrieval success than a large one because in the former, target items are encountered after a shorter interval compared with the latter. For instance, when the BS is 100, encounters of a given item are separated by 99 other items, whereas when the BS is four, only three items intervene between the encounters of a given item. As a result, when a small BS is used, retrieval practice may take place before forgetting occurs, possibly producing higher retrieval success than a large BS. According to the retrieval practice effect, therefore, a small BS may be more effective because learners are more likely to benefit from the positive effects of retrieval success.
The list-length effect also predicts an advantage of a small BS over a large one. According to this effect, memory performance is inversely related to the number of items to be studied (Gillund & Shiffrin, 1984; Van Bussel, 1994). For example, when 10 words are studied at one time, 60% of them may be learnt successfully, whereas when 40 words are studied, only 40% of them may be learnt (Gillund & Shiffrin, 1984). The list-length effect thus also suggests that a small BS may lead to superior long-term retention.
Some researchers also claim that a relatively small BS, such as that of 10 - 12 items, may enhance learning more than a large one (Joseph, Watanabe, Shiung, Choi, & Robbins, 2009; Salisbury & Klein, 1988; Waring, 2004). Learners, teachers, and materials developers also seem to believe that a small BS is more effective (Kornell, 2009; Salisbury & Klein, 1988; Wissman et al., 2012; Woodworth & Schlosberg, 1954). Kornell (2009), for instance, discovered that learners perceived a BS of five words to be more effective than that of 20. Wissman et al. (2012) surveyed 374 American college students and found that 72.2% of them considered a small BS to be more effective, whereas only 16.3% of them responded that a large BS might be superior. Woodworth and Schlosberg (1954) also note that the use of a small BS may be a common school practice as exemplified by the saying ‘One thing at a time and that done well’ (p. 782). In addition, Kornell (2009) points out that a major study guide for the Graduate Record Examination (GRE) recommends using a small BS. Some computer-based flashcards use a relatively small BS as well. Nakata (2011) surveyed nine commercially or freely available flashcard programs and found that two of the nine programs limit the BS to a maximum of 10. More specifically, iKnow uses BSs of five and 10 words, and Word Engine uses a BS of 10 words. (The other seven programs surveyed are flexible about the BS, and users can study with either a small or large BS.) The use of a small BS among existing flashcard software may reflect the view that a small BS may increase learning more than a large one (Joseph et al., 2009). As the above discussion suggests, a small BS generally seems to be regarded as more conducive to learning.
2.1.1 Empirical studies on block size
Contrary to the view that a small BS may facilitate learning, most empirical studies have shown that a large BS may be more effective. Kornell (2009), for instance, compared the effectiveness of BSs of five and 20 words in three experiments. In his experiments, American undergraduate students studied low-frequency English words paired with higher frequency synonyms (e.g., abrogate - abolish). Figure 1 summarises the design of his Experiment 1. In his first experiment, in the BS 20 condition, 20 target word pairs were repeated in a block of 20 items and encountered four times throughout the treatment (Figure 1, left). In the BS 5 condition, the target items were encountered four times in four blocks of five items (Items 1 - 5, 6 - 10, 11 - 15, and 16 - 20; Figure 1, right).
Note. ‘1-5 x 4’ indicates that Items 1 to 5 were studied four times in a block of five items (i.e., Items 1, 2, 3, 4, 5; 1, 2, 3, 4, 5; 1, 2, 3, 4, 5; 1, 2, 3, 4, 5).
Figure 1. Design of Kornell’s (2009) Experiment 1.
On the posttest conducted 1 day after the treatment, the BS 20 condition significantly outperformed the BS 5 condition. Kornell also found the advantage of the BS 20 condition over the BS 5 condition in Experiments 2 and 3. The three experiments conducted by Kornell demonstrate that a large BS may be more effective than a small one. Six experiments have supported Kornell’s (2009) findings (Brown, 1924; Crothers & Suppes, 1967, Experiments 8 & 9; McGeoch, 1931, Experiments 1 & 3; Seibert, 1932). In Crothers and Suppes’s (1967) Experiments 8 and 9, for instance, American university students studied English-Russian word pairs. In Experiment 8, a BS of 108 words fared significantly better than BSs of 18 and 36 words. In Experiment 9, a BS of 216 words significantly outperformed a BS of 108 words.
Three experiments, however, failed to find any advantage of a large BS over a small one (Crothers & Suppes, 1967, Experiments 10 & 11; Van Bussel, 1994). In Van Bussel (1994), 12 speakers of Dutch (20-30 years old) studied 40 Dutch-English word pairs. Unlike other studies, Van Bussel found the superiority of a small BS (20) over a large one (40). The contradictory results may be ascribed in part to the number of encounters with target items during learning. Whereas the target items were encountered more than once during the treatment in all other earlier studies (Brown, 1924; Crothers & Suppes, 1967; Kornell, 2009; McGeoch, 1931; Seibert, 1932), the target items were encountered only once in Van Bussel (1994). A more detailed explanation of why the inconsistent findings might have stemmed from this difference will be offered in 2.1.2.
Although Crothers and Suppes (1967) found the advantage of a large BS in Experiments 8 and 9 (see above), they failed to do so in Experiments 10 and 11. In Experiment 10, 39 American university students studied 300 English-Russian word pairs using a BS of 100 or 300 words. No statistically significant difference existed between the two BSs in their posttest scores. In Experiment 11, 48 American university students studied 72 English-Russian word pairs in three conditions: BSs of 18, 36, and 72 words. The BS of 18 words led to the highest posttest score (M = 50.4%), followed by the BSs of 36 (M = 46.4%) and 72 (M = 44.7%) words. However, Crothers and Suppes did not carry out statistical analysis, and it is not clear whether the differences were statistically significant.
The results of Crothers and Suppes’s (1967) Experiments 10 and 11 were inconsistent with those of their earlier experiments. The contradictory results may be partially attributed to a possible difference in task difficulty (Nation, 1982, 2001, p. 305). The treatments in Experiments 10 and 11 were more demanding than in their previous experiments in at least two respects. First, in Experiments 8 and 9, the target words were practised in a receptive recognition format, where participants were presented with a Russian word and asked to select the most appropriate English translation from three options. Experiments 10 and 11, in contrast, used a recall format, and participants were asked to produce, rather than to choose, the meaning of target words. As recall may be more demanding than recognition (e.g., Butler & Roediger, 2007; Kang, McDermott, & Roediger, 2007; Laufer et al., 2004; Laufer & Goldstein, 2004; Nation, 1982, 2001, p. 305), difficulty might have been higher in Experiments 10 and 11 compared with their earlier experiments. Second, the treatments in Experiments 8 and 9 were self-paced, and participants were allowed to take as much time as they needed to respond. The treatments in Experiments 10 and 11, in contrast, were paced by the experimenter, and participants were required to write down a response within 5 seconds. This might have also increased task difficulty (Nation, 1982, 2001, p. 305). Based on the above two differences, Nation (1982, 2001, p. 305) points out that a large BS may be effective only when difficulty is low. This may be partly the reason why Crothers and Suppes found the superiority of a large BS in their earlier experiments, but not in Experiments 10 and 11.
2.1.2 Limitations of previous research
Even though the findings of the previous studies are very valuable, they may suffer from at least two limitations. First, in all existing studies, BS was confounded with spacing, and a large BS had longer spacing than a small BS. Let me illustrate this point by using Kornell’s (2009) Experiment 1 as an example. His first experiment compared BSs of five and 20 words. One common index of spacing used in previous studies is the average inter-stimulus interval (hereafter ISI; e.g., Balota, Duchek, Sergent-Marshall, & Roediger, 2006; Carpenter & DeLosh, 2005; Cull, Shaughnessy, & Zechmeister, 1996; Karpicke & Roediger, 2007a; Landauer & Bjork, 1978; Logan & Balota, 2008; Pyc & Rawson, 2007). An ISI refers to an interval that separates repetitions of a given item. For instance, in the BS 20 condition in Kornell (2009, Experiment 1), encounters of a given item were separated by 19 trials for other items (Figure 1, left), and the target items were encountered four times throughout the treatment.¹ Hence, the BS 20 condition had an average ISI of 19 trials ([19 + 19 + 19] / 3 = 19) between repetitions. In contrast, in Kornell’s BS 5 condition, only four trials intervened between the encounters of a given item (Figure 1, right), and the condition had a mean ISI of four trials ([4 + 4 + 4] / 3 = 4). The BS 20 condition, hence, had longer spacing (mean ISI = 19 trials) than the BS 5 condition (mean ISI = 4 trials).²
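The mean ISI values in this example can be reproduced with a small sketch. It assumes a fixed item order within each block and counts an ISI as the number of intervening trials; the helper names `mean_isi` and `block_schedule` are hypothetical.

```python
from collections import defaultdict


def mean_isi(order):
    """Mean number of intervening trials between repetitions of an item."""
    positions = defaultdict(list)
    for trial, item in enumerate(order):
        positions[item].append(trial)
    gaps = [later - earlier - 1
            for trials in positions.values()
            for earlier, later in zip(trials, trials[1:])]
    return sum(gaps) / len(gaps)


def block_schedule(n_items, block_size, repetitions):
    """Each block of `block_size` items is repeated before the next begins."""
    order = []
    for start in range(1, n_items + 1, block_size):
        order.extend(list(range(start, start + block_size)) * repetitions)
    return order


bs20 = mean_isi(block_schedule(20, 20, repetitions=4))
bs5 = mean_isi(block_schedule(20, 5, repetitions=4))
print(bs20, bs5)  # 19.0 4.0
```

Because every item is encountered four times in both conditions, the difference in mean ISI arises solely from how the trials are ordered.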
The confounding of BS and spacing is problematic because larger spacing generally leads to better long-term retention than shorter spacing (lag effect; see 4.1.1). In other words, the results of the previous research may be at least partly attributed to spacing rather than BS per se. BS was confounded with spacing in all existing studies (Brown, 1924; Crothers & Suppes, 1967; Kornell, 2009; McGeoch, 1931; Seibert, 1932) except Van Bussel (1994).³ As noted already, in Van Bussel’s experiment, each target item was encountered only once during the treatment. Because the target items were not repeated, there was no spacing, and BS was not confounded with spacing in his study. This may be why a small BS significantly outperformed a large one in Van Bussel (1994), unlike in other studies.
¹ A trial refers to one study opportunity for a given item. For instance, if there are 20 target items, and each item is encountered four times, there are 80 (20 x 4) trials in total.
² In this example, the number of trials is used as an index of spacing. Another commonly used unit of spacing is time (e.g., Cull, Shaughnessy, & Zechmeister, 1996, Experiments 1-3; Karpicke & Roediger, 2007a; Landauer & Bjork, 1978; Pavlik & Anderson, 2008; Pyc & Rawson, 2007). This will be explained in detail in 2.2.8.
Although Van Bussel’s (1994) findings may be useful, his study may have limited pedagogical value because multiple exposures to target words may be common in a real-life study situation. Thus, it may be beneficial to conduct research where (a) BS is not confounded with spacing, and (b) target items are encountered more than once. Crothers and Suppes’s (1967) Experiment 9 is noteworthy because it suggests how such research can be conducted. In their experiment, 36 American university students studied 216 English-Russian word pairs. The participants were assigned to one of the following three groups: 108, 108A, and 216. In the 216 group, the target items were repeated in a block of 216 items and encountered 12 times throughout the treatment. In other words, there were 12 cycles of 216 items, and each item was encountered only once in each cycle (i.e., 1 - 216 x 12). In the 108 group, the 216 target items were repeated in two blocks of 108 items (Items 1 - 108 and 109 - 216) and encountered 12 times (i.e., 1 - 108 x 12; 109 - 216 x 12). In the 108A group, the target items were also divided into two blocks of 108 items as in the 108 group. However, in the 108A group, the first block was studied before and after the second block and vice versa. More specifically, items in the first block were studied only six times instead of 12 at the beginning of the treatment, which was followed by six encounters with the second block. After that, the first block was practised six times again, and the second block was also studied six times (i.e., 1 - 108 x 6; 109 - 216 x 6; 1 - 108 x 6; 109 - 216 x 6; Figure 2). On the posttest, the 216 group (M = 94.5%) significantly outperformed the 108 group (M = 88.9%), demonstrating the superiority of a large BS over a small one. However, no statistically significant difference existed between the 108A group (M = 92.5%) and the other two groups.
³ Several previous studies are aware of and explicitly mention this confound (Kornell, 2009; McGeoch, 1931; Van Bussel, 1994; Woodworth & Schlosberg, 1954). Kornell (2009), for instance, argues that using a large BS may be an effective strategy because it helps to introduce a large amount of spacing between encounters.
Figure 2. Item order and spacing in the 108A group (Crothers & Suppes, 1967, Experiment 9).
Let us now see how much spacing separated the repetitions of target words in the three groups. In the 108 group, repetitions of a given item were separated by 107 trials on average, whereas in the 216 group, they were separated by 215 trials. As a result, the 108 and 216 groups had average ISIs of 107 and 215 trials, respectively. In Crothers and Suppes (1967), the item order was randomised for each cycle. Therefore, items were studied in a random order such as Items 8, 13, 3, 15, 17, 1, 14, … and 201, for instance, rather than in a fixed order such as Items 1, 2, 3, 4, 5, 6, 7, … and 216. Note that although randomisation of the item order changed ISIs for individual items, it did not affect the average ISI as a whole because possible differences from the mean ISI were cancelled out across items in the same cycle. For instance, if one item in a given cycle had an ISI of 117 (107 + 10) trials in the 108 group, another item in the same cycle had an ISI of 97 (107 - 10) trials. In this way, possible differences from the mean ISI were cancelled out, and the average ISI remained 107 and 215 trials for the 108 and 216 groups, respectively, even after randomisation.
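The claim that per-cycle randomisation leaves the average ISI unchanged can be checked with a brief simulation. This is a sketch under stated assumptions: the seed is arbitrary, the helper names are hypothetical, and the actual randomisation procedure used by Crothers and Suppes (1967) is not known beyond the description above.

```python
import random
from collections import defaultdict


def isis(order):
    """All inter-stimulus intervals (intervening trials) in a schedule."""
    positions = defaultdict(list)
    for trial, item in enumerate(order):
        positions[item].append(trial)
    return [later - earlier - 1
            for trials in positions.values()
            for earlier, later in zip(trials, trials[1:])]


def randomised_cycles(items, n_cycles, rng):
    """Present every item once per cycle, reshuffling the order each cycle."""
    order = []
    for _ in range(n_cycles):
        cycle = list(items)
        rng.shuffle(cycle)
        order.extend(cycle)
    return order


rng = random.Random(0)  # arbitrary seed, chosen only for reproducibility
order = randomised_cycles(range(1, 109), n_cycles=12, rng=rng)
gaps = isis(order)
print(min(gaps), max(gaps))   # individual ISIs vary from item to item
print(sum(gaps) / len(gaps))  # the mean stays at 107.0
```

The mean is exactly 107 for any shuffle: within each pair of consecutive cycles, the items' positions are a permutation of the same 108 slots, so deviations above and below the mean sum to zero.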
Spacing in the 108A group can be calculated based on Figure 2. As illustrated in the figure, the treatment in the 108A group consisted of 24 cycles of 108 items. In Figure 2, the column with the heading ISI indicates the average ISI for all items in a given cycle. For instance, the first and second encounters of Items 1 - 108 were separated by 107 trials on average. Therefore, 107 is given for the ISI column in Cycle 1. Note that the sixth (Cycle 6) and seventh (Cycle 13) encounters of Items 1 - 108 were separated by Cycles 7 - 12. Hence, 755 (107 + 108 x 6 = 755) is given for the ISI column in Cycle 6. Similarly, 755 is given for the ISI column in Cycle 12 because the sixth (Cycle 12) and seventh (Cycle 19) encounters of Items 109 - 216 were separated by Cycles 13 - 18.
The mean ISI in the 108A group can be calculated by averaging all ISIs in Figure 2, that is, (107 x 20 + 755 x 2) / 22 = 165.9 trials. In other words, although the 108A group used the same BS as the 108 group, it had longer spacing (mean ISI = 165.9 trials) than the latter (mean ISI = 107 trials). At the same time, the ISI in the 108A group was still shorter than that of the 216 group (mean ISI = 215 trials). This may be partly the reason why the 108A group fared slightly better than the 108 group but was not as effective as the 216 group. Crothers and Suppes’s (1967) Experiment 9 is significant because it suggests that it may be possible to manipulate spacing while holding the BS constant. One limitation of their experiment, though, is that the 108A and 216 groups were not completely matched in spacing. If spacing had been manipulated so that the two groups would have exactly the same average ISIs, the 108A group might have been as effective as or more effective than the 216 group. Expanding on Crothers and Suppes’s (1967) Experiment 9, the present study compared different BSs with equivalent spacing. By isolating the effects of BS and spacing, the current study may allow us to determine how BS may influence flashcard learning in a more rigorous manner than earlier studies.
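The mean ISI of 165.9 trials for the 108A group can also be obtained by building the schedule directly. The sketch below assumes a fixed item order within each cycle, which does not change the mean; the helper name `isis` is hypothetical.

```python
from collections import defaultdict


def isis(order):
    """All inter-stimulus intervals (intervening trials) in a schedule."""
    positions = defaultdict(list)
    for trial, item in enumerate(order):
        positions[item].append(trial)
    return [later - earlier - 1
            for trials in positions.values()
            for earlier, later in zip(trials, trials[1:])]


block_1 = list(range(1, 109))    # Items 1 - 108
block_2 = list(range(109, 217))  # Items 109 - 216

# 108A group: 1 - 108 x 6; 109 - 216 x 6; 1 - 108 x 6; 109 - 216 x 6
order_108a = (block_1 * 6 + block_2 * 6) * 2

gaps = isis(order_108a)
mean_108a = sum(gaps) / len(gaps)
print(sorted(set(gaps)))    # [107, 755]
print(round(mean_108a, 1))  # 165.9
```

Each item contributes ten ISIs of 107 trials and one ISI of 755 trials, which gives the same result as the cycle-level calculation (107 x 20 + 755 x 2) / 22.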
Another limitation of the previous studies may be that some of them did not control the lag to test. Lag to test refers to an interval between the last encounter with a given item and the test and is shown to affect memory performance (e.g., Anderson & Jordan, 1928; Bahrick, 1984; Cepeda, Vul, Rohrer, Wixted, & Pashler, 2008; Metcalfe et al., 2009; Rawson & Dunlosky, 2011; Rohrer, Taylor, Pashler, Wixted, & Cepeda, 2005; Seibert, 1927, 1930, 1932). For instance, if a test is given 24 hours after the last encounter with a given item (lag to test is 24 hours) instead of 1 hour after (lag to test is 1 hour), memory performance will naturally be worse. In some previous studies, a small BS was associated with greater lag to test than a large BS. For instance, in Kornell’s (2009, Experiment 1) BS 5 condition, the trials for the first two blocks (Items 1 - 5 and 6 - 10) were clustered at the beginning of the treatment (Figure 1, right), whereas in the BS 20 condition, they were distributed evenly throughout the treatment (Figure 1, left). Due to the rather long interval to the posttest, the first several items in the BS 5 condition might have been forgotten by the time of the posttest. This may be partly the reason why the BS 5 condition proved less effective than the BS 20 condition.
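The difference in lag to test between the two conditions can be approximated by counting, for each item, the trials remaining in the treatment after its last encounter. This is only a proxy expressed in trials: it assumes a fixed item order, ignores the constant delay between the end of the treatment and the posttest, and uses hypothetical helper names.

```python
def block_schedule(n_items, block_size, repetitions):
    """Each block of `block_size` items is repeated before the next begins."""
    order = []
    for start in range(1, n_items + 1, block_size):
        order.extend(list(range(start, start + block_size)) * repetitions)
    return order


def trials_after_last_encounter(order):
    """Map each item to the number of trials following its final encounter."""
    last_seen = {}
    for trial, item in enumerate(order):
        last_seen[item] = trial  # overwritten until the final encounter
    return {item: len(order) - 1 - trial for item, trial in last_seen.items()}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


lag_bs5 = trials_after_last_encounter(block_schedule(20, 5, 4))
lag_bs20 = trials_after_last_encounter(block_schedule(20, 20, 4))

print(lag_bs5[1], lag_bs20[1])  # Item 1: 64 trials vs 19 trials
print(mean(lag_bs5.values()), mean(lag_bs20.values()))  # 32.0 vs 9.5
```

Under these assumptions, items in the first block of the BS 5 condition wait far longer for the end of the treatment than any item in the BS 20 condition.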
Although BS was confounded with lag to test in most existing studies (Crothers & Suppes, 1967, Experiments 8-11; Kornell, 2009, Experiments 1 & 2; McGeoch, 1931, Experiment 1; Seibert, 1932), three experiments (Brown, 1924; Kornell, 2009, Experiment 3; McGeoch, 1931, Experiment 3) constitute exceptions. By adding the final review at the end of the treatment, these three experiments controlled the lag to test in different BSs. For instance, in Kornell’s (2009) Experiment 3, all 20 target items were studied twice at the end of the treatment in both BS 5 and 20 conditions. As a result, both conditions were controlled for lag to test. The three experiments that controlled the lag to test, nonetheless, found the advantage of a large BS over a small one (Brown, 1924; Kornell, 2009, Experiment 3; McGeoch, 1931, Experiment 3). This may be because despite being matched in lag to test, a small BS still had shorter spacing than a large BS. The results seem to suggest that although lag to test may affect learning to some extent, its effects may not be as large as those of spacing.
2.2 Experiment 1A
2.2.1 Purpose
As the above review of literature reveals, most previous studies have indicated that a large BS may be more effective than a small one. One limitation of the earlier studies, however, may be that BS was confounded with spacing. Thus, the results of the previous studies may be at least partly attributed to spacing rather than BS per se. In order to isolate the effects of BS and spacing, Experiment 1A compared three BSs (BSs of four, 10, and 20 words) that were matched in spacing. By controlling spacing, this study attempted to investigate the effects of BSs in a more rigorous manner than existing studies. Findings of this study may allow us to determine what BS should be used in order to optimise L2 vocabulary learning from flashcards. The research question of this experiment is as follows: Is a large BS more effective than a small one when spacing is equivalent?
2.2.2 Pilot studies
Pilot studies were conducted with 10 Japanese learners of English in Japan and New Zealand to identify any potential problems with the methodology of this experiment. No major problem was found in the pilot studies.
2.2.3 Participants
The original pool of participants consisted of 132 first-year Japanese students at a university in Osaka, Japan. Out of the 132 students, 37 were dropped because they were absent on the day of the experiment, chose not to participate, or their data were lost due to technical problems, leaving 95 students. A further four students were excluded from analysis because they demonstrated prior knowledge of one or more target words on the pretest (see 2.2.9). The remaining 91 students consisted of 20 Engineering, 31 Commerce, and 40 Law majors. Their average score on the first to the sixth 1,000-word frequency levels of the Vocabulary Size Test (VST: Nation & Beglar, 2007) was 33.92 (SD = 6.42) out of 60. The participants were randomly assigned to the BS 4, 10, and 20 groups so that there would be no significant difference in the VST scores, F (2, 90) = 0.22, p = .802, η² < .01. This result is also supported by the effect size (η² < .01), which is regarded as indicating no effect (Cohen, 1988). The BS 4, 10, and 20 groups consisted of 28, 30, and 33 participants, respectively. The imbalance in the number of participants among the three groups was caused by the absence of participants. As the students who demonstrated knowledge of target words on the pretest were dropped from analysis, it is assumed that the three groups did not differ from one another in terms of their knowledge about the form and meaning of the target words at the outset of the experiment. The three groups also had a roughly equal number of participants from each of the three majors: Engineering, Commerce, and Law (Table 1).
Table 1
VST Scores and Majors of Participants
                 VST scores         Majors
Groups    n      M        SD        Engineering    Commerce    Law
BS 4      28     33.43    6.97      6              9           13
BS 10     30     34.53    6.67      7              10          13
BS 20     33     33.79    5.85      7              12          14
Note. The maximum score is 60 for the VST.
2.2.4 Experimental design
There were two independent variables in the current experiment. The first independent variable was BS: four, 10, and 20 words. The second independent variable was the retention interval (interval between the treatment and posttest; hereafter, RI): immediate and 1-week delayed posttests. The present experiment employed a mixed design. BS was a between-participant variable, and the RI was a within-participant variable. The dependent variables were effectiveness and efficiency of the three groups. Effectiveness was measured by the number of correct responses on the posttest (see 2.2.10). Efficiency was defined as the number of words acquired per minute and calculated by dividing the posttest score by the study time (e.g., Kornell, 2009; Mondria, 2003; Pyc & Rawson, 2007).
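The efficiency measure defined above amounts to a simple ratio. The sketch below assumes that study time is recorded in minutes; the function name and the example values are hypothetical and are not data from this experiment.

```python
def efficiency(posttest_score, study_minutes):
    """Words acquired per minute: posttest score divided by study time."""
    if study_minutes <= 0:
        raise ValueError("study time must be positive")
    return posttest_score / study_minutes


# Hypothetical learner: 12 correct posttest responses after 8 minutes of study.
words_per_minute = efficiency(12, 8)
print(words_per_minute)  # 1.5
```

Two treatments with the same posttest score can therefore differ in efficiency if one of them requires more study time.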
BSs of four, 10, and 20 words were compared in this experiment. These three BSs were chosen for four reasons. First, previous studies have found that a large BS is more effective than a small one when BSs being compared are 20 or smaller (Brown, 1924; Kornell, 2009, Experiments 1-3; McGeoch, 1931, Experiments 1 & 3; Seibert, 1932). It was judged, therefore, that the use of these three BSs might provide statistically significant results. Second, research suggests that people can memorise around seven chunks of information by rote (the magical number seven; Miller, 1956). However, this may apply only to letters, numbers, or already known words (Cowan, 2001; Jefferies, Lambon Ralph, & Baddeley, 2004), and the number is expected to be less for unknown L2 words (Baddeley, 1997, pp. 29-31). Thus, it may be useful to compare a BS of fewer than seven words with BSs of more than seven. Third, using relatively small BSs may increase ecological validity because researchers, teachers, learners, and materials developers seem to believe that a small BS may enhance learning more than a large one (see 2.1), possibly encouraging learners to use a small BS. Fourth, using relatively small BSs may also reflect authentic computer-based learning because some computer-based flashcard programs limit the BS to a maximum of 10 (Nakata, 2011; see 2.1). For these four reasons, BSs of four, 10, and 20 words were compared in the current experiment.
2.2.5 Procedure The experiment was conducted with a computer program developed by the author with Microsoft Visual Basic for Excel Version 7.0. The experiment consisted of three sessions. In Session 1, participants received explanations about the study. Session 2 comprised the practice period, pretest, treatment, filler task, immediate posttest, and questionnaire. In Session 3, the delayed posttest was administered. There was an interval of 1 week between Sessions 1 and 2 and between Sessions 2 and 3.
Session 1 One week before Session 2, students received explanations about the study. They were also given participant information sheets and asked to sign consent forms if they chose to participate.
Session 2 At the beginning of Session 2, participants received explanations about the flashcard program and practised using it with three sample words. After the practice, the pretest was administered to test the participants’ prior knowledge of the target words (see 2.2.9). The pretest was followed by the treatment, where participants studied 23 English-Japanese word pairs (including three filler items; see 2.2.6) using a flashcard program. Target items were studied using a different BS (four, 10, or 20 words) depending on the group to which participants were assigned. After the treatment, participants answered 10 two-digit additions (e.g., 53 + 49 = ?) as a filler task. This task was included to ensure that the posttests would measure learning rather than be a function of primary memory. The immediate posttest was given after the filler task. Participants took two types of posttests: productive recall and receptive recall (see 2.2.10). After the immediate posttest, a questionnaire was administered in Japanese. Participants were asked to indicate what they considered the optimal BS to be when studying 20 English words by choosing a number from 1 to 20.
Session 3 In order to measure retention, the delayed posttest was administered 1 week after the treatment. As in the immediate posttest, the following two types of tests were given: productive recall and receptive recall. The delayed posttest was administered without prior notice so that participants would not review the target words during the period between the treatment and delayed posttest.
2.2.6 Target and filler words Twenty English-Japanese word pairs were used as target items. Target items were selected based on the following criteria: (1) Items were chosen from low frequency English words because they needed to be unfamiliar to participants. More specifically, words that are outside the most frequent 9,000 word families in the British National Corpus (BNC) frequency lists (Nation, 2004, 2006) were chosen as target items. Although this may mean that some target words might be arcane and potentially pose a threat to the ecological validity of the experiment, it was decided to choose methodological rigour over ecological validity. (2) Although there may be some exceptions (Laufer, 1997), previous studies have found that short words are easier to remember than long ones, a phenomenon known as the word length effect (e.g., Baddeley, Thomson, & Buchanan, 1975; Jalbert, Neath, Bireta, & Surprenant, 2011; Watkins, 1972). Therefore, relatively short English words were selected in order to lower the learning burden. The average number of letters of the 20 target words was 5.30 (SD = 1.38). (3) Out of the 20 items, 12 were nouns and eight were verbs, following the 6:4 ratio of nouns to verbs in natural text (e.g., S. A. Webb, 2005, 2009b).
Table 2 presents English-Japanese word pairs used in the present experiment. Billow, gouge, pique, scowl, and warble can be used as both a noun and verb. In Table 2, billow is classified as a noun because the Japanese translation うねり corresponds to the noun meaning of the word. (The translation for the verb meaning would be うねる.) Similarly, gouge, pique, scowl, and warble are categorised as verbs in Table 2 because Japanese translations for the verb meaning were used in this study.
Table 2 English-Japanese Word Pairs Used in Experiment 1A
               English      Japanese      POS   BNC        English   Japanese      POS   BNC
Target items   apparition   幽霊          N     14         loach     ドジョウ      N     off-list
               billow       うねり        N     12         mane      たてがみ      N     11
               cadge        ねだる        V     15         mirth     陽気          N     11
               citadel      砦            N     12         nadir     どん底        N     10
               dally        ふざける      V     15         pique     怒らせる      V     14
               fawn         へつらう      V     12         quail     うずら        N     12
               fracas       けんか        N     10         rue       後悔する      V     10
               gouge        彫る          V     11         scowl     にらむ        V     11
               grig         コオロギ      N     off-list   toupee    かつら        N     16
               levee        堤防          N     18         warble    さえずる      V     15
Filler items   husk         皮            N     12         smudge    よごす        V     12
               polemic      論争          N     13
Note. POS = part of speech; BNC = BNC frequency level based on Nation (2004, 2006).
Three filler items (husk, polemic, and smudge) were also chosen based on the same criteria as the target items. Filler items were studied and tested like target word pairs, but were excluded from analysis. Otherwise, they were treated in exactly the same way as target items, and participants were not informed that filler items would be used. The filler items were included for two reasons. First, they were used to match spacing in the three groups (see 2.2.8). Second, they were used as primacy and recency buffers. Primacy and recency buffers refer to filler items that are included at the beginning and end of the treatment to lessen the influence of serial position (or primacy and recency) effects (e.g., Cull et al., 1996; Delaney, Verkoeijen, & Spirgel, 2010; Karpicke & Roediger, 2007a; Landauer & Bjork, 1978). When items are presented in series, the first and last several items tend to be remembered better than the middle ones, a phenomenon known as serial position effects (Bonk & Healy, 2010; Delaney et al., 2010; Glanzer & Cunitz, 1966). The primacy and recency buffers were included to reduce these effects on target items.
2.2.7 Treatment In the treatment, 23 English-Japanese word pairs (including three filler items) were studied with a flashcard program. Each target item was encountered five times throughout the treatment in all three groups (see below for the justification for setting the number of encounters to five). In the first encounter with a given item, the English and Japanese words were presented simultaneously for 8 seconds per word pair (initial presentation). Participants were asked to study both the English word and its Japanese translation. After 8 seconds, the program automatically proceeded to the next item. The duration of the initial presentation was set to 8 seconds per item based on previous studies that have shown that 8 seconds is sufficient for learning (Cull, 2000; Karpicke & Roediger, 2007a; Pashler, Zarow, & Triplett, 2003). Each word pair was presented only once in the initial presentation.
In the second and third encounters, target items were practised in a receptive recall format. More specifically, participants were presented with a target word and asked to type the corresponding Japanese translation (Figure 3, top). After typing the response, participants clicked on the OK button or pressed the Enter key twice to proceed. If they were not sure about the correct response, they could leave the answer box blank and click on the OK button or press the Enter key twice. After each response, feedback was provided to the participants (Figure 4). The feedback window indicated whether the response was correct or incorrect. The target English word, Japanese translation, and learners’ response were also given in the feedback window. The feedback was shown for 5 seconds per response because previous studies suggest that 5 seconds is sufficient for learning (Cepeda et al., 2009; Hays, Kornell, & Bjork, 2010; Pashler et al., 2003). After 5 seconds, the program automatically proceeded to the next item.
Figure 3. Examples of the receptive (top) and productive (bottom) recall formats. English translations are provided on the right.
Figure 4. Feedback for a correct response (top) and an incorrect response (bottom). English translations are provided on the right.
In the fourth and fifth encounters, target items were practised in a productive recall format. In this format, participants were presented with a Japanese word and asked to type the corresponding English translation (Figure 3, bottom). Other than that, the productive recall format was exactly the same as the receptive recall format. Filler items were treated in the same way as target items. In the first encounter with each filler item, the English and Japanese words were presented simultaneously for 8 seconds per word pair (initial presentation). In the last encounter, filler items were practised in productive recall. They were studied in receptive recall in the remaining encounters.
Unlike the initial presentation and feedback, both of which were paced by the computer, retrieval practice was self-paced (e.g., Balota et al., 2006; Carpenter & DeLosh, 2005; Cull et al., 1996, Experiments 4 & 5; Cull, 2000; Logan & Balota, 2008; Pashler et al., 2003), and participants were allowed to take as much time as they needed to type a response. Retrieval practice was self-paced for three reasons. First, participants who lack computer skills may require more time to type their response than computer-literate students. If retrieval practice is computer-paced, those who are not very computer-literate may not be able to finish typing within the allocated time, leading to ineffective learning. Self-pacing of retrieval practice may enable learners to learn effectively regardless of their familiarity with computers. Second, when learning from paper-based flashcards, it may be common for learners to pace retrieval practice by themselves. Self-pacing of retrieval practice, therefore, may reflect authentic flashcard learning and increase ecological validity. Third, in all nine programs surveyed by Nakata (2011), retrieval practice is self-paced by learners. Thus, self-pacing of retrieval practice may also be representative of authentic computer-based flashcard learning.
The treatment involved both receptive and productive retrieval (Figure 3). This is because previous studies (Griffin & Harley, 1996; Mondria & Wiersma, 2004; S. A. Webb, 2002, 2009a) suggest that in order to gain both receptive and productive vocabulary knowledge efficiently, learners need to practise receptive as well as productive retrieval (see 3.1.1). Receptive retrieval (second and third encounters) preceded productive retrieval (fourth and fifth encounters) because the retrieval practice effect (Baddeley, 1997, p. 112; Ellis, 1995) and the retrieval effort hypothesis (Pyc & Rawson, 2009) indicate that gradually increasing retrieval effort may maximise learning (see 3.1.2). Recall, rather than recognition, formats were used because recall may enhance learning more than recognition (see 3.1.2).
The number of encounters with target words during the treatment was set to five (initial presentation + 2 receptive recall + 2 productive recall) for two reasons. First, Crothers and Suppes (1967, Experiments 8 & 9) suggest that 85 - 88% of the items in their study were acquired after six (Experiment 9) or seven encounters (Experiment 8). Considering that the current study involved fewer items (23 including fillers) than Crothers and Suppes (108 in Experiment 8 and 216 in Experiment 9), six encounters might lead to a ceiling effect, reducing the potential for showing a difference between groups. Second, the results of the pilot studies suggested that neither a ceiling nor a floor effect was observed on the posttests with five encounters. With these considerations in mind, the number of encounters with target words was set to five.
In order to ensure that the order of items would not offer inappropriate help in remembering (Mondria & Mondria-de Vries, 1994; Nation, 2001, pp. 306-307), the item order was randomised for each repetition. The item order was determined randomly by the flashcard program with the constraint that encounters of a given item were separated by at least two, five, and 10 trials for other items in the BS 4, 10, and 20 groups, respectively. For instance, if a given item appeared as the last item in the first round of 20 items in the BS 20 group, it did not appear until the 11th position in the second round so that there would be at least 10 intervening trials (first - 10th items in the second round). A smaller minimum interval was not used because longer spacing generally leads to better long-term retention than shorter spacing (lag effect; see 4.1.1). The item order was also randomised anew for each participant to minimise the potential of an order effect. For instance, for one participant, apparition may be the first target item, and warble may be the last item, whereas for another participant, rue may be the first target item, and billow may be the last.
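The constrained randomisation described above can be sketched in Python as follows. This is only an illustration: the original program was written in Visual Basic, and its algorithm is not described here, so the use of rejection sampling (reshuffling until the constraint is met) and all function and variable names are assumptions.

```python
import random


def next_round_order(prev_order, min_lag, rng):
    """Shuffle the items for the next round so that every item is separated
    from its previous encounter by at least min_lag intervening trials."""
    n = len(prev_order)
    prev_pos = {item: i for i, item in enumerate(prev_order)}
    while True:
        order = list(prev_order)
        rng.shuffle(order)
        # Intervening trials = trials left in the previous round after the
        # item, plus trials that precede it in the new round.
        if all((n - 1 - prev_pos[item]) + new_pos >= min_lag
               for new_pos, item in enumerate(order)):
            return order


rng = random.Random(0)             # fixed seed so the sketch is reproducible
first_round = list(range(1, 21))   # Items 1 to 20 (BS 20 group)
rng.shuffle(first_round)
second_round = next_round_order(first_round, min_lag=10, rng=rng)
```

With `min_lag=10`, the last item of the first round can appear no earlier than the 11th position of the second round, as in the example above. Rejection sampling is simple but may discard many shuffles before one satisfies the constraint; a constructive procedure would be an alternative.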
2.2.8 Spacing in the three groups The present experiment attempted to compare different BSs with equivalent spacing. There are two common methods to control spacing in different treatments. One way is to match the average number of intervening trials (e.g., Balota et al., 2006; Carpenter & DeLosh, 2005; Cull et al., 1996, Experiments 4 & 5; Logan & Balota, 2008; see 2.1.2). In this method, if encounters of a given item are separated by 10 trials on average in two treatments, for instance, they are regarded as having equivalent spacing. The other common method is to match the average amount of time between repetitions (e.g., Cull et al., 1996, Experiments 1-3; Karpicke & Roediger, 2007a; Landauer & Bjork, 1978; Pavlik & Anderson, 2008; Pyc & Rawson, 2007). According to this approach, if encounters of a given item are separated by 3 minutes on average in two treatments, for instance, they are considered to be controlled for spacing. Note that in order to use the second method, the treatment needs to be paced by the experimenter or computer. Otherwise, it would not be possible to ensure that a given item is encountered every 3 minutes, for instance.
In the present study, spacing was controlled using the first method, and the number of trials was used as an index of spacing. This method was chosen because a self-paced treatment might be more desirable than a computer-paced treatment in terms of effectiveness and ecological validity (see 2.2.7). At the same time, the average amount of time between repetitions was analysed after the experiment to investigate whether spacing in the three BSs was equivalent when time is used as a unit of spacing rather than trial. The analysis indicated that the three groups could be regarded as having roughly equivalent spacing whether trial or time is used as an index of spacing (see 2.2.12).
In order to match the average number of intervening trials in the three groups, trials in the BS 4, 10, and 20 groups were arranged as in Figure 5. (I), (R), and (P) in Figure 5 denote the initial presentation, receptive recall, and productive recall, respectively (see 2.2.7). ‘1-4 (R),’ for instance, indicates that Items 1 to 4 were studied in a receptive recall format once. In this study, the item order was randomised for each repetition (see 2.2.7). Therefore, ‘1-4’ does not mean that items were studied in a fixed order such as Items 1, 2, 3, and 4. In the actual treatment, items were studied in a random order such as Items 2, 1, 3, and 4 or Items 1, 2, 4, and 3. * in Figure 5 indicates the position where a target word was practised in a productive recall format for the first time in each group (see 2.2.13).
Note. Primacy = primacy buffers; Recency = recency buffers; (I) = initial presentation; (R) = receptive recall; (P) = productive recall. * indicates the position of initial productive retrieval in each group (see 2.2.13). Figure 5. Item order in Experiment 1A.
‘Filler x 3’ means that there were three trials for filler items (2.2.6). As illustrated in Figure 5, there were three filler trials at the end of the treatment in all three groups, which served as recency buffers (Karpicke & Roediger, 2007a). There were eight and five filler trials in the middle of the treatment in the BS 4 and 10 groups, respectively. These filler trials were used to match spacing in the three groups (see below). Three, six, and 11 primacy buffers were included at the beginning of the treatment in the BS 4, 10, and 20 groups, respectively. The number of primacy buffers differed among the three groups in order to match the total number of filler trials (14).
Immediately before the recency buffers, there was a final review in all groups, where the target items were studied once in a block of 20 items (Brown, 1924; Kornell, 2009, Experiment 3; McGeoch, 1931, Experiment 3). The final review was included for two reasons. First, it was used to control spacing in the three groups (see below). Second, it was included to control the lag to test in the three groups (see 2.1.2). Without the final review, the first several items in the BS 4 and 10 groups would have a rather long interval to the posttest and might be forgotten by the time of the posttest.
BS 20 group This subsection and the two that follow describe how spacing was controlled in each of the three groups. In the BS 20 group, the 20 target word pairs were repeated in a block of 20 items and encountered five times throughout the treatment (Figure 5, right). In other words, there were five cycles of 20 items, and each item was encountered only once in each cycle. The last cycle also functioned as the final review. Because all repetitions of a given item were separated by 19 trials on average, the BS 20 group had an average ISI of 19 trials. As noted in 2.2.7, the item order was randomised anew for each cycle. Although randomisation of the item order changed ISIs for individual items, it did not affect the average ISI as a whole because possible differences from the mean ISI (19 trials) were cancelled out across items in each cycle (see 2.1.2). Consequently, the average ISI remained 19 trials even after randomisation. This was also true for the BS 10 and 4 groups below.
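The claim that randomisation leaves the average ISI unchanged can be checked with a short Python simulation. This is a sketch under simplifying assumptions: it uses unconstrained random orders (the minimum-lag constraint from 2.2.7 is ignored, which does not affect the mean), and the function and variable names are hypothetical.

```python
import random


def mean_isi(cycles):
    """Mean number of intervening trials between consecutive encounters
    of the same item across a sequence of cycles."""
    sequence = [item for cycle in cycles for item in cycle]
    last_seen = {}
    gaps = []
    for index, item in enumerate(sequence):
        if item in last_seen:
            gaps.append(index - last_seen[item] - 1)
        last_seen[item] = index
    return sum(gaps) / len(gaps)


rng = random.Random(2013)
# Five cycles of 20 items, each cycle in a fresh random order.
cycles = [rng.sample(range(1, 21), 20) for _ in range(5)]
bs20_mean_isi = mean_isi(cycles)
print(bs20_mean_isi)
```

An item at position p in one cycle and position q in the next is separated by 19 + q - p trials. Because the positions within each cycle always sum to the same total, the q - p terms cancel across items, and the mean is 19 for any choice of random orders.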
BS 10 group In the BS 10 group, target words were repeated in two blocks of 10 items (Items 1 - 10 and 11 - 20) except in the final review, where they were practised once in a block of 20 items. As in the BS 20 group, the target word pairs were encountered five times throughout the treatment including the final review. Figure 6 illustrates the item order and spacing in the BS 10 group. In Figure 6, the column with the heading Items presents items studied in each cycle. For instance, ‘1-10 x 1’ is given for Cycle 1. This means that Items 1 to 10 were studied once in this cycle. As shown in Figure 6, the treatment in this group consisted of the following: primacy buffers, eight cycles of 10 target items, filler cycle, final review, and recency buffers. In Cycles 1, 2, 5, and 6, items in the first block (Items 1 to 10) were studied, and in Cycles 3, 4, 7, and 8, items in the second block (Items 11 to 20) were studied. The arrangement of the trials in this group is based on that of the 108A group in Crothers and Suppes (1967, Experiment 9; Figure 2), where the trials for the first block were interleaved with trials for the second block and vice versa.
Figure 6. Item order and spacing (ISI) in the BS 10 group.
In Figure 6, the column with the heading ISI indicates the average ISI for all items in a given cycle. For instance, the first and second encounters of Items 1 to 10 were separated by nine trials on average. Therefore, 9 is given for the ISI column in Cycle 1. Note that the mean ISI between the second and third encounters of target words is longer (34 trials) than the ISI between the first and second encounters (9 trials). This is because the second and third encounters of the first block (Cycles 2 and 5) were separated by Cycle 3 (10 trials), Cycle 4 (10 trials), and the filler cycle (5 trials), and the second and third encounters of the second block (Cycles 4 and 7) were separated by the filler cycle (5 trials), Cycle 5 (10 trials), and Cycle 6 (10 trials). As a result, 34 (9 + 10 + 10 + 5 = 34) is given for the ISI column in Cycles 2 and 4. Similarly, the mean ISI between the fourth and fifth encounters of target words was longer (29 and 19 trials for the first and second blocks, respectively) than the ISI between the third and fourth encounters (9 trials).
The average ISI in the BS 10 group can be calculated by averaging the mean ISIs of Cycles 1 to 8, that is, (9 + 34 + 9 + 34 + 9 + 29 + 9 + 19) / 8 = 19 trials, which is exactly the same as in the BS 20 group. As matching the average number of intervening trials is a common method of controlling spacing (e.g., Balota et al., 2006; Carpenter & DeLosh, 2005; Logan & Balota, 2008; see above), the BS 10 and 20 groups are regarded as having equivalent spacing. Note that filler items are used to manipulate spacing in this group. Specifically, the filler cycle after Cycle 4 intervened between the second and third encounters of target items (Figure 6). Without the five filler trials in the filler cycle, the average ISI between the second and third encounters would be 29 (34 - 5 = 29) trials, instead of 34. Consequently, without the filler cycle, the average ISI in the BS 10 group would be (9 + 29 + 9 + 29 + 9 + 29 + 9 + 19) / 8 = 17.75 trials and no longer match that of the BS 20 group (19 trials). The final review was also used to manipulate spacing in this group. Without the final review, the mean ISI in this group would be shorter ([9 + 34 + 9 + 34 + 9 + 9] / 6 = 17.33 trials) than in the BS 20 group.
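The three averages for the BS 10 group can be verified with a few lines of Python. The per-cycle ISI values are taken directly from the text above; the variable and function names are hypothetical.

```python
def average(values):
    """Arithmetic mean of a list of per-cycle mean ISIs."""
    return sum(values) / len(values)


bs10_actual = [9, 34, 9, 34, 9, 29, 9, 19]      # Cycles 1 to 8 as administered
bs10_no_filler = [9, 29, 9, 29, 9, 29, 9, 19]   # filler cycle removed (34 - 5 = 29)
bs10_no_final_review = [9, 34, 9, 34, 9, 9]     # fourth-to-fifth ISIs dropped

print(average(bs10_actual))                      # 19.0
print(average(bs10_no_filler))                   # 17.75
print(round(average(bs10_no_final_review), 2))   # 17.33
```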
BS 4 group In the BS 4 group, target words were repeated in five blocks of four items (Items 1 - 4, 5 - 8, 9 - 12, 13 - 16, and 17 - 20) except in the final review, where a block of 20 items was used. As in the other two groups, the target word pairs were encountered five times throughout the treatment including the final review. Figure 7 presents the item order and spacing in the BS 4 group. As illustrated in Figure 7, the treatment in this group consisted of the following: primacy buffers, 20 cycles of four target items, filler cycle, final review, and recency buffers. The arrangement of the trials in this group is an extension of that of the 108A group in Crothers and Suppes (1967, Experiment 9; Figure 2). Note that as in the BS 10 group, spacing between the second and third (43 trials) and the fourth and fifth encounters of target items (19 to 35 trials) was longer than spacing between the first and second and the third and fourth encounters (3 trials).
Figure 7. Item order and spacing (ISI) in the BS 4 group.
The average ISI in the BS 4 group can be calculated by averaging the mean ISIs of Cycles 1 to 20, that is, (3 + 43 + 3 + 43 + 3 + 43 + 3 + 43 + 3 + 43 + 3 + 35 + 3 + 31 + 3 + 27 + 3 + 23 + 3 + 19) / 20 = 19 trials, which is exactly the same as in the BS 10 and 20 groups. Because the BS 4, 10, and 20 groups had exactly the same average ISIs (19 trials), the three groups are regarded as being matched in spacing. Note that filler items are also used in the BS 4 group to manipulate spacing. Without the eight filler trials in the filler cycle (Figure 7), the average ISI between the second and third encounters of target words would be 35 trials (43 - 8 = 35), instead of 43. As a result, without the filler cycle, the mean ISI in this group would be (3 + 35 + 3 + 35 + 3 + 35 + 3 + 35 + 3 + 35 + 3 + 35 + 3 + 31 + 3 + 27 + 3 + 23 + 3 + 19) / 20 = 17 trials, and spacing would no longer be equivalent in the three groups. The final review was also used to manipulate spacing as in the BS 10 group. Without the final review, the average ISI in the BS 4 group would be shorter ([3 + 43 + 3 + 43 + 3 + 43 + 3 + 43 + 3 + 43 + 3 + 3 + 3 + 3 + 3] / 15 = 16.33 trials) than in the other two groups (19 trials). By matching the mean ISIs in the three groups, the present experiment aimed to separate the effects of BS from those of spacing.
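The BS 4 figures can be checked in the same way. In the Python sketch below, the per-cycle ISI values are copied from the sums above and grouped by their repeating pattern; the variable names are hypothetical.

```python
# Cycles 1 to 10: first-to-second (3) and second-to-third (43) ISIs of the five blocks.
early_cycles = [3, 43] * 5
# Cycles 11 to 20: third-to-fourth (3) and fourth-to-fifth (35 down to 19) ISIs.
late_cycles = [3, 35, 3, 31, 3, 27, 3, 23, 3, 19]

bs4_actual = early_cycles + late_cycles
# Without the eight filler trials, each second-to-third ISI drops from 43 to 35.
bs4_no_filler = [3, 35] * 5 + late_cycles
# Without the final review, the fourth-to-fifth ISIs do not exist.
bs4_no_final_review = early_cycles + [3] * 5

for isis in (bs4_actual, bs4_no_filler, bs4_no_final_review):
    print(round(sum(isis) / len(isis), 2))
```

The three printed values correspond to the 19, 17, and 16.33 trials reported above.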
2.2.9 Pretest Immediately before the treatment, a receptive recall test was given as the pretest. Specifically, participants were presented with an English word and asked to type its meaning in Japanese. Unlike the treatment, feedback (Figure 4) was not provided in the pretest. Other than that, the pretest was exactly the same as the receptive recall format in the treatment (Figure 3, top). A receptive test was used as the pretest because it is easier than a productive test (Laufer et al., 2004; Laufer & Goldstein, 2004; S. A. Webb, 2008) and hence, more sensitive to a small amount of knowledge that participants might have about target words. A recall, rather than recognition (multiple-choice), test was used because in a recognition test, learners have a relatively high chance of guessing the correct answer. See Appendix A for an example of the pretest.
In addition to the 20 target words, three filler items (husk, polemic, and smudge) and three practice words (apple, orange, and banana) were tested in the pretest. There was one question for each word, and the pretest consisted of 26 questions. At the beginning of the pretest, the three practice words were tested in order to familiarise participants with the test format. Responses for filler and practice items were not included in the results. In order to counterbalance effects of the item order among participants, the order was randomised anew for each participant. Four out of 95 participants demonstrated prior knowledge of one or more target words on the pretest. The data of these students were dropped from analysis (see 2.2.12). Responses on the pretest were scored using the following two scoring procedures: strict and sensitive (see 2.2.11).
2.2.10 Dependent measures Immediate and delayed posttests were administered in the present study. The immediate posttest was given on the same day as the treatment. In order to measure retention, the delayed posttest was administered 1 week after the treatment. Two tests of form and meaning were given at each test administration: productive recall and receptive recall tests (Laufer et al., 2004; Laufer & Goldstein, 2004). Unlike the treatment, feedback was not provided in the posttest. Other than that, the posttests were exactly the same as the corresponding retrieval formats in the treatment (Figure 3). In addition to the 20 target words, three filler items (husk, polemic, and smudge) were tested in the posttest. There was one question for each item, and each posttest consisted of 23 questions. Responses for filler items were not included in the results.
At the beginning of each posttest, the three filler items were tested in order to familiarise participants with the test format. The order of items in the posttests was different from that in the treatment in order to make sure that learners would not use the item order as an aid for remembering. There were four posttests in total (immediate productive, immediate receptive, delayed productive, and delayed receptive), and a different randomised order of items was used within each posttest. In order to counterbalance effects of the item order among participants, the order was also randomised anew for each participant. Other than the item order, the delayed posttest was exactly the same as the immediate posttest.
As noted above, the delayed posttest was administered 1 week after the treatment. The interval of 1 week was chosen for two reasons. First, studies have shown that most forgetting occurs immediately after learning (e.g., Anderson & Jordan, 1928; Bahrick, 1984; Cepeda et al., 2008; Rawson & Dunlosky, 2011; Rohrer et al., 2005, Experiment 1; Seibert, 1927, 1930, 1932). Scores on a 1-week delayed posttest, therefore, may be a good indication of retention over time. Second, in pilot studies, no floor effect was observed on the 1-week delayed posttest scores. Based on the above two reasons, the RI (retention interval) of 1 week was chosen.
In both the immediate and delayed posttests, the productive recall posttest preceded the receptive recall posttest. Correct responses in the productive test were used as cues in the receptive test, and correct responses in the receptive test were used as cues in the productive test. Therefore, whichever is given first may influence performance on the other. It was decided to give the productive test prior to the receptive test because administering the receptive test earlier may have a larger effect on test scores than doing it the other way around. More specifically, in the receptive test, English words such as apparition, billow, and cadge were provided as cues. Since most participants had never met these words prior to the treatment, they might acquire knowledge of the orthography of these words by seeing them used as cues in the receptive test. In contrast, in the productive test, Japanese words such as 幽霊, うねり, and ねだる were given as cues. Because these words were already familiar to participants prior to the experiment, participants were unlikely to acquire new vocabulary knowledge through the productive test. As there seems to be a larger learning effect from the receptive test than the productive test, the latter was administered prior to the former. See Appendix B for examples of the productive and receptive posttests.
2.2.11 Scoring Responses on the pretest and posttest were scored using the following two scoring procedures: strict and sensitive. Two scoring systems with different sensitivities were used because, given the incremental nature of vocabulary acquisition (e.g., Barcroft & Rott, 2010; Nagy, Herman, & Anderson, 1985; Schmitt, 1998; Thomas & Dieter, 1987; Waring & Takaki, 2003; S. A. Webb, 2012), the results could be misleading if credit is given only for fully correct responses. In the productive test, in the strict scoring method, only responses without any misspellings were scored as correct, and all other responses were marked as incorrect. In the sensitive scoring procedure, responses that would be awarded 0.75 using a lexical production scoring protocol (LPSP; e.g., Barcroft & Rott, 2010; Barcroft, 2002, 2003, 2004; Deconinck, Boers, & Eyckmans, 2010) were also scored as correct.
In LPSP, responses are given 0.00, 0.25, 0.50, 0.75, or 1.00 based on the number of letters that are correct or present. A letter is regarded as correct if a given letter in the target word is placed in the same position in the response. A letter is regarded as present if a letter in the target word is found in the response regardless of the position. LPSP scores are calculated based on the following:
1.00: all letters in the response are correct (i.e., the response is exactly the same as the target word);
0.75: 50% or more but less than 100% of the letters in the response are correct or 75% or more but less than 100% of the letters are present;
0.50: 25% or more but less than 50% of the letters in the response are correct or 50% or more but less than 75% of the letters are present;
0.25: at least one letter in the response is correct or 25% or more but less than 50% of the letters are present;
0.00: all other responses.
For instance, suppose that the learner produced mally for the target word dally. This response is given 0.75 in LPSP because 80% (4 / 5 = 0.80) of the letters in the response ( _ ally ) are correct. Next, suppose that the learner produced maddy for dally. This response will be given 0.50 because only 40% (2 / 5 = 0.40) of the letters in the response ( _ a _ _ y ) are correct.
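The LPSP bands and the two worked examples above can be sketched in code. The following is only an illustrative sketch, not the scoring program used in this thesis. It rests on several assumptions that the description above leaves open: percentages are taken relative to the length of the target word, each response letter can count as "present" for only one target letter, and any response that is not identical to the target is capped at 0.75. The function names lpsp_score and strict_and_sensitive are hypothetical.

```python
from __future__ import annotations

from collections import Counter


def lpsp_score(response: str, target: str) -> float:
    """Return an LPSP-style score (0.00, 0.25, 0.50, 0.75, or 1.00).

    Assumptions (not confirmed by the source): shares are computed over
    the target length, and non-identical responses never exceed 0.75.
    """
    response = response.strip().lower()
    target = target.strip().lower()
    if not target:
        raise ValueError("target word must not be empty")
    if response == target:
        return 1.00

    n = len(target)
    # "Correct": the same letter in the same position as in the target.
    correct = sum(r == t for r, t in zip(response, target))
    # "Present": target letters found anywhere in the response
    # (assumption: multiset overlap, so a letter is not counted twice).
    present = sum((Counter(response) & Counter(target)).values())

    correct_share = correct / n
    present_share = present / n

    if correct_share >= 0.50 or present_share >= 0.75:
        return 0.75
    if correct_share >= 0.25 or present_share >= 0.50:
        return 0.50
    if correct >= 1 or present_share >= 0.25:
        return 0.25
    return 0.00


def strict_and_sensitive(response: str, target: str) -> tuple[bool, bool]:
    """Strict credit needs an exact match; sensitive credit also accepts 0.75."""
    score = lpsp_score(response, target)
    return score == 1.00, score >= 0.75


if __name__ == "__main__":
    print(lpsp_score("mally", "dally"))  # 0.75 (4 of 5 letters correct)
    print(lpsp_score("maddy", "dally"))  # 0.5 (2 of 5 letters correct)
    print(strict_and_sensitive("mally", "dally"))  # (False, True)
```

Under these assumptions, mally is credited only under sensitive scoring, whereas maddy is credited under neither procedure.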
There seem to be advantages and disadvantages to LPSP. One benefit of LPSP may be that it is objective and replicable. An alternative scoring system such as allowing minor spelling mistakes, for instance, may involve subjective judgement and might not be replicable, potentially posing a threat to the internal reliability (Nunan, 1992, pp. 14–15) of the study. LPSP scores, in contrast, can be calculated automatically with a computer program and may be preferable in terms of internal reliability. Second, LPSP takes account of differences in word length unlike other scoring methods such as those used by Waring (1997a) or Thomas and Dieter (1987). In Waring (1997a), for example, responses that were misspelt by one or two letters were given partial credit. His scoring procedure may be biased in favour of shorter words. For instance, for a four-letter target word, responses could be given a partial score even though 50% of the letters (2 / 4 = .50) are incorrect, whereas for a 10-letter word, responses will not be given partial credit unless at least 80% of the letters (8 / 10 = .80) are correct. Thomas and Dieter’s (1987) W-l scoring procedure, where a misspelling of one letter was also marked as correct, may suffer from a similar problem. LPSP scores, in
contrast, may be affected by word length to a lesser degree because LPSP uses percentages rather than the number of letters when giving a partial score.
At the same time, LPSP may suffer from at least two limitations. One disadvantage may be that it may be too lenient. For example, for the target word apparition, any response starting with a will be given a score of 0.25 or higher because at least one letter in the response is regarded as correct. As a result, there might be a chance of giving credit for wild guessing, which may not be valid because it may not necessarily reflect participants’ vocabulary knowledge. Second, LPSP scores may need to be treated as ordinal data rather than interval data. For instance, learners may gain an LPSP score of 1.00 in several different ways: 1.00 x 1, 0.75 + 0.25, 0.50 x 2, 0.50 + 0.25 x 2, and 0.25 x 4. Obtaining a full score on one item (1.00 x 1), however, does not necessarily indicate the same amount of knowledge as obtaining partial scores on several items (e.g., 0.25 x 4), and it could be argued that LPSP scores may not be additive. In other words, LPSP scores may need to be considered as ordinal data rather than interval data, which may invalidate the use of a parametric test such as ANOVA.
Despite the above limitations, a scoring method based on LPSP was employed in the present thesis for three reasons. First, LPSP may be preferable to other scoring systems in terms of internal reliability because it is objective and replicable. Second, LPSP takes account of differences in word length unlike other scoring procedures (Thomas & Dieter, 1987; Waring, 1997a) and may be affected by word length to a
lesser extent. Third, LPSP was found to be useful in previous studies (e.g., Barcroft & Rott, 2010; Barcroft, 2002, 2003, 2004; Deconinck et al., 2010). At the same time, with its possible limitations in mind, two modifications were made to the original version of LPSP in this study. First, as LPSP may be too lenient, scores of 0.25 and 0.50 were not used in the present study. By so doing, this study attempted to give credit only for responses that approximately resemble target words. Second, in the sensitive scoring procedure in the current study, responses that would be awarded 0.75 using LPSP were given a score of 1.00 instead of 0.75. This was done to ensure that the scores would be interval data, which are amenable to parametric tests. Responses on the productive test were scored automatically by a computer program developed by the author and categorised into the following three categories: (a) correct in both strict and sensitive scoring (responses that would be awarded 1.00 using LPSP), (b) incorrect in strict but correct in sensitive scoring (responses that would be awarded 0.75 using LPSP), and (c) incorrect in both strict and sensitive scoring (other responses).
In the receptive posttest, responses were scored based on three rules. First, in the strict scoring system, responses were marked as incorrect if they were of a different part of speech from the translation given during the treatment. For instance, as billow was used as a noun during learning (see 2.2.6), the translation for the verb meaning of the word (うねる) was marked as incorrect in the strict scoring procedure. Second, responses were marked as incorrect in the strict scoring method if an intransitive verb was given for a transitive verb or vice versa. For instance, since pique is a transitive verb, 怒る [intransitive] was scored as incorrect in the strict scoring procedure, whereas 怒らせる [transitive] was marked as correct. Third, in the sensitive scoring system, the above two kinds of responses were both marked as correct. In order to maintain consistency, responses on the receptive test were also scored by a computer program based on answer keys compiled by the author, a native speaker of Japanese. The responses that were marked as incorrect in both strict and sensitive scoring were also manually checked by the researcher.
Responses on the receptive pretest were scored using the same procedures as in the posttest with one exception. In the pretest, a difference in the part of speech was ignored as long as the target word can be used in the given part of speech. This is because the pretest was administered prior to the treatment and participants did not know in which part of speech a given target word would be used during the learning phase. For instance, as billow was used as a noun during learning, the translation for the verb meaning (うねる) was marked as incorrect in the strict scoring in the posttest, whereas it was scored as correct in the sensitive scoring procedure. In the pretest, however, the translation for the verb meaning (うねる) would be marked as correct in both strict and sensitive methods because participants did not know that billow would be used as a noun in this experiment. None of the responses on the receptive pretest fell into this category.
2.2.12 Results

Study time

First, let us examine whether study time was comparable among the three groups. The average study time (SDs in parentheses) was 19.29 (3.30), 18.34 (2.34), and 18.31 (2.82) minutes in the BS 4, 10, and 20 groups, respectively. (The study time here refers to the time that participants spent studying the target items and excludes the time spent on the pretest, posttest, filler items, filler task, questionnaire, and so forth.) No statistically significant difference was found among the three groups in study time, F (2, 90) = 1.13, p = .328, and the effect size was negligible (η2 < .01). As the difference in study time was rather small, the efficiency scores (posttest score divided by the study time; e.g., Kornell, 2009; Mondria, 2003; Pyc & Rawson, 2007) were not calculated in this experiment.
Next, let us investigate whether spacing in the three groups was equivalent when time, not trial, was used as an index of spacing (see 2.2.8). Table 3 summarises the average number of seconds between repetitions of a given word pair in the three groups. For instance, the table shows that in the BS 4 group, the first and second encounters of a given target word were separated by 37.88 seconds on average. As shown by Table 3, the average amount of time between repetitions of a given item was 233.60, 226.28, and 224.14 seconds in the BS 4, 10, and 20 groups, respectively. The difference was not statistically significant, F (2, 90) = 0.63, p = .535, and the effect size was negligible (η2 < .01). As the difference is relatively small, it may be possible to assume that the three groups had roughly equivalent spacing whether trial or time is used as an index of spacing.
Table 3
Average Spacing (Seconds) by Group and Encounter

                           Encounters
Groups                 E1-E2     E2-E3     E3-E4     E4-E5     Average
BS 4  (n = 28)   M     37.88     524.36    37.52     334.66    233.60
                 SD    8.79      90.90     7.16      45.94     36.63
BS 10 (n = 30)   M     104.38    398.55    106.64    295.55    226.28
                 SD    18.59     57.75     17.19     29.26     28.52
BS 20 (n = 33)   M     226.47    225.33    213.89    230.88    224.14
                 SD    47.30     46.45     47.73     21.94     36.16

Note. E1 = encounter 1; E2 = encounter 2; E3 = encounter 3; E4 = encounter 4; E5 = encounter 5.
Learning phase performance

There were four retrieval attempts for each target word during the treatment (see 2.2.7). Table 4 summarises the number of correct responses for the four retrieval attempts. For instance, the table shows that in the BS 4 group, the average number of correct responses was 11.93, 9.14, 10.32, and 10.86 out of 20 for the first, second, third, and fourth retrieval attempts, respectively. Note that the first and second retrievals used the receptive recall format (Figure 3, top) while the third and fourth retrievals used the productive recall format (Figure 3, bottom).
Table 4
Number of Correct Responses During the Learning Phase

                           Retrieval attempts
Groups                 1        2        3        4
BS 4  (n = 28)   M     11.93    9.14     10.32    10.86
                 SD    4.14     5.15     5.13     5.18
BS 10 (n = 30)   M     8.40     8.83     7.80     9.60
                 SD    3.54     4.35     4.89     5.54
BS 20 (n = 33)   M     5.58     9.73     6.70     9.64
                 SD    2.75     4.27     4.41     5.34

Note. The maximum score is 20 for each cell. Responses were scored with the strict scoring procedure (see 2.2.11).
In order to determine whether any statistically significant difference existed among the three groups, the number of correct responses on receptive and productive retrieval was submitted to two separate two-way mixed design 3 (BS: 4 / 10 / 20) x 2 (retrieval attempt: 1 / 2 for receptive retrieval and 3 / 4 for productive retrieval) ANOVAs. The ANOVA for receptive retrieval showed a significant main effect of BS, F (2, 88) = 4.44, p = .015, partial η2 = .09, and a significant interaction between BS and the retrieval attempt, F (2, 88) = 44.02, p < .001, partial η2 = .50. The main effect of retrieval attempt fell short of significance, F (1, 88) = 3.90, p = .051, partial η2 = .04. The ANOVA for productive retrieval detected a significant main effect of retrieval attempt, F (1, 88) = 44.52, p < .001, partial η2 = .34, and a significant interaction between BS and the retrieval attempt, F (2, 88) = 6.96, p = .002, partial η2 = .14. The main effect of BS was not significant, F (2, 88) = 1.97, p = .146, partial η2 = .04.
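As a consistency check on the effect sizes reported above, partial η2 can be recovered from an F ratio and its degrees of freedom through the identity partial η2 = (F x df_effect) / (F x df_effect + df_error). The sketch below is illustrative only and is not the analysis software used in this study; the function name partial_eta_squared is hypothetical.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F ratio and its degrees of freedom."""
    if df_effect <= 0 or df_error <= 0:
        raise ValueError("degrees of freedom must be positive")
    return (f_value * df_effect) / (f_value * df_effect + df_error)


if __name__ == "__main__":
    # Main effect of BS in the receptive ANOVA: F (2, 88) = 4.44
    print(round(partial_eta_squared(4.44, 2, 88), 2))   # 0.09
    # Productive ANOVA: retrieval attempt, F (1, 88) = 44.52
    print(round(partial_eta_squared(44.52, 1, 88), 2))  # 0.34
    # Productive ANOVA: BS x retrieval attempt, F (2, 88) = 6.96
    print(round(partial_eta_squared(6.96, 2, 88), 2))   # 0.14
    # Productive ANOVA: main effect of BS, F (2, 88) = 1.97
    print(round(partial_eta_squared(1.97, 2, 88), 2))   # 0.04
```

The recovered values agree with the partial η2 values reported for these four effects.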
As the interaction between BS and the retrieval attempt proved significant on both receptive and productive retrieval, the simple main effect of BS was tested to investigate where the significant differences lay. The simple main effect of BS was significant on the first, F (2, 88) = 25.21, p < .001, and third retrievals, F (2, 88) = 4.46, p = .014, but not on the second, F (2, 88) = 0.31, p = .734, and fourth retrievals, F (2, 88) = 0.52, p = .598. To follow up the significant simple main effects on the first and third retrievals, the Bonferroni method of multiple comparisons was used. The multiple comparisons showed that (a) on the first retrieval attempt, the BS 4 group significantly outperformed the BS 10 (p = .001) and BS 20 groups (p < .001), producing large effect sizes (0.94 < d < 1.87), (b) on the first retrieval attempt, the BS 10 group significantly outperformed the BS 20 group (p = .005), producing a large effect size (d = 0.91), (c) the BS 4 group significantly outperformed the BS 20 group on the third retrieval attempt (p = .013), yielding a large effect size (d = 0.80), and (d) no statistically significant difference existed between the BS 10 group and the other two groups on the third retrieval attempt, and medium or small effect sizes were found (0.25 < d < 0.53). Overall, the findings suggest that the BS 4 group produced the largest number of correct responses during retrieval practice followed by the BS 10 group.
Posttest performance

Table 5 provides the immediate and delayed posttest results for the three groups. The productive and receptive test scores were each analysed by a two-way mixed design 3 (BS) x 2 (RI: immediate / 1-week delayed) ANOVA. Table 6 shows the results of the ANOVAs. The ANOVAs showed a significant main effect of RI on both productive and receptive tests regardless of the scoring system. In other words, the delayed posttest scores were significantly lower than the immediate posttest scores on both productive and receptive tests. Neither the main effect of BS nor the interaction between BS and the RI reached significance regardless of the posttest or scoring system. The results indicate that BS might have had little effect on posttest performance in this experiment. These findings are also supported by the relatively small effect sizes for the main effect of BS (partial η2 ≤ .02) and the interaction between BS and the RI (.01 ≤ partial η2 ≤ .05).
Table 5
Number of Correct Responses on the Posttests

Immediate posttest
                           Productive             Receptive
Groups                 Strict    Sensitive    Strict    Sensitive
BS 4  (n = 28)   M     12.82     14.86        14.54     14.93
                 SD    5.03      4.67         4.71      4.74
BS 10 (n = 30)   M     11.83     13.50        13.87     14.27
                 SD    5.15      4.93         4.45      4.53
BS 20 (n = 33)   M     11.88     13.30        14.39     14.70
                 SD    6.24      6.08         5.12      5.23

Delayed posttest
                           Productive             Receptive
Groups                 Strict    Sensitive    Strict    Sensitive
BS 4  (n = 28)   M     3.07      5.57         11.50     11.71
                 SD    3.10      3.92         5.15      5.26
BS 10 (n = 30)   M     3.03      5.13         10.27     10.53
                 SD    3.62      4.39         5.21      5.19
BS 20 (n = 33)   M     3.06      4.58         9.61      9.82
                 SD    3.67      4.18         5.63      5.65

Collapsed across the RIs
                           Productive             Receptive
Groups                 Strict    Sensitive    Strict    Sensitive
BS 4  (n = 28)   M     7.95      10.21        13.02     13.32
                 SD    3.68      3.90         4.74      4.81
BS 10 (n = 30)   M     7.43      9.32         12.07     12.40
                 SD    3.95      4.25         4.64      4.66
BS 20 (n = 33)   M     7.47      8.94         12.00     12.26
                 SD    4.50      4.66         4.97      5.03

Note. The maximum score is 20 for each cell. Strict = strict scoring; Sensitive = sensitive scoring (see 2.2.11).
Table 6
Results of Two-Way ANOVAs for the Posttest Scores

                          Strict scoring                         Sensitive scoring
Posttests    Effects      df    F         p         partial η2   df    F         p         partial η2
Productive   BS           2     0.14      .867      .00          2     0.69      .505      .02
             RI           1     396.57    < .001    .82          1     408.41    < .001    .82
             BS X RI      2     0.45      .638      .01          2     0.36      .698      .01
Receptive    BS           2     0.41      .664      .01          2     0.42      .660      .01
             RI           1     118.71    < .001    .57          1     123.63    < .001    .58
             BS X RI      2     2.24      .113      .05          2     1.97      .145      .04
Table 5 also shows that the SDs of the posttest scores were relatively large, indicating a wide range of results between individuals. The lack of significant differences among the three BSs may be partly ascribed to the relatively large SDs. Alternatively, the lack of statistical significance may be due to a Type II error. That is, although a true main effect and / or interaction may exist, it might not have reached significance because the sample size was too small and the ANOVA therefore did not have enough statistical power (see below for the definition). In order to explore the possibility of a Type II error, a post hoc power analysis was carried out. A power analysis allows us to estimate the statistical power, or the probability of detecting a statistically significant effect provided that such an effect exists (Cohen, 1992). Table 7 summarises the results of the power analysis.
Table 7
Results of Post Hoc Power Analysis

              Productive posttest        Receptive posttest
Effects       Strict      Sensitive      Strict      Sensitive
BS            .08         .19            .13         .13
RI            1.00        1.00           1.00        1.00
BS X RI       .48         .42            1.00        1.00

Note. Strict = strict scoring; Sensitive = sensitive scoring.
Table 7, for instance, shows that with strict scoring on the productive posttest, the ANOVA had a power of .08, or an 8% probability of detecting statistical significance, for the main effect of BS. As shown by Table 7, the analysis had a power of .08 - .19 for the main effect of BS on the productive and receptive posttests and a power of .42 - .48 for the interaction between BS and the RI on the productive posttest.
Considering that a power of .80 (Cohen, 1988) is recommended to avoid a Type II error, the power analysis suggests that the lack of statistical significance may be partly due to a Type II error.
Questionnaire

In the questionnaire given after the immediate posttest, participants were asked to indicate what they considered the optimal BS to be when studying 20 English words by choosing a number from 1 to 20 (see 2.2.5). On average, the participants perceived BSs of 6.48 (3.95), 7.86 (4.70), and 7.82 (4.59) words to be most effective (SDs in parentheses) in the BS 4, 10, and 20 groups, respectively. (One participant from each of the BS 4 and 10 groups did not provide responses.) No statistically significant difference existed among the three groups in their responses, F (2, 88) = 0.88, p = .419, and the effect size was negligible (η2 < .01). The results indicate that (a) learners tended to believe that a relatively small BS might be effective when studying 20 English words, and (b) the three groups did not differ significantly from each other in their responses.
2.2.13 Discussion

Learning phase performance

The BS 4 group produced the largest number of correct responses during retrieval practice followed by the BS 10 group. On the first retrieval, the BS 4 group significantly outperformed the other two groups, and large effect sizes were observed (0.94 < d < 1.87). The results may be partly ascribed to a difference in spacing between encounters. More specifically, the first retrieval attempt occurred after a shorter ISI (3 trials; Figure 7) for the BS 4 group than for the other two groups (BS 10: 9 trials, BS 20: 19 trials; see Figure 6 and Figure 5, respectively). As a result, the memory for the target items might have decayed more in the BS 10 and 20 groups by the time of the first retrieval, possibly leading to the BS 4 group’s higher recall. The BS 10 group significantly outperformed the BS 20 group on the first retrieval as well (d = 0.91). The BS 10 group might have fared significantly better because the ISI before the first retrieval was shorter for this group (9 trials) compared with that of the BS 20 group (19 trials).
The third retrieval attempt also occurred after a longer ISI for the BS 20 group (19 trials; Figure 5) than for the other two groups (BS 4: 3 trials, BS 10: 9 trials; Figure 6 and Figure 7). However, the advantage of the BS 4 and 10 groups was smaller on the third retrieval than on the first. The results might have been caused by the difference in the retrieval format: While target items were practised in receptive recall in the first and second retrievals, the third and fourth retrievals used the productive recall format (see 2.2.7). Because productive retrieval is more demanding than receptive retrieval (Griffin & Harley, 1996; Mondria & Wiersma, 2004; Schneider, Healy, & Bourne, 2002; Waring, 1997a), the BS 4 and 10 groups’ accuracy on the third retrieval attempt might not have improved as much as on the first retrieval despite the shorter ISI. As a result, the advantage of the BS 4 and 10 groups might have been smaller on the third
retrieval.
The second and fourth retrieval attempts occurred after a shorter interval for the BS 20 group in comparison with the BS 4 and 10 groups (Figure 5). However, there was no significant difference between the BS 20 and the other two groups in recall performance on these retrievals. The results may be ascribed in part to the retrieval practice effect (e.g., Baddeley, 1997, p. 112; Ellis, 1995), according to which a successful retrieval from memory yields superior retention to an unsuccessful retrieval (see 2.1). Due to this effect, the BS 4 and 10 groups’ successful retrievals on the first and third attempts might have facilitated subsequent recall, enabling the two groups to perform as well as the BS 20 group on the second and fourth retrievals despite the rather long ISIs.
Posttest performance

Although the BS 4 group showed the best learning phase performance, no statistically significant difference was found among the three groups in their posttest scores. The findings are at odds with most previous studies on BS, which found an advantage of a large BS over a small one (Brown, 1924; Crothers & Suppes, 1967, Experiments 8 & 9; Kornell, 2009, Experiments 1-3; McGeoch, 1931, Experiments 1 & 3; Seibert, 1932). There may be five interpretations for the contradictory results. First, the inconsistent findings may stem from a methodological difference. As pointed out in 2.1.2, BS and spacing were confounded in all existing studies where target words were encountered more than once. In this experiment, however, all three BSs had equivalent spacing, and the effects of BS and spacing were isolated. The findings of the present experiment may suggest that as long as spacing is equivalent, BS may have little effect on learning.
Another possible cause for the lack of statistical significance may be the relatively limited range of BSs used. In this experiment, the BSs of four, 10, and 20 words were compared because based on previous studies (Brown, 1924; Kornell, 2009; McGeoch, 1931; Seibert, 1932), it was judged that the use of these three BSs was ecologically valid and might provide statistically significant results (see 2.2.4). However, considering that the present and previous studies differed in a number of factors such as the participants, materials, retrieval format during learning, number of encounters with target words, posttest format, or RI, the findings of the earlier research may not necessarily be applicable to the present experiment. As a result, a wider range of BSs (e.g., four, 20, and 60 words) might have been necessary to yield statistical significance.
An alternative explanation for the lack of a significant effect may be that the task difficulty was too high in this experiment. Previous studies have suggested that a large BS may be superior to a small one only when difficulty is low (see 2.1.1). As the present experiment used receptive and productive recall formats, which are more demanding than recognition formats (e.g., Butler & Roediger, 2007; Kang et al., 2007;
Laufer et al., 2004; Laufer & Goldstein, 2004; Nation, 1982, 2001, p. 305), the task difficulty might have been relatively high. This may be in part responsible for the lack of significant differences among the three BSs. Fourth, the lack of significant differences among the three BSs may be partly ascribed to the relatively large differences between individuals as indicated by the SDs on the posttest scores. Lastly, the lack of statistical significance may be due to a Type II error (see 2.2.12).
Rationale for the next experiment

Unlike existing studies, Experiment 1A compared the effectiveness of different BSs that were matched in spacing. The experiment suggested that there may be little difference among the BSs of four, 10, and 20 words in their effectiveness. The findings may be significant because they imply that spacing may be a more important factor than BS. A possible limitation of this experiment, however, may be that we cannot necessarily attribute the lack of statistical significance to the fact that the three BSs were controlled for spacing. In order to argue that equivalent spacing was responsible for the inconsistent results from earlier studies, at least the following three hypotheses may need to be supported: (1) when spacing is equivalent, a large BS does not outperform a small BS, (2) when spacing is not equivalent, a large BS outperforms a small BS, and (3) a small BS with longer spacing outperforms a small BS with shorter spacing. Although the results of Experiment 1A were consistent with the first hypothesis, the experiment did not allow us to test the latter two. None of the previous studies has examined the above three hypotheses either. Unless the second and third hypotheses are also verified, we cannot necessarily rule out the possibility that some factors other than spacing such as the limited range of BSs used, high task difficulty (Crothers & Suppes, 1967; Nation, 1982, 2001, p. 305), relatively large SDs on the posttest scores, and / or a Type II error might have been responsible for the lack of a significant effect. With this limitation in mind, Experiment 1B was conducted to test the above three hypotheses.
Another limitation of the current experiment may be that the three BSs were not controlled for the position of initial productive retrieval during learning. In this experiment, the first and second retrievals (second and third encounters) used the receptive recall format, and the third and fourth retrievals (fourth and fifth encounters) used the productive recall format (see 2.2.7). As the three groups differed in when a target word was initially encountered for the third retrieval, they also differed in the position of initial productive retrieval (* in Figure 5). Specifically, a target word was practised productively for the first time in the 56th, 62nd, and 72nd trials in the BS 4, 10, and 20 groups, respectively (BS 4: beginning of Cycle 12 in Figure 7, 3 fillers + 4 trials x 11 cycles + 8 fillers + 1 trial = 56th trial; BS 10: beginning of Cycle 6 in Figure 6, 6 fillers + 10 trials x 5 cycles + 5 fillers + 1 trial = 62nd trial; BS 20: beginning of the fourth cycle of target items in Figure 5, 11 fillers + 20 trials x 3 cycles + 1 trial = 72nd trial). The difference in the position of initial productive retrieval may be problematic because it might affect how much attention participants may pay to the spelling of target words during the treatment. More specifically,
exposure to productive retrieval earlier in the BS 4 treatment might have encouraged the BS 4 group to pay close attention to the spelling of target words earlier than the other two groups, potentially leading to higher gains in productive knowledge. With this limitation in mind, the position of initial productive retrieval will be controlled in the next experiment.4
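The trial positions given above for the first productive retrieval (56th, 62nd, and 72nd) follow from simple counting. The sketch below reproduces that arithmetic. It is only an illustration: the function and parameter names are hypothetical, and the filler and cycle counts are taken from the figures cited in the text.

```python
def first_productive_trial(leading_fillers: int, block_size: int,
                           cycles_before: int, extra_fillers: int = 0) -> int:
    """Position of the first productive retrieval trial in the learning phase.

    leading_fillers: filler trials at the start of the treatment
    block_size:      number of target trials per cycle
    cycles_before:   completed cycles before the first productive trial
    extra_fillers:   additional filler trials met before that trial
    """
    return leading_fillers + block_size * cycles_before + extra_fillers + 1


if __name__ == "__main__":
    # BS 4:  3 fillers + 4 trials x 11 cycles + 8 fillers + 1 trial
    print(first_productive_trial(3, 4, 11, 8))   # 56
    # BS 10: 6 fillers + 10 trials x 5 cycles + 5 fillers + 1 trial
    print(first_productive_trial(6, 10, 5, 5))   # 62
    # BS 20: 11 fillers + 20 trials x 3 cycles + 1 trial
    print(first_productive_trial(11, 20, 3))     # 72
```

The 16-trial gap between the BS 4 and BS 20 groups is the difference that the next experiment sets out to control.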
4 The position of initial receptive retrieval, in contrast, was roughly equivalent in all three groups: the eighth trial in the BS 4 group (beginning of Cycle 2 in Figure 7) and the fourth trial in the other two groups (fourth filler trial; Figure 5). It might be reasonable to assume, therefore, that the difference in the position of initial receptive retrieval probably had little effect on posttest results.

2.3 Experiment 1B

2.3.1 Purpose

The purpose of Experiment 1B was two-fold. The central goal was to test the three hypotheses put forward in 2.2.13. If all three hypotheses are supported, it would suggest that (a) spacing may have a larger effect on learning than BS, and (b) the lack of statistical significance in Experiment 1A might have arisen because the three BSs had equivalent spacing. By testing the three hypotheses, the present experiment may allow us to examine the relative importance of BS and spacing on L2 vocabulary learning, possibly helping us to determine what BS should be used in order to optimise L2 vocabulary learning from flashcards. Another objective for the current experiment was to compare the effects of different BSs while controlling the position of initial productive retrieval during learning. In Experiment 1A, the three BSs were not matched for the position of initial productive retrieval (* in Figure 5), which might have affected learning (see 2.2.13). In order to investigate the effects of BSs in a more rigorous manner, the position of initial productive retrieval was controlled in the present experiment.
2.3.2 Participants

The original pool of participants consisted of 92 first-year Japanese students at the same university as in Experiment 1A. Out of the 92 students, 14 were dropped because they were absent on the day of the experiment, chose not to participate, or had their data lost due to technical problems, leaving 78 students. The 78 participants consisted of 39 Engineering and 39 Economics majors. Their average score on the first to the sixth 1,000-word frequency levels of the VST (Nation & Beglar, 2007) was 29.83 (SD = 6.27) out of 60 and significantly lower than that of Experiment 1A participants, t (167) = 4.17, p < .001, yielding a medium-sized effect (r = .31). The participants were randomly assigned to the control, BS 4, and 20 groups so that there would be no significant difference in the VST scores, F (2, 77) = .03, p = .972, η2 < .01. Each of the control, BS 4, and 20 groups consisted of 26 participants. None of the participants exhibited prior knowledge of any of the target words on a productive pretest. It is assumed, therefore, that the three groups did not differ from one another in terms of their productive knowledge of the target words at the outset of the experiment. The results of the receptive pretest will be discussed in 2.3.6. The three groups also had a roughly equal number of Engineering and Economics majors (Table 8).
Table 8
VST Scores and Majors of Participants

                    VST scores          Majors
Groups     n        M        SD         Engineering    Economics
Control    26       29.73    6.02       12             14
BS 4       26       30.08    6.57       14             12
BS 20      26       29.69    6.45       13             13

Note. The maximum score is 60 for the VST.
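The effect size r reported in 2.3.2 for the VST comparison between the participants of Experiments 1A and 1B can be recovered from the t statistic with the standard conversion r = sqrt(t^2 / (t^2 + df)). The sketch below is illustrative only; the function name r_from_t is hypothetical.

```python
import math


def r_from_t(t_value: float, df: int) -> float:
    """Convert a t statistic and its degrees of freedom to the effect size r."""
    if df <= 0:
        raise ValueError("degrees of freedom must be positive")
    return math.sqrt(t_value ** 2 / (t_value ** 2 + df))


if __name__ == "__main__":
    # t (167) = 4.17 for the VST difference between the two experiments
    print(round(r_from_t(4.17, 167), 2))  # 0.31
```

The degrees of freedom are consistent with the two sample sizes (91 + 78 - 2 = 167).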
2.3.3 Experimental design

The following three treatments were compared in this experiment: (a) a BS of 20 words (BS 20 treatment), (b) a BS of four words with the same spacing as the BS of 20 words (BS 4 treatment), and (c) a BS of four words with shorter spacing than the BS of 20 words (control treatment). The first two treatments were exactly the same as in Experiment 1A. They had different BSs, but had equivalent spacing (see 2.2.8). The control treatment used the same BS as the BS 4 treatment, but had shorter spacing than the other two.5 Hereafter, the BS 4 and 20 treatments will be collectively referred to as the experimental treatments as opposed to the control treatment. The BS 10 treatment was not used in this experiment because (a) the purpose of the current experiment can be achieved by comparing only the control, BS 4, and 20 treatments, and (b) Experiment 1A indicated that no significant difference existed among the BSs of four, 10, and 20 words in their effectiveness. Figure 8 summarises the item order in the three treatments. Figure 8 should be read in the same way as Figure 5. (I) and (P) in Figure 8 denote the initial presentation and productive recall, respectively. ‘1-4 (P)’ indicates that Items 1 to 4 were studied in a productive recall format once. ‘Filler x 11’ means that there were 11 trials for filler items (see 2.2.8).

5 The control treatment in this study was not a true control treatment (see Norris & Ortega, 2000) because the control group was also exposed to the target materials. For purposes of exposition, however, this treatment will be referred to as the control treatment.
Note. Primacy = primacy buffers; Recency = recency buffers; (I) = initial presentation; (P) = productive recall. Figure 8. Item order in Experiment 1B.
In the control treatment, target words were repeated in five blocks of four items (Items 1 - 4, 5 - 8, 9 - 12, 13 - 16, and 17 - 20), and encounters of a given item were separated by three trials on average throughout the treatment. As a result, the control treatment had shorter spacing (mean ISI = 3 trials) than the two experimental treatments (mean ISI = 19 trials; see 2.2.8). Unlike in the BS 4 and 20 treatments, the final review was not used in the control treatment for three reasons. First, adding the final review would increase spacing in the control treatment. If the final review were added, the average ISI in the control treatment would be 13 trials ([3 + 3 + 3 + 67 + 3 + 3 + 3 + 55 + 3 + 3 + 3 + 43 + 3 + 3 + 3 + 31 + 3 + 3 + 3 + 19] / 20 = 13), which would not be very different from the ISI in the other two treatments (19). Adding the final review, thus, might make it difficult to test Hypotheses 2 and 3.
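The average ISI of 13 trials quoted above can be checked with a few lines of code. The sketch below simply rebuilds the list of 20 intervals from the bracketed sum (three 3-trial gaps per block, followed by the gap to a hypothetical final review) and averages them. The function and variable names are illustrative and not part of the original study.

```python
def mean_isi_with_final_review(within_block_gap: int, repetitions: int,
                               final_review_gaps: list[int]) -> float:
    """Average ISI (in trials) if a final review were added to the control treatment.

    within_block_gap:  ISI between massed encounters inside a block
    repetitions:       number of such short gaps per block
    final_review_gaps: ISI before the final review, one value per block
    """
    isis: list[int] = []
    for review_gap in final_review_gaps:
        isis.extend([within_block_gap] * repetitions)
        isis.append(review_gap)
    return sum(isis) / len(isis)


if __name__ == "__main__":
    # Five blocks; gaps before the final review are 67, 55, 43, 31, and 19 trials.
    print(mean_isi_with_final_review(3, 3, [67, 55, 43, 31, 19]))  # 13.0
```

Without the final review, every interval in the control treatment is 3 trials, which is why the mean ISI stays at 3.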
Second, it is true that if the final review is not used, the first four blocks of items (Items 1 - 16) in the control treatment will have greater lag to test than in the experimental treatments (see Figure 8), and results of the current experiment may be partly attributed to lag to test rather than spacing per se. However, previous studies suggest that long spacing may be more effective than short spacing even though lag to test is equivalent (Brown, 1924; Kornell, 2009, Experiment 3; McGeoch, 1931, Experiment 3; see 2.1.2). Not controlling the lag to test, hence, may not have a very large effect on the results of this experiment.
Third, previous studies suggest that the effects of lag to test may diminish as the RI increases (e.g., Anderson & Jordan, 1928; Bahrick, 1984; Cepeda et al., 2008; Rawson & Dunlosky, 2011; Seibert, 1927, 1930, 1932). In other words, although lag to test may affect immediate posttest scores to some extent, its effect on the delayed posttest results might be minimal. For the above three reasons, it was decided not to use the final review in the control treatment. In order to ascertain to what extent differential lag to test might have affected learning, a possible relationship between the posttest performance and lag to test was analysed after the experiment. This analysis suggested that lag to test might have had little effect on the posttest scores (see 2.3.6).
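The average ISI of 13 trials given under the first reason can be reproduced with a short calculation. The following is only an illustrative sketch: the function name `mean_isi` and its parameters are hypothetical, and the interval values are simply those listed in the bracketed sum above (three 3-trial intervals per block, followed by one longer interval before the hypothetical final review).

```python
# Illustrative sketch: mean inter-study interval (ISI) in the control
# treatment if a final review had been added. The helper name and its
# parameters are hypothetical; the interval values come from the text.

def mean_isi(short_isi, short_count, long_isis):
    """Average ISI when each block has `short_count` short intervals
    followed by one long interval before the final review."""
    isis = []
    for long_isi in long_isis:
        isis.extend([short_isi] * short_count)  # within-block intervals
        isis.append(long_isi)                   # interval before final review
    return sum(isis) / len(isis)


# Blocks 1 to 5 would wait 67, 55, 43, 31, and 19 trials, respectively.
with_final_review = mean_isi(3, 3, [67, 55, 43, 31, 19])
print(with_final_review)  # 13.0, compared with 19 in the BS 4 and 20 treatments
```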
2.3.4 Method
The methodology of this experiment differed from that of Experiment 1A in three respects. First, although the target words were practised in both receptive and productive recall formats in the earlier experiment, the present experiment involved only productive retrieval. Second, whereas only a receptive pretest was given in Experiment 1A, both receptive and productive pretests were administered in this experiment. Third, the filler items used in the previous experiment (husk, polemic, and smudge) were replaced with promontory, urn, and vestige. The rationale for the above three changes is given below.
Retrieval format
Unlike in Experiment 1A, the treatment in this experiment involved only productive retrieval (Figure 3, bottom). This was done to control the position of initial productive retrieval during learning. In the earlier experiment, the first and second retrievals (second and third encounters) used the receptive recall format, and the third and fourth retrievals (fourth and fifth encounters) used the productive recall format (see 2.2.7). If the same procedure were employed in the current experiment, the three treatments would differ greatly in the position of initial productive retrieval. Specifically, while target words would be practised productively for the first time in the 24th trial (beginning of the fourth cycle of the first block [Items 1-4] in Figure 8; 11 fillers + 4 trials x 3 + 1 trial = 24) out of 114 in the control group, the productive format would not be used until the 56th and 72nd trials in the BS 4 and 20 groups, respectively (* in Figure 5; see 2.2.13). The difference in the position of initial productive retrieval may be problematic because it might affect how much attention participants may pay to the spelling of target words during the treatment.
In the present experiment, the initial position of productive retrieval was controlled by taking out receptive retrieval and using only productive retrieval. By so doing, the position of initial productive retrieval was roughly equivalent in all three groups: the eighth trial in the BS 4 group (beginning of Cycle 2 in Figure 7) and the fourth trial in the other two groups (fourth filler trial; Figure 8). Productive, not receptive, retrieval was used because previous studies (Griffin & Harley, 1996; Mondria & Wiersma, 2004; Steinel et al., 2007; Waring, 1997a; S. A. Webb, 2009b) indicate that productive retrieval may result in adequate gains in receptive knowledge as well as large gains in productive knowledge and may be more effective than receptive retrieval (see 3.1.1).
One potential problem with changing the retrieval format may be that it might reduce the probability of showing a difference between BSs. Previous studies have demonstrated that a large BS may be more effective than a small one only when difficulty is low (see 2.1.1). Because productive retrieval is more demanding than receptive (Griffin & Harley, 1996; Mondria & Wiersma, 2004; Schneider et al., 2002; Waring, 1997a), taking out receptive retrieval and using only productive retrieval may diminish a potential difference among groups by increasing task difficulty. The posttest results, however, suggested that the change in the retrieval format might have had little effect on the posttest scores (see 2.3.6).
Pretest
Whereas only a receptive pretest was given in Experiment 1A, both receptive and productive pretests were administered in this experiment. The productive pretest was also conducted because scores on the productive posttest were the main dependent variables in this experiment: As target words were practised only in productive recall in the current experiment (see above), scores on the productive posttest, which used exactly the same format as the productive recall format during learning, might be a more direct and reliable measure of learning outcomes than those on the receptive posttest. Hence, it was decided to measure productive as well as receptive knowledge in the pretest.
The productive pretest was given prior to the receptive pretest. This was done for two reasons. First, as scores on the productive posttest were the main dependent variables, it was more important to have accurate scores on the productive pretest than on the receptive. The productive pretest, therefore, was followed by the receptive pretest so that scores on the former might not be influenced by the latter. Second, administering the receptive pretest earlier may have a larger effect on test scores than doing it the other way around because participants might acquire new vocabulary knowledge through the receptive test (see 2.2.10). With these considerations in mind, the productive pretest preceded the receptive pretest.
In the productive pretest, it is necessary to prevent participants from providing synonyms for a target word because if participants produce hair for the target word mane, for instance, it is not clear whether or not they are familiar with mane. In order to prevent learners from providing synonyms, one letter in the target word and the number of letters in the word (hereafter, retrieval cue) were given together with the Japanese translation in the pretest. For instance, for mane, _ _ n _ was provided as the retrieval cue in addition to the Japanese translation. The retrieval cues were determined based on the following rules:
1) A letter is provided so that there is no synonym with the same number of letters (hereafter, same-length synonym) that has the same letter in the same position. For instance, if _ a _ _ is given for mane, some learners may produce hair, a synonym for mane. To prevent participants from doing so, _ _ n _ was chosen as the retrieval cue for mane. In order to meet Rule 1, a list of same-length synonyms for each target and filler item was compiled based on the following: Collins Thesauruses, WordNet (http://wordnet.princeton.edu/), Kenkyusha's New College Japanese-English Dictionary, and pilot testing with seven advanced Japanese learners of English.
2) The first or last letter of the word is not given as the retrieval cue. This is because learners tend to remember the beginning or ending of the word more than the middle (Barcroft & Rott, 2010), and providing the initial or final letter may have a larger effect on pretest performance than giving a letter in the rest of the word.
3) In order to minimise effects that the productive pretest may have on receptive pretest scores, a letter is chosen so that there is another target or filler item that has the same letter in the same position. For instance, if learners are given _ _ q _ _ for 怒らせる (pique), they may be able to infer that a five-letter word that has q as the third letter means 怒らせる. Because there is no other five-letter item that has q in the middle, when learners encounter pique in the receptive pretest, they may be able to answer correctly on this item without having any prior knowledge of the word. In contrast, suppose that _ i _ _ _ is given as the cue for pique. As there is another five-letter target word that has i as the second letter (mirth), learners may not know whether pique or mirth means 怒らせる (pique). As a result, _ i _ _ _ will be chosen over _ _ q _ _ in order to minimise effects on receptive pretest scores.
4) When it is not possible to meet Rule 3 above, a letter that appears more frequently throughout the target and filler items than others will be given. For instance, based on Rules 1 and 2, for the target word fracas, the following three cues are possible: _ r _ _ _ _, _ _ a _ _ _, and _ _ _ c _ _. ( _ _ _ _ a _ is not possible because of the synonym affray; Rule 1.) Note that none of these three satisfies Rule 3 above. In other words, there is no other six-letter item that has r as the second, a as the third, or c as the fourth letter. In this case, _ _ a _ _ _ will be chosen because a is used more frequently (13 times) than r (10 times) and c (five times) among the target and filler items. A frequently used letter will be provided as the cue in order to minimise effects that the productive pretest may have on receptive pretest performance. For instance, suppose that learners are given _ _ _ c _ _ for けんか (fracas). As there are only three items that have c in the medial position (fracas, loach, and scowl), learners might be able to answer correctly on fracas in the receptive pretest without having any prior knowledge. In contrast, suppose that learners are given _ _ a _ _ _ for けんか (fracas). As there are 11 items that have a in the medial position (apparition, cadge, citadel, dally, fawn, fracas, loach, mane, nadir, quail, and warble), learners may be less likely to guess that fracas means けんか unless they have some prior knowledge, thus minimising possible effects on receptive pretest scores.
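The four rules above can be read as a small selection procedure. The sketch below is only one possible reading and not the procedure actually used to build Appendix C: the function name `choose_cue`, the short item list, and the empty synonym list for pique are hypothetical, and the tie-break among several positions that satisfy Rule 3 (prefer the more frequent letter) is an assumption, since the rules do not specify one.

```python
from collections import Counter


def choose_cue(word, items, synonyms):
    """Return a retrieval cue such as '_ _ n _' for `word`.

    items:    all target and filler words (hypothetical subset here)
    synonyms: synonyms of `word`; only same-length ones matter
    """
    n = len(word)
    letter_freq = Counter("".join(items))

    # Rule 2: the first and last letters are never revealed.
    # Rule 1: skip positions where a same-length synonym has the same letter.
    candidates = [
        i for i in range(1, n - 1)
        if not any(len(s) == n and s[i] == word[i] for s in synonyms)
    ]
    if not candidates:
        raise ValueError(f"no admissible cue position for {word!r}")

    # Rule 3: prefer a letter shared, in the same position, with another
    # item of the same length.
    shared = [
        i for i in candidates
        if any(o != word and len(o) == n and o[i] == word[i] for o in items)
    ]

    # Rule 4: otherwise fall back on the letter used most often across all
    # items. Using the same criterion as a Rule 3 tie-break is an assumption.
    pool = shared or candidates
    best = max(pool, key=lambda i: letter_freq[word[i]])
    return " ".join(ch if i == best else "_" for i, ch in enumerate(word))


# Hypothetical subset of the items mentioned in the text, for illustration.
ITEMS = ["mane", "pique", "mirth", "fracas", "loach", "scowl", "warble", "fawn"]

print(choose_cue("mane", ITEMS, ["hair"]))      # _ _ n _
print(choose_cue("pique", ITEMS, []))           # _ i _ _ _
print(choose_cue("fracas", ITEMS, ["affray"]))  # _ _ a _ _ _
```

With the full set of target and filler items, the letter frequencies would differ from those in this reduced list, so the output for other words should not be taken as a reconstruction of the actual cues.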
Following the above rules, the retrieval cues were determined. Appendix C summarises the cues used in the pretest. A retrieval cue was given for target items without same-length synonyms as well as those with them. This is because if the retrieval cue is provided only for items with same-length synonyms, it may affect pretest performance. In addition to target items, a retrieval cue was also provided for filler items so that participants might not differentiate between target and filler items. Unlike in the pretest, a retrieval cue was not given in the productive posttest. This is because in the posttest, learners were instructed to produce only English words that were studied during the treatment and informed that giving a synonym for target words would be marked as incorrect. See Appendix A for an example of the productive pretest.
Filler words
The filler items used in Experiment 1A (husk, polemic, and smudge) were replaced with promontory, urn, and vestige in this experiment. These three words were chosen based on the same criteria as in the previous experiment (see 2.2.6). The filler items were replaced because rue, citadel, and apparition were the only three-, seven-, and 10-letter target items, respectively, and adding a filler item of the same length might minimise effects that the productive pretest might have on receptive pretest scores. More specifically, when learners are given 後悔する ( _ u _ ) as a cue for rue in the productive pretest, they may be able to infer that a three-letter item used in the experiment means 後悔する (rue). Without any other three-letter item except rue, participants might have a relatively high chance of answering correctly on this item on the receptive pretest without having any prior knowledge. Another three-letter word (urn), therefore, was added as a filler item in the current experiment. With the addition of urn, learners may not know whether rue or urn means 後悔する (rue), which may minimise effects on receptive pretest performance. Similarly, as citadel and apparition were the only seven- and 10-letter target items, seven- and 10-letter filler items (vestige and promontory) were added in the present experiment. Although polemic, a filler item used in Experiment 1A, also consists of seven letters, it was not used in this experiment because using two filler items beginning with p (promontory and polemic) may affect learning.
2.3.5 Scoring
Responses on the pretest and posttest were scored using the same procedures as in Experiment 1A: strict and sensitive (see 2.2.11).
2.3.6 Results
Pretest
None of the participants exhibited prior knowledge of any of the target words on the productive pretest. Participants did not provide synonyms for a target word on the productive pretest either, possibly due in part to the provision of retrieval cues (e.g., _ _ n _ for mane). On the receptive pretest, 18 out of 78 participants answered correctly on one or more target words. The average pretest score with strict scoring (SDs in parentheses) was 0.27 (0.45), 0.31 (0.55), and 0.19 (0.57) out of 20 in the control, BS 4, and 20 groups, respectively. With sensitive scoring, the average pretest score was 0.27 (0.45), 0.35 (0.56), and 0.19 (0.57) out of 20 in the control, BS 4, and 20 groups. The receptive pretest scores in the present experiment were higher than in Experiment 1A, where only four out of 95 participants demonstrated prior knowledge. Considering that the English proficiency of the participants in this experiment may be lower than in Experiment 1A (see 2.3.2), the higher receptive pretest scores might be partially attributed to the fact that the productive pretest preceded the receptive pretest in the current experiment (see 2.3.4). Because correct responses in the receptive pretest were used as cues in the productive pretest, administering the latter prior to the former might have affected performance on the receptive pretest. For instance, as learners were given 後悔する ( _ u _ ) as a cue for rue in the productive pretest, they could perhaps infer that a three-letter word that has u in the middle would mean 後悔する (rue). Consequently, some participants might have been able to answer correctly on this item in the receptive pretest without having any prior knowledge. As the correct responses on the receptive pretest may not necessarily indicate their prior knowledge, participants who answered correctly on the receptive pretest were not excluded from analysis. In order to correct for differences in the pretest scores, gains from the pretest to the posttest were analysed when examining the receptive test results (see below).
Study time
First, let us examine whether the study time was comparable among the three groups. The average study time in the three groups (SDs in parentheses) was 17.87 (3.02), 17.56 (2.13), and 17.34 (2.42) minutes in the control, BS 4, and 20 groups, respectively. (As in 2.2.12, the study time here refers to the time that participants spent studying the target items.) No statistically significant difference was found among the three groups in study time, F (2, 77) = 0.28, p = .755, and the effect size was negligible (η² < .01). As the difference in study time was rather small, the efficiency scores (posttest score divided by the study time; e.g., Kornell, 2009; Mondria, 2003; Pyc & Rawson, 2007) were not calculated in this experiment either.
Next, let us investigate how much time intervened between the repetitions of target words in the three groups. Table 9 gives the average number of seconds between encounters of a given word pair in the three groups. As shown by the table, repetitions of a given item were separated by 32.42, 198.47, and 195.65 seconds on average in the control, BS 4, and 20 groups, respectively. The difference in the time between the repetitions reflects the difference in the number of intervening trials (control: 3 trials, BS 4 and 20: 19 trials; see 2.3.3). A one-way ANOVA found a statistically significant difference among the three groups, F (2, 77) = 383.32, p < .001, η² = .83. According to the Bonferroni method of multiple comparisons, the two experimental groups had significantly longer spacing than the control group (both p < .001), producing large effect sizes (d = 8.48 for control vs. BS 4 and d = 6.95 for control vs. BS 20). The difference between the BS 4 and 20 groups was not statistically significant (p = 1.000), producing a very small effect size (d = 0.09). The results suggest that whether trial or time is used as an index of spacing, (a) the control group had significantly shorter spacing than the BS 4 and 20 groups, and (b) the two experimental groups had roughly equivalent spacing.
Table 9
Average Spacing (Seconds) by Group and Encounter

                          Encounters
Groups          E1-E2    E2-E3    E3-E4    E4-E5    Average
Control   M     34.41    32.43    31.72    31.11      32.42
          SD     8.67     6.93     5.21     7.67       5.38
BS 4      M     34.25   448.95    30.84   279.84     198.47
          SD     3.85    61.15     4.91    43.71      27.15
BS 20     M    184.38   197.81   200.92   199.49     195.65
          SD    28.99    47.80    39.49    34.20      32.76

Note. n = 26 for each group. E1 = encounter 1; E2 = encounter 2; E3 = encounter 3; E4 = encounter 4; E5 = encounter 5.
Learning phase performance
Table 10 summarises the number of correct responses for the four retrieval attempts during the treatment. Note that unlike Experiment 1A, where the target words were studied in both receptive and productive formats, this experiment used only the productive format (see 2.3.4). In order to determine whether any significant difference existed among the three groups, the number of correct responses was submitted to a two-way mixed design 3 (treatment: control / BS 4 / BS 20) x 4 (retrieval attempt: 1 / 2 / 3 / 4) ANOVA. The ANOVA detected significant main effects of treatment, F (2, 76) = 26.47, p < .001, partial η² = .41, and retrieval attempt, F (2.38, 178.63) = 238.55, p < .001, partial η² = .76. The interaction between the two variables was also significant, F (4.76, 178.63) = 24.18, p < .001, partial η² = .39 (see Footnote 6).

Footnote 6. As Mauchly's test showed that sphericity assumptions were violated, the Greenhouse-Geisser correction was used (Field, 2009). As a result, the dfs for the main effect of retrieval attempt and the interaction between the treatment and retrieval attempt contain decimal values.

Table 10
Number of Correct Responses During the Learning Phase

                     Retrieval attempts
Groups             1        2        3        4
Control   M    10.62    14.92    17.19    17.88
          SD    4.45     4.13     3.10     2.83
BS 4      M     7.62     5.23    11.96    10.73
          SD    4.18     3.48     5.00     4.92
BS 20     M     4.31     7.23    10.04    12.77
          SD    2.40     3.81     4.10     4.35

Note. n = 26 for each group. The maximum score is 20 for each cell. Responses were scored with the strict scoring procedure (see 2.2.11).

Due to the significant interaction between the treatment and retrieval attempt, the simple main effect of treatment was tested to investigate where the significant differences lay. The simple main effect of treatment was significant on all four retrieval attempts, F (2, 75) > 18.04, p < .001. To follow up the significant simple main effect, the Bonferroni method of multiple comparisons was used to examine where the significant differences lay at each retrieval attempt. The results of the multiple comparisons are summarised in Table 11. The table indicates the following three things. First, the control group significantly outperformed the other two groups on all four retrieval attempts, and medium to large effect sizes were found (0.69 < d < 2.54). Second, on the first retrieval, the BS 4 group fared significantly better than the BS 20 group, showing a large effect size (d = 0.97). Third, the differences were not statistically significant for all other comparisons (p > .188), and only small to medium effect sizes were observed (0.42 < d < 0.55). Overall, the findings suggest that the control group produced the largest number of correct responses during retrieval practice followed by the BS 4 group.
Table 11
Results of Multiple Comparisons for Learning Phase Performance

                           Control            BS 4
Retrievals   Groups       p       d        p       d
1            BS 4       .017    0.69
             BS 20      .000    1.76     .007    0.97
2            BS 4       .000    2.54
             BS 20      .000    1.94     .188    0.55
3            BS 4       .000    1.26
             BS 20      .000    1.97     .295    0.42
4            BS 4       .000    1.78
             BS 20      .000    1.39     .237    0.44
Posttest performance
Table 12 provides the immediate and delayed posttest results for the three groups. The productive and receptive test scores were analysed by a two-way mixed design 3 (treatment) x 2 (RI) ANOVA. As some items were answered correctly on the receptive pretest (see above), the pretest scores were subtracted from the posttest scores and gains were analysed when examining the receptive test results. Table 13 shows the results of the ANOVAs. The ANOVAs showed a significant main effect of RI on both productive and receptive tests regardless of the scoring procedure. In other words, the delayed posttest scores were significantly lower than the immediate posttest scores on both productive and receptive tests. The main effect of treatment was significant with strict scoring on the productive posttest, and approached significance with sensitive scoring on the productive posttest and with strict and sensitive scoring on the receptive posttest. The interaction between the treatment and RI approached significance with sensitive scoring on the productive posttest, but was not significant with strict scoring on the productive posttest and with strict and sensitive scoring on the receptive posttest.
Table 12
Number of Correct Responses on the Posttests

                    Productive             Receptive
Groups           Strict   Sensitive     Strict   Sensitive

Immediate posttest
Control   M      11.69      13.77       11.81      12.38
          SD      4.33       4.07        4.78       4.95
BS 4      M      12.31      14.12       13.58      14.12
          SD      5.27       4.57        4.15       4.05
BS 20     M      13.92      15.27       14.46      15.12
          SD      4.82       4.40        4.53       4.60

Delayed posttest
Control   M       1.62       3.00        6.81       7.08
          SD      1.92       2.65        4.72       4.86
BS 4      M       4.04       5.81        9.54       9.85
          SD      3.42       3.90        5.27       5.38
BS 20     M       3.73       6.04        9.62      10.04
          SD      2.20       3.61        5.59       5.77

Collapsed across the RIs
Control   M       6.65       8.38        9.31       9.73
          SD      2.76       2.90        4.36       4.45
BS 4      M       8.17       9.96       11.56      11.98
          SD      3.94       3.87        4.55       4.58
BS 20     M       8.83      10.65       12.04      12.58
          SD      2.80       3.27        4.70       4.76

Note. n = 26 for each group. The maximum score is 20 for each cell.
Table 13
Results of Two-Way ANOVAs for the Posttest Scores

                                      Strict scoring                     Sensitive scoring
Posttests    Effects           df       F       p   partial η²    df       F       p   partial η²
Productive   Treatment          2     3.13    .049     .08         2     3.10    .051     .08
             RI                 1   375.72    .000     .83         1   434.50    .000     .85
             Treatment x RI     2     1.61    .207     .04         2     2.52    .088     .06
Receptive    Treatment          2     2.94    .059     .07         2     3.00    .056     .07
             RI                 1   136.75    .000     .65         1   132.87    .000     .64
             Treatment x RI     2     0.57    .569     .01         2     0.55    .578     .01
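For readers who wish to relate the F values in Table 13 to the reported effect sizes, partial η² can be recovered from an F ratio and its degrees of freedom as F x df(effect) / (F x df(effect) + df(error)). The sketch below is illustrative only. Table 13 does not report the error degrees of freedom, so the value of 75 used here is an assumption (78 participants in three groups); the function name is likewise hypothetical.

```python
# Illustrative sketch: recovering partial eta squared from an F ratio.
# DF_ERROR = 75 is an assumption (78 participants - 3 groups); the error
# degrees of freedom are not reported in Table 13.

def partial_eta_squared(f_value, df_effect, df_error):
    """partial eta^2 = (F * df_effect) / (F * df_effect + df_error)."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


DF_ERROR = 75

# Productive posttest, strict scoring (F values and effect df from Table 13).
treatment = partial_eta_squared(3.13, 2, DF_ERROR)
ri = partial_eta_squared(375.72, 1, DF_ERROR)

print(round(treatment, 2))  # 0.08
print(round(ri, 2))         # 0.83
```

Under this assumption, the two computed values agree with the corresponding entries in the table after rounding.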
The Bonferroni method of multiple comparisons was used to investigate where the significant differences lay. The results of the multiple comparisons are summarised in Table 14. For instance, the table shows that with strict scoring on the immediate productive posttest, the difference between the control and BS 4 groups was not statistically significant (p = 1.000, d = 0.13). Table 14 also indicates the following four things. First, no statistically significant difference existed among the three groups on the immediate productive posttest regardless of the scoring procedure, producing no more than small effect sizes (0.08 < d < 0.49). Second, on the delayed productive posttest, the BS 4 and 20 groups significantly outperformed the control group with both strict and sensitive scoring, and large effect sizes were observed (0.84 < d < 1.02). Third, on the receptive posttest, no statistically significant difference existed among the three groups regardless of the scoring system or RI, showing medium or smaller effect sizes (0.04 < d < 0.60). Fourth, the difference between the BS 4 and 20 groups was rather small on both productive and receptive tests irrespective of the scoring system as indicated by the lack of statistical significance as well as the effect sizes (0.04 < d < 0.32), mirroring the findings of Experiment 1A. In summary, the posttest scores indicate that (a) the two experimental groups significantly outperformed the control group on the delayed productive posttest, but not on the other posttests, and (b) no significant difference existed between the BS 4 and 20 groups regardless of the posttest, scoring system, or RI.
Table 14
Results of Multiple Comparisons for Posttest Scores

                                                   Control            BS 4
RI          Posttests    Scoring     Groups       p       d        p       d
Immediate   Productive   Strict      BS 4       1.000   0.13
                                     BS 20       .299   0.49     .693    0.32
                         Sensitive   BS 4       1.000   0.08
                                     BS 20       .653   0.35    1.000    0.26
            Receptive    Strict      BS 4        .470   0.40
                                     BS 20       .081   0.60    1.000    0.24
                         Sensitive   BS 4        .546   0.38
                                     BS 20       .075   0.60    1.000    0.27
Delayed     Productive   Strict      BS 4        .004   0.87
                                     BS 20       .013   1.02    1.000    0.11
                         Sensitive   BS 4        .013   0.84
                                     BS 20       .006   0.96    1.000    0.06
            Receptive    Strict      BS 4        .180   0.56
                                     BS 20       .132   0.57    1.000    0.04
                         Sensitive   BS 4        .201   0.54
                                     BS 20       .118   0.58    1.000    0.06
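As an aid to interpreting the effect sizes in Table 14, the sketch below computes Cohen's d from two group means and SDs using a pooled SD for equal group sizes. This formula is an assumption made for illustration, as the method of calculating d is not stated in this section. The input values are the strict-scoring delayed productive posttest means and SDs from Table 12 (control: 1.62, SD 1.92; BS 4: 4.04, SD 3.42; BS 20: 3.73, SD 2.20), and the function name is hypothetical.

```python
import math


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d with a pooled SD, assuming equal group sizes (n = 26 here)."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return abs(mean_a - mean_b) / pooled_sd


# Delayed productive posttest, strict scoring (values from Table 12).
d_control_bs4 = cohens_d(1.62, 1.92, 4.04, 3.42)
d_control_bs20 = cohens_d(1.62, 1.92, 3.73, 2.20)

print(round(d_control_bs4, 2))   # 0.87
print(round(d_control_bs20, 2))  # 1.02
```

Under this assumption, the computed values agree with the d values reported in Table 14 for the same two comparisons.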
In the present experiment, lag to test was not controlled, and the first four blocks of items in the control treatment had greater lag to test than in the experimental treatments (see 2.3.3). The advantage of the experimental treatments, therefore, may be partly attributed to lag to test rather than spacing. In order to ascertain the extent to which lag to test affected learning, a possible relationship between the posttest performance and lag to test was analysed in the control and BS 4 groups. In both groups, the target words were studied in five blocks of four items (Items 1 - 4, 5 - 8, 9 - 12, 13 - 16, and 17 - 20; Figure 8). If lag to test had affected learning, the advantage of the BS 4 group would have been larger for items studied in earlier blocks, which were associated with greater lag to test in the control group, than those studied in latter blocks. In order to test this possibility, the posttest scores in the control and BS 4 groups were tabulated as a function of the block used during the treatment in Table 15. Table 15, for instance, shows that with strict scoring on the immediate productive posttest, out of four items studied in the first block (Items 1 - 4), 1.65 and 2.08 items were answered correctly on average by the control and BS 4 groups, respectively. As scores on the productive posttest were the main dependent variables in this experiment (see 2.3.4), the analysis was conducted only for the productive posttest scores.
Table 15
Number of Correct Responses on the Productive Posttest by Block During Learning

                                               Block during learning
Scoring     RI          Groups            1       2       3       4       5
Strict      Immediate   Control   M     1.65    2.04    2.42    2.58    3.00
                                  SD    1.20    1.18    1.36    1.24    0.94
                        BS 4      M     2.08    2.54    2.31    2.54    2.85
                                  SD    1.38    1.33    1.38    1.27    1.26
            Delayed     Control   M     0.31    0.46    0.23    0.31    0.31
                                  SD    0.62    0.76    0.51    0.55    0.55
                        BS 4      M     0.88    0.54    0.69    1.12    0.81
                                  SD    1.11    0.76    0.88    1.07    0.85
Sensitive   Immediate   Control   M     2.27    2.50    2.77    2.92    3.31
                                  SD    1.25    1.21    1.18    1.09    0.84
                        BS 4      M     2.62    2.88    2.58    2.92    3.12
                                  SD    1.13    1.24    1.27    1.13    1.07
            Delayed     Control   M     0.58    0.77    0.62    0.46    0.58
                                  SD    0.81    0.99    0.85    0.65    0.86
                        BS 4      M     1.08    1.08    1.12    1.42    1.12
                                  SD    1.23    1.06    0.99    1.14    0.95

Note. n = 26 for the control and BS 4 groups. The maximum score is 4 for each cell. Block 1 = Items 1 - 4; Block 2 = Items 5 - 8; Block 3 = Items 9 - 12; Block 4 = Items 13 - 16; Block 5 = Items 17 - 20 during learning. See Figure 8.
In order to determine whether any statistically significant difference existed between the two groups, the number of correct responses was entered into a three-way mixed design 2 (treatment: control / BS 4) x 5 (block during learning: 1 / 2 / 3 / 4 / 5) x 2 (RI) ANOVA. The ANOVA detected a significant interaction among the three variables, F (4, 200) = 3.41, p = .010, η² = .06 with strict scoring, and F (4, 200) = 2.91, p = .023, η² = .06 with sensitive scoring. The interaction between the treatment and block during learning was not significant, F (4, 200) = 0.70, p = .591, η² = .01 with strict scoring, and F (4, 200) = 0.62, p = .646, η² = .01 with sensitive scoring. Due to the significant three-way interaction, the simple main effect of treatment was tested to investigate where the significant differences lay. The results of the simple main effect tests are summarised in Table 16. The table, for instance, shows that with strict scoring on the immediate productive posttest, no statistically significant difference existed between the control and BS 4 groups for items studied in the first block, F (1, 500) = 2.10, p = .148, and a small effect size was found (d = 0.33). The results of Table 16 can be summarised as follows: First, on the immediate productive posttest, no statistically significant difference existed between the control and BS 4 groups regardless of the block in which target words were studied, and no more than small effect sizes were found (0.03 < d < 0.40 with strict and d < 0.31 with sensitive scoring). Because the BS 4 group did not significantly outperform the control group for items initially studied in earlier blocks, which were associated with greater lag to test in the control group, it might be reasonable to assume that differential lag to test had little effect on the immediate posttest scores.
Table 16
Results of Simple Main Effect of Treatment on the Productive Posttest

Scoring     RI          Block during learning       F        p       d
Strict      Immediate   1                          2.10     .148    0.33
                        2                          2.93     .088    0.40
                        3                          0.16     .693    0.08
                        4                          0.02     .895    0.03
                        5                          0.28     .599    0.14
            Delayed     1                          3.90     .049    0.64
                        2                          0.07     .792    0.10
                        3                          2.50     .115    0.64
                        4                          7.65     .006    0.95
                        5                          2.93     .088    0.70
Sensitive   Immediate   1                          1.39     .240    0.29
                        2                          1.71     .192    0.31
                        3                          0.43     .514    0.16
                        4                          0.00    1.000    0.00
                        5                          0.43     .514    0.20
            Delayed     1                          2.89     .090    0.48
                        2                          1.09     .296    0.30
                        3                          2.89     .090    0.54
                        4                         10.68     .001    1.04
                        5                          3.35     .068    0.59

Note. df = (1, 500).
Second, with strict scoring on the delayed productive posttest, the BS 4 group significantly outperformed the control group for items initially studied in the first and fourth blocks, and medium to large effect sizes were observed (0.64 < d < 0.95). No statistically significant difference existed for items studied in the second, third, and fifth blocks, showing medium or smaller effect sizes (0.10 < d < 0.70). The results indicate that the advantage of the BS 4 group was not particularly large for items initially studied in earlier blocks (first and second) compared with those studied in latter blocks (third, fourth, and fifth), suggesting that lag to test might have had little effect on the delayed productive posttest scores with strict scoring.
Third, when the sensitive scoring procedure was used, on the delayed productive posttest, the BS 4 group fared significantly better than the control group for items initially studied in the fourth block, showing a large effect size (d = 1.04). No statistically significant difference existed for items studied in the first, second, third, and fifth blocks, and small to medium effect sizes were found (0.30 < d < 0.59). Once again, the results indicate that the advantage of the BS 4 group was not particularly large for items initially studied in earlier blocks. Taken together, the results suggest that lag to test might have had little effect on the posttest scores. The superiority of the experimental groups, therefore, may mostly be attributed to spacing rather than lag to test.
Questionnaire
In the questionnaire given after the immediate posttest, participants were asked to indicate what they considered the optimal BS to be when studying 20 English words by choosing a number from 1 to 20. On average, the participants perceived BSs of 5.00 (3.54), 6.12 (5.07), and 7.58 (5.60) words to be most effective (SDs in parentheses) in the control, BS 4, and 20 groups, respectively. No statistically significant difference existed among the three groups in their responses, F (2, 77) = 1.87, p = .161, and the reported effect size was negligible (η² < .01). The results indicate that (a) learners tended to believe that a relatively small BS might be effective when studying 20 English words, and (b) the three groups did not differ significantly from each other in their responses.
2.3.7 Discussion
Learning phase performance
During retrieval practice, the control group produced the largest number of correct responses followed by the BS 4 group. As in Experiment 1A, the results may be ascribed to a difference in spacing between encounters (see 2.2.13). As the retrieval attempts occurred after a shorter ISI (3 trials) for the control group than for the other two groups (19 trials; Figure 8), the memory for the target items might have decayed more in the experimental groups at each retrieval attempt, possibly leading to the control group’s higher recall. It should be noted that the control group outperformed the BS 4 group on the first and third retrieval attempts as well, yielding medium to large effect sizes (0.69 < d < 1.26). The findings may be surprising considering that the ISIs before the first and third retrievals were exactly the same (3 trials; Figure 8) in both groups. The superiority of the control group might have arisen because the BS 4 group was exposed to a larger number of items in a shorter amount of time. Specifically, although the control group was not introduced to all 20 target items until the 95th trial (end of the first cycle of the fifth block [Items 17-20] in Figure 8; 11 filler trials + 4 trials × 21 cycles = 95th trial), the BS 4 group was exposed to all target items in the first 39 trials (end of Cycle 9 in Figure 7; 3 fillers + 4 trials × 9 cycles = 39th trial). As the BS 4 group was required to learn a larger number of items in less time than the control group, the task difficulty was perhaps higher for the former, potentially leading to the BS 4 group’s lower accuracy on the first and third retrievals despite the same ISIs. The advantage of the control group over the BS 4 group was larger on the third retrieval relative to the first. This is probably because the control group’s higher retrieval success on the first and second attempts facilitated the third retrieval (retrieval practice effect; see 2.1).
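The trial counts quoted above can be checked with a short calculation. The following is only an illustrative sketch: the function name and parameter names are hypothetical, and the figures are simply those given in the text (filler trials, trials per cycle, and cycles elapsed when the last target item is introduced).

```python
def trial_of_full_exposure(filler_trials: int, trials_per_cycle: int, cycles: int) -> int:
    """Return the trial number by which every target item has been introduced."""
    return filler_trials + trials_per_cycle * cycles


# Figures as quoted in the discussion (taken from the text, not re-derived from the design).
control = trial_of_full_exposure(filler_trials=11, trials_per_cycle=4, cycles=21)
bs4 = trial_of_full_exposure(filler_trials=3, trials_per_cycle=4, cycles=9)

print(control)        # 95
print(bs4)            # 39
print(control - bs4)  # 56
```

Under these figures, the BS 4 group had met all 20 items 56 trials earlier than the control group, which is the difference in exposure rate that the explanation above relies on.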
The results regarding the BS 4 and 20 groups were exactly the same as in Experiment 1A except on the third retrieval. The BS 4 group significantly outperformed the BS 20 group on the third retrieval in Experiment 1A, but not in this experiment. The incongruent results may stem from a difference in the retrieval format between the two experiments. More specifically, in the earlier experiment, the first and second retrievals used the receptive recall format, and the productive format was not used until the third retrieval attempt (see 2.2.7). As target words were never practised productively prior to the third retrieval, the BS 20 group in Experiment 1A might have performed rather poorly on the third attempt. In the BS 4 group, the productive format was not used until the third retrieval either. However, as the third retrievals occurred after a shorter interval (3 trials) than in the BS 20 group (19 trials; Figure 5), the BS 4 group might have significantly outperformed the BS 20 group on the third attempt in the previous experiment. The current experiment, in contrast, used only the productive recall format, and the target items had been practised in productive recall twice already prior to the third retrieval. The BS 20 group’s retrieval success on the first and second productive retrievals might have facilitated subsequent recall (Baddeley, 1997,
p. 112; Ellis, 1995), allowing the group to perform reasonably well on the third productive retrieval despite the rather long ISI. As a result, the advantage of the BS 4 group was perhaps smaller in comparison with Experiment 1A, possibly leading to the lack of statistical significance on the third attempt in this experiment.
Posttest performance
The present experiment demonstrated the superiority of the BS 4 and 20 groups over the control group. The advantage was particularly large on the delayed productive posttest, where the experimental groups significantly outperformed the control group, producing large effect sizes (0.84 < d < 1.02). The difference between the BS 4 and 20 groups was relatively small regardless of the posttest, scoring system, or RI (p > .693, 0.04 < d < 0.32), mirroring the results of Experiment 1A. In this experiment, the control and experimental treatments were not matched in lag to test. The post-hoc analysis, however, suggested that differential lag to test might have had little effect on the posttest scores (see 2.3.6). As a result, it might be reasonable to assume that the superiority of the experimental treatments was due mostly to spacing rather than lag to test. The findings of the current experiment seem to support all three hypotheses put forward in 2.2.13. Hypothesis 1 was supported because when spacing was equivalent, a large BS (BS 20 treatment) did not outperform a small BS (BS 4 treatment). Hypothesis 2 was also confirmed since a large BS (BS 20) outperformed a small BS with shorter spacing (control treatment). The results were also congruent with Hypothesis 3 as a small BS with longer spacing (BS 4) fared significantly better than a small BS with shorter spacing (control).
In relation to the second objective, the position of initial productive retrieval was controlled in the current experiment unlike in Experiment 1A. Despite this manipulation, the results regarding the BS 4 and 20 groups in this experiment were consistent with those of the previous experiment. The finding suggests that the difference in the position of initial productive retrieval apparently had little effect on the results of the earlier experiment. Taken together, Experiments 1A and 1B indicate that (a) as long as spacing is equivalent, BS may have little effect on learning (hence, BS 4 = 10 = 20 in Experiment 1A and BS 4 = 20 in Experiment 1B), and (b) spacing may have a larger effect on learning than BS (hence, BS 4 = 20 > control in Experiment 1B). The results of Experiment 1B also suggest that the lack of a significant effect in Experiment 1A was probably due to the fact that the three BSs were matched in spacing, rather than because of the limited range of BSs used, high task complexity, relatively large SDs on the posttest scores, or a Type II error (see 2.2.13). This is because when spacing was not equivalent, a BS of 20 words (BS 20 treatment) significantly outperformed a BS of four words (control treatment) in Experiment 1B.
Although the present experiment demonstrated the advantage of the BS 4 and 20 groups, the results were not consistent across the RIs or posttests. The experimental groups fared significantly better than the control group on the delayed productive
posttest, but not on the immediate productive test. The results may be in part explained by the ISI-RI interaction, which refers to a phenomenon where short spacing tends to be effective at a short RI, whereas long spacing tends to be effective at a long RI (e.g., Balota, Duchek, & Paullin, 1989; Bird, 2010; Cepeda, Pashler, Vul, Wixted, & Rohrer, 2006; Cepeda et al., 2008, 2009; Glenberg & Lehmann, 1980; Glenberg, 1976; Pashler, Rohrer, Cepeda, & Carpenter, 2007; Rohrer & Pashler, 2007). Due to the ISI-RI interaction, the treatments with longer spacing (BS 4 and 20) might have been particularly effective on the delayed posttest (see 4.1 for further details about the ISI-RI interaction).
The effects of the treatments were also conditional upon the type of posttest. While the superiority of the experimental groups was found on the productive posttest, no significant difference existed among the three groups on the receptive test. The results may be partially due to the test order. At each test administration, the productive posttest was given prior to the receptive test (see 2.2.10). As correct responses in the latter were used as cues in the former, the productive test might have affected performance on the receptive test, possibly diminishing a potential difference among the three groups. Alternatively, the direction of learning could be partially responsible for the results. In Experiment 1B, target words were practised only in a productive format. Because productive learning may have a greater effect on productive tests than receptive tests, significant differences might have been found only on the productive posttest.
Pedagogical and methodological implications
The present study indicated that spacing may have a larger effect on learning than BS. Pedagogically, the findings imply that introducing a large amount of spacing between encounters may be more important than using a particular BS, especially for improving long-term retention. As pointed out earlier (see 2.1), researchers, teachers, learners, and materials developers seem to believe that a small BS may enhance learning more than a large one. The results of the present study, however, indicate that (a) there is no magic BS and learners should study with what they are comfortable with, and (b) it may be useful to pay more attention to spacing rather than BS.
The findings of this study underscore the importance of spacing in L2 vocabulary learning. The results may have important pedagogical implications because although research shows that spacing may have a large effect on learning (e.g., Cepeda et al., 2006, 2009; Dempster, 1989; Janiszewski, Noel, & Sawyer, 2003), its benefits have not been exploited fully in traditional instructional settings (Cepeda et al., 2009; Dempster, 1988; Rohrer et al., 2005; Sobel, Cepeda, & Kapler, 2011). Furthermore, learners are often unaware that spacing increases learning (Hartwig & Dunlosky, 2011; Kornell, 2009; Wissman et al., 2012). Based on the results of this study, it may be useful to raise awareness of the importance of spacing.
Although BS was found to have little effect on posttest performance, the present study
nonetheless suggested possible advantages and disadvantages of using different BSs. One benefit of a small BS may be that it increases learning phase performance and thus motivates learners. In both experiments, a small BS produced significantly more correct responses during retrieval practice than a large one (Experiment 1A: 4 > 10 > 20; Experiment 1B: 4 > 20). As incorrect responses during learning may potentially demotivate learners (Fritz, 2010; Logan & Balota, 2008; Mondria & Mondria-de Vries, 1994), the use of a small BS may be more desirable. The results of the questionnaire, which showed that participants considered a relatively small BS to be more helpful for learning (see 2.2.12 and 2.3.6), may also indicate that a small BS has a positive effect on learners’ motivation. A disadvantage of using a small BS, however, may be that it could lead to under-learning. That is, a high probability of retrieval success caused by a small BS may create what Kornell (2009) refers to as ‘an illusion of effective learning’ (p. 1302), and learners may stop studying before lexical items are actually acquired, resulting in under-learning. Ideal flashcard software would keep a record of the learner’s performance on individual items and ensure that under-learning would not occur.
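As a rough illustration of what such record-keeping might look like, the sketch below logs each retrieval attempt per item together with the lag (in trials) since the previous encounter, and counts only successes achieved after a reasonably long lag. Everything here is hypothetical: the class and function names, the example words, and in particular the criterion (two correct retrievals after a lag of at least 10 trials) are assumptions chosen for illustration, not values proposed in this thesis.

```python
from dataclasses import dataclass, field


@dataclass
class ItemRecord:
    """Retrieval history for one flashcard (hypothetical structure)."""
    word: str
    # Each entry is (lag in trials since the previous encounter, was the response correct?).
    attempts: list[tuple[int, bool]] = field(default_factory=list)

    def log(self, lag: int, correct: bool) -> None:
        self.attempts.append((lag, correct))

    def spaced_successes(self, min_lag: int) -> int:
        """Count correct retrievals made after at least `min_lag` trials."""
        return sum(1 for lag, ok in self.attempts if ok and lag >= min_lag)


def needs_more_study(record: ItemRecord, min_lag: int = 10, required: int = 2) -> bool:
    """Flag an item as possibly under-learnt (illustrative criterion only)."""
    return record.spaced_successes(min_lag) < required


# Item A: three successes, but all after a short lag (as with a small BS and short spacing).
item_a = ItemRecord("item_a")
for _ in range(3):
    item_a.log(lag=3, correct=True)

# Item B: one failure, then two successes after a long lag.
item_b = ItemRecord("item_b")
item_b.log(lag=19, correct=False)
item_b.log(lag=19, correct=True)
item_b.log(lag=19, correct=True)

print(needs_more_study(item_a))  # True
print(needs_more_study(item_b))  # False
```

The point of the design is that a run of easy successes at short lags does not by itself release an item from study, which is one way software could guard against the illusion of effective learning described above.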
Another implication of this study is that learning phase performance may not necessarily be a good index of long-term retention (e.g., Bjork, 1994, 1999; Ellis, 1995; Pashler et al., 2003). In Experiment 1A, although the BS 4 group led to the best learning phase performance, no statistically significant difference was found among the three groups in their posttest scores. Similarly, in Experiment 1B, the control
treatment, which produced the largest number of correct responses during retrieval practice, turned out to be the least effective 1 week after the treatment. The results seem to be consistent with the desirable difficulty framework (Bjork, 1994, 1999; Schmidt & Bjork, 1992), according to which a treatment that increases the initial rate of acquisition does not always enhance long-term retention (see 3.1.2). Pedagogically, the findings indicate that it may be useful to raise awareness that making mistakes during learning is not necessarily a sign of ineffective learning. For research methodology, the findings suggest that it might not be valid to use learning phase performance as an index of long-term retention (Bjork, 1994, 1999; Ellis, 1995; Schmidt & Bjork, 1992). For instance, some empirical studies have used trials to criterion, or the number of trials required for learners to successfully recall a target item a given number of times during learning, as dependent measures (e.g., Higa, 1963; Tinkham, 1993, 1997; Waring, 1997b). However, considering that a treatment that maximises learning phase performance does not necessarily lead to superior long-term retention, other types of dependent variables such as posttest scores might be a more direct and valid measure of learning outcomes.
The findings of this study might apply not only to flashcard learning but also to vocabulary learning in general. For instance, suppose teachers want students to acquire 1,000 words over a year. One may wonder how these 1,000 items should be introduced. Would it be more effective to teach 30 words per week, for instance, or would it be more effective to introduce a larger number of items at once? The findings
of this study imply that the number of words studied per week may not have a very large effect on learning and that the amount of spacing could be a more important factor. Nonetheless, the issue of whether the results of this study also apply to instructional activities other than paired-associate learning awaits future research. Furthermore, as the duration of the present study was rather short, the findings of this study may not necessarily be applicable to treatments distributed over a longer period of time. In future research, it may be valuable to investigate the effects of BS and spacing over time.
Chapter 3. STUDY 2: RETRIEVAL FORMAT
In flashcard learning, retrieval increases learning (Barcroft, 2007; Cull, 2000; Karpicke & Roediger, 2007b, 2008; Karpicke, 2009; Landauer & Bjork, 1978; McNamara & Healy, 1995; Metcalfe & Kornell, 2007; Royer, 1973). Retrieval practice can be categorised into four types: receptive recall, productive recall, receptive recognition, and productive recognition (Laufer et al., 2004; Laufer & Goldstein, 2004). In receptive recall, learners are asked to produce the meaning of target words while in productive recall, they produce the target word form corresponding to the meaning provided. Receptive recognition requires learners to choose, rather than to produce, the correct meaning of target words from a number of options, whereas productive recognition requires learners to choose the target word form corresponding to the meaning provided (Laufer et al., 2004; Laufer & Goldstein, 2004). In order to optimise flashcard learning, it is necessary to determine which kinds of retrieval practice should be used. For instance, is productive retrieval more effective than receptive? Which is more effective, recall or recognition? Does a combination of recognition and recall increase learning more than either one alone?
Previous studies suggest that receptive retrieval promotes larger gains in receptive vocabulary knowledge while productive retrieval is beneficial for gaining productive knowledge (Griffin & Harley, 1996; Mondria & Wiersma, 2004; Steinel et al., 2007; Stoddard, 1929; Waring, 1997a; S. A. Webb, 2009a, 2009b). The findings imply that in order to gain both receptive and productive vocabulary knowledge efficiently, it
may be useful to practise receptive as well as productive retrieval (Griffin & Harley, 1996; Mondria & Wiersma, 2004; Nation & Webb, 2011, p. 41; Nation, 2001, p. 306; S. A. Webb, 2002, 2009b). As for the recall / recognition dichotomy, previous studies have found that recall may enhance learning more than recognition (Bjork & Whitten, 1974; Butler & Roediger, 2007; Carpenter & DeLosh, 2006; Duchastel, 1981; Foos & Fisher, 1988; Glover, 1989; Kang et al., 2007; McDaniel, Anderson, Derbish, & Morrisette, 2007). However, the existing studies examined the retention of reading materials, lecture materials, or known L1 words (see 3.1.2). Thus, it is not clear to what extent their findings may be applicable to L2 vocabulary learning. Van Bussel (1994, Experiment 2) looked into L2 vocabulary learning and constitutes the only exception. However, due to several limitations (see 3.1.2 for details), his study may not necessarily enable us to identify the optimal retrieval format for flashcard learning.
Study 2 investigated the effects of recognition and recall on L2 vocabulary learning to determine how this factor may influence flashcard learning. The following four learning conditions were compared: recognition, recall, combined, and highest difficulty. In the recognition condition, target items were practised in receptive and productive recognition formats, whereas the recall condition consisted of receptive and productive recall. In the combined condition, target items were studied in receptive recognition, productive recognition, receptive recall, and productive recall. In the highest difficulty condition, target items were learnt only in the most
demanding format, productive recall. By comparing the effectiveness of these four conditions, Study 2 aimed to find out which kinds of retrieval practice should be used in order to optimise flashcard learning.
3.1 Review of Literature
3.1.1 Effects of receptive and productive retrieval
Previous studies on receptive and productive retrieval indicate that receptive retrieval promotes larger gains in receptive vocabulary knowledge while productive retrieval is more beneficial for gaining productive knowledge (Griffin & Harley, 1996; Mondria & Wiersma, 2004; Steinel et al., 2007; Stoddard, 1929; Waring, 1997a; S. A. Webb, 2009a, 2009b). These findings may be explained by transfer-appropriate processing (TAP) theory (Bransford, Franks, Morris, & Stein, 1979; Morris, Bransford, & Franks, 1977), according to which performance is enhanced if the testing condition corresponds to that of learning. TAP theory predicts that in order to gain both receptive and productive vocabulary knowledge efficiently, it may be valuable to practise receptive as well as productive retrieval. The study done by Mondria and Wiersma (2004) offers support for this prediction. They compared the following three conditions: receptive, productive, and receptive + productive. Mondria and Wiersma found that the receptive + productive condition led to comparable gains in productive knowledge to the productive condition, and it was as effective as the receptive condition on a receptive posttest. Based on their findings, Mondria and Wiersma (2004), along with others (Griffin & Harley, 1996; Nation & Webb, 2011, p. 41; Nation, 2001, p. 306; S. A. Webb, 2002, 2009b), recommend that L2 words should be learnt both receptively and productively.
Previous studies also demonstrate that learning is bi-directional. In other words, learners can gain some receptive knowledge from productive retrieval and vice versa (Griffin & Harley, 1996; Mondria & Wiersma, 2004; Schneider et al., 2002; Steinel et al., 2007; Stoddard, 1929; Waring, 1997a; S. A. Webb, 2002, 2009a, 2009b). It has also been shown that productive retrieval leads to relatively large gains in receptive knowledge, whereas receptive retrieval results in only small gains in productive knowledge (Griffin & Harley, 1996; Mondria & Wiersma, 2004; Schneider et al., 2002; Steinel et al., 2007; Stoddard, 1929; Waring, 1997a; S. A. Webb, 2002, 2009a). These findings may be partially explained by the desirable difficulty framework (Bjork, 1994, 1999; Schmidt & Bjork, 1992), according to which more demanding tasks lead to better long-term retention and transfer than less demanding ones (see Ellis, 1995; Griffin & Harley, 1996; Schneider et al., 2002; Steinel et al., 2007, for a similar discussion). Transfer in this context refers to a situation where the testing condition does not correspond to that of learning. For instance, if target words are practised productively but tested receptively, the posttest measures transfer (Schneider et al., 2002). Because productive retrieval is more demanding than receptive (Griffin & Harley, 1996; Mondria & Wiersma, 2004; Schneider et al., 2002; Waring, 1997a), the desirable difficulty framework predicts that productive retrieval may facilitate transfer more than the latter. This may be partly the reason why
productive retrieval tends to be effective even on receptive posttests, where the testing condition does not match that of learning (Schneider et al., 2002).
In summary, previous studies on receptive and productive retrieval may allow us to make two recommendations regarding the optimal way to learn from flashcards. First, studies suggest that in order to gain both receptive and productive vocabulary knowledge efficiently, it may be useful to practise receptive as well as productive retrieval. Second, if only one direction has to be chosen, productive retrieval may be more desirable because it may result in adequate gains in receptive knowledge as well as large gains in productive knowledge.
3.1.2 Effects of recall and recognition
Theoretical background: Effects of recall and recognition
Research shows that recall may enhance learning more than recognition (e.g., Bjork & Whitten, 1974; Butler & Roediger, 2007; Carpenter & DeLosh, 2006; Duchastel, 1981; Foos & Fisher, 1988; Glover, 1989; Kang et al., 2007; McDaniel et al., 2007; see below for details). Several explanations have been offered for the advantage of recall over recognition, including the desirable difficulty framework, retrieval effort hypothesis, and generation effect. First, as described in 3.1.1, the desirable difficulty framework (Bjork, 1994, 1999; Schmidt & Bjork, 1992) states that more demanding tasks lead to better long-term retention and transfer than less demanding ones. This framework may also account for the benefits of recall over recognition because recall may be more difficult for the learner than recognition (Butler & Roediger, 2007; Kang et al., 2007; Laufer et al., 2004; Laufer & Goldstein, 2004). Second, the retrieval effort hypothesis (Pyc & Rawson, 2009), which is derived from the desirable difficulty framework, may also offer explanations for the superiority of recall. According to this hypothesis, the degree to which a successful retrieval enhances memory increases with the difficulty of the retrieval practice (see Karpicke & Bauernschmidt, 2011; Karpicke & Roediger, 2010; Logan & Balota, 2008, for a similar discussion). As recall may require greater retrieval effort than recognition on the part of learners (Butler & Roediger, 2007; Carpenter & DeLosh, 2006; Kang et al., 2007), the retrieval effort hypothesis also predicts that recall may be more effective than recognition. Third, Kang et al. (2007) explain the advantage of recall in terms of the generation effect (Slamecka & Graf, 1978). The generation effect refers to the phenomenon where generating a response yields superior long-term retention to mere presentation. According to this effect, recall may facilitate learning more than recognition because the former involves generation while the latter does not. For instance, while receptive recall requires learners to produce the meaning of target words, receptive recognition requires learners to choose, rather than to produce, the correct meaning of target words and does not involve generation.
Studies using reading and lecture materials
Empirical studies on reading (Duchastel, 1981; Foos & Fisher, 1988; Glover, 1989; Kang et al., 2007; McDaniel et al., 2007) and lecture materials (Butler & Roediger, 2007) have corroborated the view that recall may facilitate learning more than recognition. Butler and Roediger (2007), for instance, investigated the effects of recall and recognition on the retention of lecture materials. There were four learning conditions in their experiment: multiple-choice, short answer, study, and control. The multiple-choice condition used a recognition format while the short answer condition used a recall format. In all four conditions, participants first saw a videotaped lecture on art history. In the multiple-choice condition, participants answered multiple-choice (recognition) comprehension questions about the lecture. In the short answer condition, participants answered the same comprehension questions as in the multiple-choice condition. However, they were not given multiple-choice options and were required to generate a response by themselves. In the study condition, participants read a summary of the lecture after watching it. In the control condition, participants only watched the lecture, and no additional treatment was given. Approximately 1 month after the treatment, participants took a posttest, which consisted of short-answer (recall) comprehension questions about the art history lecture. Butler and Roediger (2007) discovered that (a) the short answer condition was the most effective among the four conditions, and (b) the multiple-choice and study conditions significantly outperformed the control condition although they were significantly less effective than the short answer condition. Their findings suggest that recall (short answer) may enhance the retention of lecture materials more than recognition (multiple-choice). Duchastel (1981) conducted a similar study using a history text and also found the superiority of recall over recognition.
One possible limitation of Butler and Roediger (2007) and Duchastel (1981), though, may be that they administered only a recall posttest. Thus, it is not clear to what extent their findings were due to TAP (Bransford et al., 1979; Morris et al., 1977). In other words, it is possible that a recall task fared better than recognition simply because the testing format (recall posttest) matched that of learning. In order to separate the effects of the retrieval format (i.e., recall and recognition) and TAP, it may be valuable to measure learning by a recognition posttest as well. Foos and Fisher (1988), Glover (1989), Kang et al. (2007), and McDaniel et al. (2007) also demonstrated that recall may facilitate the retention of reading materials more than recognition. Unlike Butler and Roediger (2007) and Duchastel (1981), these four studies gave both recall and recognition posttests (Foos & Fisher, 1988; Glover, 1989; Kang et al., 2007) or a recognition posttest (McDaniel et al., 2007) and found the advantage of recall. These studies indicate that recall may be more effective than recognition regardless of the type of posttest and that the superiority of recall may be independent of TAP.
L1 and L2 vocabulary studies
One may wonder if recall also enhances the retention of vocabulary more than recognition. Four studies compared the efficacy of recognition and recall on the retention of L1 (Bjork & Whitten, 1974; Carpenter & DeLosh, 2006; Clariana & Lee, 2001) or L2 vocabulary (Van Bussel, 1994, Experiment 2). Compared with the previous studies on reading and lecture materials, the vocabulary studies found only a limited advantage of recall. In Carpenter and DeLosh (2006, Experiment 1), for instance, 70 American undergraduate students were asked to remember 128 common English nouns. The target words were presented in 16 blocks of eight words. The presentation of each block was followed by one of the following four tasks: recognition, free recall, cued recall, and restudy. In the recognition task, participants were given a list of words and indicated which words had appeared in the presentation. The free recall task required participants to write down as many target words as they remembered. In the cued recall task, participants were given the first letter of each target word together with the number of letters and asked to complete the target word. In the restudy condition, participants were presented with the target words for the second time. Five minutes after the treatment, three kinds of posttests were administered: recognition, free recall, and cued recall. The posttests were exactly the same as the corresponding tasks in the treatment except that the recognition posttest used different distractors from the recognition task. Results can be summarised as follows: (a) on the recognition posttest, no statistically significant difference existed among the four conditions, (b) on the free recall posttest, the free recall, cued recall, and restudy conditions significantly outperformed the recognition condition, and (c) on the cued recall posttest, the free recall and restudy conditions fared significantly better than the cued recall and recognition conditions.
Based on these findings, Carpenter and DeLosh (2006) argue that (a) recall may
enhance the retention of vocabulary more than recognition, and (b) the advantage of recall may not be due to TAP. According to TAP theory, learning in a recall condition should be superior to learning in a recognition condition on a recall posttest, and learning through recognition should be superior to learning through recall on a recognition posttest. However, in their experiment, the recognition condition was no more effective than the free or cued recall condition on the recognition posttest. Based on these findings, Carpenter and DeLosh argue that recall may enhance retention more than recognition because it involves more elaborative processing than recognition, rather than because of TAP. Bjork and Whitten (1974, Experiment 3) also found that known L1 words practised in a free recall task were recalled 35% more than those practised in a recognition task. Bjork and Whitten, however, did not conduct statistical analysis, and it is not clear whether the difference was statistically significant. Another limitation may be that they gave only a recall posttest, and their findings may be partially due to TAP.
Clariana and Lee (2001) looked into the learning of technical meanings of known L1 words. In their study, 133 American graduate students studied 35 technical words in the field of instructional design (e.g., goal, transfer, cluster analysis) with a computer program. The participants were assigned to one of the following three treatments: recognition, recall, and combined. In the recognition treatment, participants were presented with a definition of a target word (e.g., ‘a broad statement of instructional intent expressed as what learners will be able to do’ for goal; Clariana & Lee, 2001, p.
27) and asked to choose the most appropriate technical word from multiple-choice options. Participants in the recall group were required to type the appropriate word corresponding to the definition. The combined treatment used a combination of recognition and recall formats. More specifically, each target item was initially practised in a multiple-choice format. Once participants had chosen their response, the same item was immediately practised in a recall format. Learning was measured by a recall posttest. Clariana and Lee found that (a) the recall group scored higher than the recognition group although the difference was not statistically significant, and (b) the combined group significantly outperformed the recognition group, but not the recall group. Their findings may be significant in that they indicate that a combination of recognition and recall may increase learning more than recognition alone. One limitation of their study, though, may be that the recognition and combined treatments were not controlled for the frequency of retrievals. While there were two retrieval attempts per target word in the combined group (one multiple-choice + one recall), target items were practised only once in the recognition group. The superiority of the combined group, therefore, may be partly due to the difference in the retrieval frequency rather than the retrieval format.
Van Bussel (1994, Experiment 2) investigated the effects of recall and recognition on L2 vocabulary learning. In his study, 32 speakers of Dutch studied 40 English words. The participants were randomly assigned to the recall and recognition conditions. At the beginning of the treatment, participants were presented with the target words.
After the initial presentation, in the recall condition, participants were presented with a cloze sentence and asked to supply an appropriate target word to complete the sentence. In the recognition condition, participants were presented with a sentence containing a target word and judged whether the target word was used appropriately in the sentence. Feedback was provided after each response in both conditions. Learning was measured by recall and recognition posttests. Van Bussel (1994) did not find any significant difference between the recall and recognition conditions in their posttest scores.
Even though the findings of Van Bussel (1994) are very valuable, his study may suffer from at least two limitations. First, because vocabulary was practised in a cloze format instead of a paired-associate format, his findings may not necessarily be applicable to flashcard learning. Second, Van Bussel does not provide detailed information about his methodology or results. For instance, his method section does not give information such as the following: (a) what kind of information was provided as feedback (e.g., L2 target word, L1 translation, or cloze sentence), (b) whether the posttest was receptive or productive, (c) what the interval was between the treatment and posttest, (d) what was given as the cue in the posttest (e.g., L2 target word, L1 translation, or cloze sentence), and (e) how many options were used in the recognition posttest. In the results section, Van Bussel does not provide the mean, SD, F value, p value, or effect size for the comparison of the recall and recognition conditions either. The lack of sufficient information regarding the methodology and results makes interpretation of
his study difficult.
In summary, compared with the previous studies on reading and lecture materials, the L1 and L2 vocabulary studies demonstrated rather limited benefits of recall. In Carpenter and DeLosh (2006), recall fared significantly better than recognition on the recall posttests. Yet, the difference did not reach statistical significance on the recognition posttest. Bjork and Whitten (1974) showed that recall outperformed recognition although the difference might not have been statistically significant. Clariana and Lee (2001) found that the recall group scored higher than the recognition group, but the difference was not statistically significant. Van Bussel (1994) did not find any significant difference between recall and recognition.
Even though the findings of Bjork and Whitten (1974), Carpenter and DeLosh (2006), and Clariana and Lee (2001) are very valuable, their results may need to be interpreted with caution because they examined the retention of L1 vocabulary, and their findings may not necessarily be applicable to L2 vocabulary learning. Van Bussel (1994, Experiment 2) looked into L2 vocabulary learning and constitutes an exception. However, due to several limitations (see above), his study may not necessarily allow us to identify the optimal retrieval format for L2 vocabulary learning from flashcards. With the limitations of the existing studies in mind, Study 2 investigated the effectiveness of the recognition and recall formats on L2 vocabulary learning in order to determine how to optimise flashcard learning.
Effects of a combination of recall and recognition
Although most previous studies examined the effectiveness of either recall or recognition, it may also be useful to investigate whether a combination of recognition and recall (hereafter referred to as the combined treatment) increases learning more than either one alone. The combined treatment may be effective according to the retrieval practice effect and the retrieval effort hypothesis. The retrieval practice effect (Baddeley, 1997, p. 112; Ellis, 1995) refers to the phenomenon where a successful retrieval from memory yields superior long-term retention to an unsuccessful retrieval (see 2.1). As described earlier, the retrieval effort hypothesis (Pyc & Rawson, 2009) states that the degree to which a successful retrieval enhances memory increases with the difficulty of the retrieval practice. Taken together, the retrieval practice effect and retrieval effort hypothesis imply that when a new lexical item is introduced, it should be practised in a relatively easy format. Otherwise, retrieval may be unsuccessful, and learners may not be able to benefit from the positive effects of retrieval success. The second retrieval can be more difficult than the first one because success on the initial retrieval may facilitate subsequent retrieval, allowing the target word to be successfully retrieved in a more demanding format. Similarly, retrieval difficulty can be gradually increased as learning proceeds.
The above discussion suggests that it may be most effective to make retrieval progressively difficult because it may help increase retrieval effort while at the same time, facilitating successful retrieval (Finley et al., 2011; Karpicke & Roediger, 2010; Logan & Balota, 2008). This may be achieved by using a combination of recall and recognition. Because recognition has been found to be easier than recall (Butler & Roediger, 2007; Kang et al., 2007; Laufer et al., 2004; Laufer & Goldstein, 2004), by offering recognition questions in earlier stages and introducing a recall format later, the combined treatment may allow learners to benefit from the positive effects of both retrieval success and retrieval effort. Using only recognition formats, in contrast, may not be ideal because although it may facilitate successful retrieval, retrieval may not be as effortful. This is not effective according to the retrieval effort hypothesis. At the same time, using only recall formats may be less effective than the combined treatment because although retrieval may be effortful, it may not necessarily produce high retrieval success. This is not desirable based on the retrieval practice effect. From a pedagogical perspective, it may be valuable to investigate whether the combined treatment maximises L2 vocabulary learning.
As described earlier, Clariana and Lee's (2001) study offers empirical support for the prediction that the combined treatment may increase learning more than recognition alone. However, the recognition and combined treatments in their study were not controlled for the frequency of retrievals, and the superiority of the combined treatment may be partially due to the difference in the retrieval frequency (see above). Furthermore, as Clariana and Lee (2001) looked into the learning of L1 technical vocabulary, their findings may not necessarily be applicable to L2 vocabulary learning.
In order to examine the value of the combined treatment for flashcard learning, the present study compared the effectiveness of recognition, recall, and the combined treatment on L2 vocabulary learning while controlling for retrieval frequency.
Effects of using only productive recall
In addition to recognition, recall, and the combined treatment, it may also be useful to examine the effects of using only a productive recall format (hereafter, a treatment that involves only productive recall will be referred to as the highest difficulty treatment because productive recall is more difficult than receptive recognition, productive recognition, and receptive recall; Laufer et al., 2004; Laufer & Goldstein, 2004). There exist conflicting views about the effectiveness of the highest difficulty treatment. First, as productive recall tends to produce a low rate of retrieval success (Laufer et al., 2004; Laufer & Goldstein, 2004), the highest difficulty treatment may not be effective based on the retrieval practice effect (e.g., Baddeley, 1997, p. 112; Ellis, 1995; see 2.1). Second, TAP theory (Bransford et al., 1979; Morris et al., 1977) predicts that the highest difficulty treatment may not enhance the acquisition of receptive vocabulary knowledge because it consists only of productive retrieval. Previous studies on receptive and productive learning may also support this prediction (see 3.1.1).
In contrast, the desirable difficulty framework (see 3.1.1) suggests that the highest difficulty treatment may be effective. Because productive recall is more demanding than receptive recognition, productive recognition, and receptive recall (Laufer et al., 2004; Laufer & Goldstein, 2004), the highest difficulty treatment, which consists only of productive recall, may be most effective according to the desirable difficulty framework. The desirable difficulty framework also partially contradicts the retrieval practice effect and TAP theory. First, unlike the retrieval practice effect, the desirable difficulty framework states that retrieval success may not necessarily be a reliable index of long-term retention. The highest difficulty treatment, therefore, may turn out to be effective despite a low level of retrieval success if the positive effects of difficult retrievals outweigh the negative effects of retrieval failures (e.g., Bjork, 1994, 1999; Karpicke & Bauernschmidt, 2011; Pashler et al., 2003; Schmidt & Bjork, 1992; Schneider et al., 2002).
Second, unlike TAP theory (Bransford et al., 1979; Morris et al., 1977), the desirable difficulty framework suggests that the highest difficulty treatment may enhance the acquisition of not only productive but also receptive knowledge. This is because a difficult learning condition may facilitate transfer according to the desirable difficulty framework (Bjork, 1994, 1999; Schmidt & Bjork, 1992). In summary, whereas the retrieval practice effect and TAP theory suggest that the highest difficulty treatment may not be conducive to learning, it may be effective according to the desirable difficulty framework. In order to test these conflicting theories, the effects of the highest difficulty treatment were also examined in the current study.
3.2 The Present Study
The review of literature indicates that recall may enhance retention more than recognition. Recall is also found to lead to large gains in knowledge on a recognition posttest, and the advantage of recall might be independent of TAP. However, the existing studies examined the retention of reading materials, lecture materials, or known L1 words. Thus, it is not clear to what extent their findings may be applicable to L2 vocabulary learning. Van Bussel (1994, Experiment 2) looked into L2 vocabulary learning and constitutes the only exception. Yet, due to several limitations, his study may not necessarily enable us to identify the optimal retrieval format for flashcard learning (see 3.1.2). Study 2 investigated the effects of recognition and recall formats on L2 vocabulary learning in a paired-associate format to determine how this factor may influence flashcard learning.
Study 2 also examined the effectiveness of the combined treatment (a combination of recognition and recall). The retrieval practice effect and the retrieval effort hypothesis suggest that a combined treatment may maximise learning. The study conducted by Clariana and Lee (2001) partly supports this prediction (see 3.1.2). Yet, as their study looked into the learning of L1 technical vocabulary, it is not clear to what extent their findings may be applicable to L2 vocabulary learning. In order to investigate the value of the combined treatment for flashcard learning, the effects of the combined treatment were also examined in this study.
Study 2 also tested the effectiveness of the highest difficulty treatment, where L2 words are practised only in productive recall. According to the desirable difficulty framework, the highest difficulty treatment may enhance learning. At the same time, the retrieval practice effect and TAP theory predict that the effectiveness of the highest difficulty treatment may be limited (see 3.1.2). In order to test these conflicting theories, this study also investigated the efficacy of the highest difficulty treatment. By comparing the effects of recognition, recall, the combined treatment, and the highest difficulty treatment, the present study may help us to determine which kinds of retrieval practice should be used in order to optimise flashcard learning.
Research questions
The following three research questions were addressed in this study:
1. Which is more effective for L2 vocabulary learning, recall or recognition?
2. Is a combined treatment more effective than recall or recognition alone for L2 vocabulary learning?
3. Is the highest difficulty treatment effective?
3.3 Pilot Studies
Pilot studies were conducted with 16 native speakers of English in New Zealand to identify any potential problems with the methodology of this study. In the pilot studies, a ceiling effect was observed on the recognition posttests. In order to increase task difficulty and reduce the probability of a ceiling effect, four changes were made to the methodology. First, the number of target items was increased from 40 to 60. Second, the number of encounters with target words during the treatment was reduced from six to five. Third, the duration of the initial presentation was reduced from 8 to 7 seconds per item. Fourth, the duration of the filler task after the treatment was increased from 10 questions to 5 minutes.
In order to determine how long the duration of feedback should be, two types of feedback were used in the pilot studies: computer-paced and self-paced feedback. In the former, the feedback duration was fixed to 5 seconds per response as in Study 1. In the latter, feedback was self-paced by participants, and participants were allowed to close the feedback window before 5 seconds elapsed. Pilot studies suggested that (a) when feedback is self-paced, learners may spend more time on feedback after recall questions than recognition, and (b) computer-paced feedback may increase posttest scores compared with self-paced feedback. Based on the results of the pilot studies, both types of feedback were used in the main data collection by manipulating the feedback pacing between participants. In other words, feedback was self-paced for half of the participants, whereas the feedback duration was fixed for the other half. This is because there seem to be advantages and disadvantages to both types of feedback.
Self-paced feedback may have three benefits. First, it may reduce the probability of a ceiling effect because pilot studies indicated that self-paced feedback may be of a shorter duration than computer-paced feedback and decrease posttest scores. Second, self-paced feedback may increase ecological validity. Pilot studies suggested that recall may require a longer feedback duration than recognition. This is probably because recall questions produced more incorrect responses during learning compared with recognition, encouraging participants to study feedback more carefully. The longer feedback duration after recall relative to recognition may be representative of authentic flashcard learning. As a result, self-paced feedback may increase ecological validity by allowing learners to spend more time on feedback after recall questions. Third, self-paced feedback may also reflect authentic computer-based learning. In eight of the nine flashcard programs surveyed by Nakata (2011), feedback is self-paced by learners. Self-paced feedback, hence, may increase ecological validity because it appears to be a common feature among existing flashcard software.
A disadvantage of self-paced feedback may be that it might not be possible to separate the effects of the retrieval format (i.e., recall or recognition) and feedback duration. As noted above, if feedback is self-paced, recall may lead to a longer feedback duration than recognition. If the duration of feedback is not equivalent, the results of the present study may be attributed in part to a possible difference in the feedback duration rather than the retrieval format per se. Given the advantages and disadvantages to self- and computer-paced feedback, both types of feedback were used in the main data collection (see 3.4.4).
3.4 Method
3.4.1 Participants
The participants were 64 students at Victoria University of Wellington, New Zealand. All of them were native speakers of English. Their mean age was 20.77 years (SD = 3.91). Fifteen were male, and 49 were female. None of the participants had prior knowledge of Swahili, the target language in this study. Participants received a $20 NZD shopping voucher in exchange for their participation.
3.4.2 Experimental design
There were three independent variables in the current study. The first independent variable was the type of learning condition: recognition, recall, combined, and highest difficulty. The second independent variable was the retention interval (hereafter, RI): immediate and 1-week delayed posttests. The third independent variable was the pacing of feedback: self-paced and computer-paced. The present study employed a mixed design. The type of learning condition and RI were within-participant variables, whereas the feedback pacing was a between-participant variable. Figure 9 summarises the design of this study.
Figure 9. Design of Study 2.
Note. n = 8 for each subgroup. (a) Each item set consisted of 15 items. (b) Difficulty = highest difficulty condition.
Sixty-four participants were randomly assigned to the computer-paced and self-paced groups. Each group consisted of 32 participants. Feedback was paced by participants in the self-paced group, whereas the feedback duration was fixed to 5 seconds per response for the computer-paced group (see 3.4.4). The participants in each group were randomly divided into four subgroups of eight participants. The computer-paced group was divided into Subgroups 1 - 4, and the self-paced group was divided into Subgroups 5 - 8. Sixty target word pairs were also divided into four sets of 15 items (hereafter, Sets A, B, C, and D). The four subgroups of participants in each feedback pacing group studied different sets of word pairs under different conditions, thus counterbalancing the effects of target items (see Figure 9). In other words, Subgroup 1 studied Set A under the recognition, Set B under the recall, Set C under the combined, and Set D under the highest difficulty conditions. Subgroup 2 studied Set B under the recognition, Set C under the recall, Set D under the combined, and Set A under the
highest difficulty conditions, and so forth. The dependent variables were effectiveness and efficiency of the four learning conditions. Effectiveness was measured by the number of correct responses on the posttest. Efficiency was defined as the number of words acquired per minute and calculated by dividing the posttest score by the study time (e.g., Kornell, 2009; Mondria, 2003; Pyc & Rawson, 2007).
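The counterbalancing scheme described above amounts to rotating the item sets across the four conditions. The following is a minimal sketch rather than the procedure actually implemented in the study: the function names are hypothetical, and it assumes that Subgroups 3 and 4 continue the rotation given for Subgroups 1 and 2 and that Subgroups 5 - 8 repeat the same rotation within the self-paced group. The last function simply restates the efficiency measure (posttest score divided by study time in minutes).

```python
CONDITIONS = ["recognition", "recall", "combined", "highest difficulty"]
ITEM_SETS = ["A", "B", "C", "D"]


def sets_for_subgroup(subgroup):
    """Return {condition: item set} for a subgroup numbered 1-8.

    Assumption: the rotation shown for Subgroups 1 and 2 continues for
    Subgroups 3 and 4 and is repeated for Subgroups 5-8.
    """
    shift = (subgroup - 1) % len(ITEM_SETS)
    return {
        condition: ITEM_SETS[(i + shift) % len(ITEM_SETS)]
        for i, condition in enumerate(CONDITIONS)
    }


def pacing_for_subgroup(subgroup):
    """Subgroups 1-4 received computer-paced feedback, 5-8 self-paced."""
    return "computer-paced" if subgroup <= 4 else "self-paced"


def efficiency(posttest_score, study_minutes):
    """Words acquired per minute: posttest score divided by study time."""
    return posttest_score / study_minutes


for subgroup in range(1, 9):
    print(subgroup, pacing_for_subgroup(subgroup), sets_for_subgroup(subgroup))
```

Under this rotation, every item set is studied under every condition exactly once within each feedback pacing group, which is what allows item effects to be counterbalanced.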
3.4.3 Procedure
The experiment was conducted with a computer program developed by the author with Microsoft Visual Basic for Excel Version 7.0. The experiment consisted of two sessions. Session 1 comprised the explanation, practice period, treatment, filler task, immediate posttest, and questionnaire. A pretest was not given because none of the participants had prior knowledge of any of the target Swahili words. In Session 2, which was conducted 1 week after Session 1, the delayed posttest was administered. Session 1 took approximately 90 minutes, and Session 2 required approximately 20 minutes.
Session 1
At the beginning of Session 1, participants received an explanation about the study. They were also given participant information sheets and asked to sign consent forms if they chose to participate. After that, participants practised using the flashcard program with four sample word pairs. The practice was followed by the treatment, where participants studied 60 Swahili-English word pairs using a flashcard program (see 3.4.4). After the treatment, participants answered two-digit additions (e.g., 53 + 49 = ?) as a filler task for 5 minutes. The duration of the filler task (5 minutes) was longer than that of Study 1 (10 questions) in order to prevent a ceiling effect on the recognition posttests (see 3.3). Other than the duration, the filler task in this study was exactly the same as in Study 1. The immediate posttest was given after the filler task. There were four types of posttests: productive recall, productive recognition, receptive recall, and receptive recognition (see 3.4.6). The immediate posttest was followed by a questionnaire. In the questionnaire, participants evaluated the usefulness of the four kinds of retrieval formats (receptive recognition, productive recognition, receptive recall, and productive recall) for learning on a 7-point scale, where 1 means Not helpful at all and 7 means Very helpful.
Session 2
In order to measure retention, the delayed posttest was administered 1 week after the treatment. As in the immediate posttest, the following four types of tests were given: productive recall, productive recognition, receptive recall, and receptive recognition. The delayed posttest was administered without prior notice so that participants would not review the target words during the period between the treatment and delayed posttest.
3.4.4 Treatment
In the treatment, 60 Swahili-English word pairs were studied with a flashcard program. There were five cycles of 60 items, and each item was encountered only once in each cycle. The items from four sets (Sets A to D) occurred once every four items (e.g., ABCD ABCD ABCD …). This was done to ensure that the trials for the four item sets would be distributed roughly equally across the treatment because the positions of the trials might affect learning (e.g., Delaney et al., 2010; Karpicke & Roediger, 2007a).
In the first cycle, the target Swahili and English words were presented simultaneously for 7 seconds per word pair (initial presentation). Participants were asked to study both the Swahili word and its English translation. Each word pair was presented only once in the initial presentation. After 7 seconds, the program automatically proceeded to the next item. From the second cycle, target items were practised in a recall or recognition format. There were four retrieval attempts for each of the 60 items. Target items were studied in a different format depending on the condition to which they were assigned. Figure 10 summarises the retrieval formats used in each condition. In the recognition condition, target items were practised in a receptive recognition format twice and then a productive recognition format twice. The recall condition consisted of two receptive recall questions followed by two productive recall questions. In the combined condition, target items were studied once in each of the receptive recognition, productive recognition, receptive recall, and productive recall formats in that order. In the highest difficulty condition, target items were practised four times in productive recall.
Figure 10. Retrieval formats in the four conditions.
The first three conditions involved both receptive and productive retrieval. This is because previous studies suggest that in order to gain both receptive and productive vocabulary knowledge efficiently, learners need to practise receptive as well as productive retrieval (see 3.1.1). In the first three conditions, questions were arranged in order of increasing difficulty. More specifically, receptive formats preceded productive formats, and recognition formats were followed by recall. This is because the retrieval practice effect and the retrieval effort hypothesis indicate that gradually increasing retrieval effort may maximise learning (see 3.1.2). The order of the formats is based on Laufer et al. (2004) and Laufer and Goldstein (2004). Note that all four conditions had the same number of retrieval attempts (four) during the treatment. This was done in order to avoid confounding the effect of the retrieval format with retrieval frequency as in Clariana and Lee’s (2001) study (see 3.1.2).
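The four retrieval schedules described above can be written down compactly, which makes it easy to confirm that every condition involves four retrieval attempts and that no schedule ever moves to an easier format. The sketch below is illustrative only; the names are hypothetical, and the difficulty ranking is the ordering assumed in this study on the basis of Laufer et al. (2004) and Laufer and Goldstein (2004).

```python
# Retrieval formats, listed from least to most difficult (assumed ordering).
FORMATS = [
    "receptive recognition",
    "productive recognition",
    "receptive recall",
    "productive recall",
]
RANK = {name: rank for rank, name in enumerate(FORMATS)}
RR, PR, RC, PC = FORMATS

# Formats used for the four retrieval attempts in each condition.
SCHEDULES = {
    "recognition": [RR, RR, PR, PR],
    "recall": [RC, RC, PC, PC],
    "combined": [RR, PR, RC, PC],
    "highest difficulty": [PC, PC, PC, PC],
}


def never_gets_easier(schedule):
    """True if each attempt is at least as difficult as the previous one."""
    ranks = [RANK[name] for name in schedule]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))


def format_in_cycle(condition, cycle):
    """Format used in treatment cycle 1-5 (cycle 1 is the initial presentation)."""
    if cycle == 1:
        return "initial presentation"
    return SCHEDULES[condition][cycle - 2]


for condition, schedule in SCHEDULES.items():
    print(condition, len(schedule), never_gets_easier(schedule))
```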
As shown in Figure 11, four kinds of formats were used during retrieval practice: receptive recognition, productive recognition, receptive recall, and productive recall. In receptive recognition, participants were presented with a Swahili word and asked to select the most appropriate English translation from four options (Figure 11, 1). If participants were not sure about the correct answer, they were instructed to choose the I DON’T KNOW option. After each response, feedback was provided to the participants (see below for details). In productive recognition, participants were presented with an English word and asked to pick the most appropriate Swahili translation from among four alternatives (Figure 11, 2). Other than that, the productive recognition format was exactly the same as the receptive recognition format. The productive and receptive recall formats were exactly the same as in Experiment 1A (see 2.2.7) except that Swahili-English word pairs were used instead of English-Japanese word pairs (Figure 11, 3 & 4).
Figure 11. Examples of the four kinds of retrieval formats: (1) receptive recognition, (2) productive recognition, (3) receptive recall, and (4) productive recall.
After each response, feedback was provided to the participants (Figure 12). The pacing of feedback was manipulated between participants. For the computer-paced group, the feedback duration was fixed to 5 seconds per response. The duration was set to 5 seconds because previous studies (Cepeda et al., 2009; Hays et al., 2010; Pashler et al., 2003) as well as Study 1 and pilot studies suggest that 5 seconds is sufficient for learning. The self-paced group was allowed to close the feedback
window before 5 seconds elapsed by pressing the Enter key or clicking the left button on the mouse. If participants did not do so within 5 seconds, the feedback window closed automatically, and the program proceeded to the next item.
Figure 12. Feedback for a correct response (left) and an incorrect response (right) in the self-paced group.
As noted above, each target item was encountered five times throughout the treatment (initial presentation + four retrieval attempts). The number of encounters was set to five for three reasons. First, the combined condition in the present study involved four types of retrieval: receptive recognition, productive recognition, receptive recall, and productive recall. Therefore, there need to be at least four retrieval attempts for each target word. This requires the minimum number of encounters to be five (initial presentation + four retrieval attempts). Second, Crothers and Suppes (1967, Experiments 8 & 9) suggest that 85 - 88% of the items in their study were acquired after six (Experiment 9) or seven encounters (Experiment 8). Considering that the
current study involved fewer items (60) than Crothers and Suppes (108 in Experiment 8 and 216 in Experiment 9), six encounters may lead to a ceiling effect, reducing the potential of showing a difference between conditions. Third, in pilot studies where items were encountered six times, a ceiling effect was observed on the recognition posttests. The number of encounters, hence, needed to be fewer than six. With these considerations in mind, the target items were encountered five times throughout the treatment.
Multiple-choice options in the recognition format
In the recognition format (Figure 11), participants selected the most appropriate response from four alternatives (one correct response and three distractors). The number of distractors was set to three in order to reflect a real-life study situation: Among nine flashcard programs surveyed by Nakata (2011), the most common number of distractors is three. The distractors were chosen randomly from the correct responses for other target items in order, again, to reflect a real-life study situation: A survey of flashcard programs (Nakata, 2011) showed that in four out of the five programs that automatically generate distractors for multiple-choice questions, distractors are chosen randomly from the correct responses for other target items. The position of the correct response among the four options was determined randomly by the flashcard program for each question.
The I DON’T KNOW option was also given in the recognition format, and participants were instructed to choose this option if they were not sure about the correct answer. The I DON’T KNOW option was used for two reasons. First, it may discourage the participants from randomly guessing. Second, the provision of this option may reduce the risk of learning incorrect information from recognition questions. In other words, when learners are forced to select an option even when they are not confident, they sometimes confuse a distractor with the correct response and may learn false information. The I DON’T KNOW option was used because it may reduce this risk (Marsh, Agarwal, & Roediger, 2009; Marsh, Roediger, Bjork, & Bjork, 2007; Roediger, Agarwal, Kang, & Marsh, 2010; Roediger, Putnam, & Smith, 2011).
The distractors were determined randomly by the flashcard program with two constraints. First, an item that was used as a distractor for a given item in the recognition posttest was not chosen as a distractor for the same item during the treatment. For instance, as kamba, handaki, and sumu were used as distractors for hadithi in the posttest, these three words were not selected as distractors for hadithi during retrieval practice. This is because if the distractors used during learning match those used during the posttest, it may make the posttest easier. Second, for each participant, 30 out of the 60 target items were used as a distractor four times throughout the treatment, whereas the other 30 were used as a distractor five times. This was done to control the frequency of exposure to the target words.
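The way a single recognition question may be assembled under the first constraint can be sketched as follows. This is not the author's program: the function name, the placeholder items (word5, word6, and so on), and the placement of the I DON'T KNOW option after the four alternatives are assumptions made for illustration, and the second constraint (balancing how often each item serves as a distractor) is not implemented. The final constants only check that the reported frequencies are mutually consistent: 90 recognition questions with three distractors each give 270 distractor slots, which equals 30 items used four times plus 30 items used five times.

```python
import random

DONT_KNOW = "I DON'T KNOW"


def build_question(target, candidates, posttest_distractors, rng):
    """Return the options shown for one recognition question.

    candidates: correct responses of all target items (including target).
    posttest_distractors: responses reserved as distractors for this target
        on the recognition posttest; they are not reused during treatment.
    """
    pool = [
        c for c in candidates
        if c != target and c not in posttest_distractors
    ]
    options = [target] + rng.sample(pool, 3)  # one correct + three distractors
    rng.shuffle(options)                      # random position of the answer
    return options + [DONT_KNOW]              # assumed to be listed last


# Recognition questions per participant: 15 items x 4 questions in the
# recognition condition plus 15 items x 2 questions in the combined condition.
RECOGNITION_QUESTIONS = 15 * 4 + 15 * 2
DISTRACTOR_SLOTS = RECOGNITION_QUESTIONS * 3

rng = random.Random(0)
words = ["hadithi", "kamba", "handaki", "sumu"] + [f"word{i}" for i in range(5, 11)]
print(build_question("hadithi", words, {"kamba", "handaki", "sumu"}, rng))
```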
Order of items
As noted earlier, the target items were encountered five times throughout the treatment and repeated in a cycle of 60 items. The order of items was randomised anew for each cycle so that the item order would not offer inappropriate help in remembering (Mondria & Mondria-de Vries, 1994; Nation, 2001, pp. 306-307). The item order was determined randomly with two constraints. First, as described earlier, the items from four sets (Sets A to D) occurred once every four items (e.g., ABCD ABCD ABCD …). Second, encounters of a given item were separated by at least 28 other items. For instance, if a given item appeared as the last item in the first cycle, it did not appear until the 29th position in the second cycle. A smaller interval was not used because larger spacing generally leads to better long-term retention than shorter spacing (lag effect; see 4.1.1).
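One way to generate item orders that satisfy both constraints is sketched below. The algorithm is an assumption made for illustration (a greedy random construction), not necessarily the one used by the flashcard program, and the item labels such as A03 are hypothetical. Because items from the same set sit four positions apart, an item in slot t of one cycle and slot s of the next is separated by 59 + 4(s - t) other items, so the 28-item minimum holds whenever s is no more than seven slots earlier than t.

```python
import random

SETS = "ABCD"
ITEMS_PER_SET = 15
CYCLE_LEN = len(SETS) * ITEMS_PER_SET  # 60 items per cycle
MIN_GAP = 28                           # other items between two encounters


def next_slots(prev_slots, rng):
    """Order one set's items for the next cycle.

    prev_slots[t] is the item that occupied slot t in the previous cycle.
    An item may move at most max_back slots earlier than before.
    """
    max_back = (CYCLE_LEN - 1 - MIN_GAP) // len(SETS)  # 7
    remaining = dict(enumerate(prev_slots))            # previous slot -> item
    order = []
    for s in range(ITEMS_PER_SET):
        # At least one unplaced item always qualifies, so this never fails.
        eligible = [t for t in remaining if t <= s + max_back]
        order.append(remaining.pop(rng.choice(eligible)))
    return order


def build_cycles(n_cycles=5, seed=0):
    """Return n_cycles item orders following the ABCD ABCD ... pattern."""
    rng = random.Random(seed)
    slots = {k: [f"{k}{i:02d}" for i in range(ITEMS_PER_SET)] for k in SETS}
    for k in SETS:
        rng.shuffle(slots[k])
    cycles = []
    for c in range(n_cycles):
        if c > 0:
            slots = {k: next_slots(slots[k], rng) for k in SETS}
        # Interleave the sets: slot 0 of A, B, C, D, then slot 1, and so on.
        cycles.append([slots[k][s] for s in range(ITEMS_PER_SET) for k in SETS])
    return cycles


cycles = build_cycles()
print(cycles[0][:8])
```

The greedy construction always completes, but it does not sample uniformly from all admissible orders; a program could equally reshuffle until both constraints are met.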
3.4.5 Target words
Sixty Swahili-English word pairs (15 word pairs per condition) selected from Nelson and Dunlosky (1994) were used as target items. The number of target items was determined for the following reasons: First, in pilot studies which involved 40 word pairs (10 word pairs per condition), a ceiling effect was observed on the recognition posttests. Therefore, the number of target items needed to be larger than 40. Second, when 60 items are used, the first data collection session takes approximately 90 minutes (see 3.4.3). Considering participants’ fatigue, it was not considered feasible to use more than 60 items. With the above considerations in mind, 60 word pairs were used as target items. The 60 word pairs were divided into Sets A to D so that the learning difficulty would be distributed as evenly as possible. More specifically, the four sets of items were matched for the following four variables: 1) Nelson and Dunlosky’s (1994) difficulty norms, 2) L2 word length, 3) pronounceability of L2 words, and 4) orthographic similarity of L2 words to the L1 lexicon.
1) Nelson and Dunlosky’s (1994) difficulty norms
In Nelson and Dunlosky (1994), 200 native speakers of English studied 100 Swahili-English word pairs, and their knowledge was assessed by three receptive recall tests (Test 1 - 3). Scores on these three tests (hereafter, difficulty norms) were used as a measure of item difficulty because studies suggest that they may be a useful index of learning difficulty for native speakers of English (e.g., Bahrick & Hall, 2005; Karpicke & Roediger, 2008; Pyc & Rawson, 2007, 2009, 2011).
2) L2 word length
Although there may be some exceptions (Laufer, 1997), previous studies have found that short words are easier to remember than long ones, a phenomenon known as the word length effect (e.g., Baddeley et al., 1975; Jalbert et al., 2011; Watkins, 1972). With this in mind, the four sets of items were also controlled for L2 word length. L2 word length was operationalised as the number of letters and syllables in Swahili words.
3) Pronounceability of L2 words
Studies have suggested that the pronounceability of L2 words may affect vocabulary learning (Crothers & Suppes, 1967; de Groot & van Hell, 2005; Ellis & Beaton, 1993; Gathercole, Frankish, Pickering, & Peaker, 1999). Wordlikeness ratings reported by Nelson and Dunlosky (1994) were used as an index of pronounceability of Swahili words.
4) Orthographic similarity to the L1 lexicon Previous research suggests that orthographic similarity of L2 words to the L1 lexicon may influence the learnability of L2 words (de Groot, 2006; Ellis & Beaton, 1993; Laufer, 1997; Roodenrys & Hinton, 2002). In other words, the more closely L2 words follow the orthographic patterns of L1 vocabulary, the easier they are to learn. The average positional bigram frequency (Ellis & Beaton, 1993) and orthographic neighbourhood size (Jalbert et al., 2011) were used as measures of orthographic similarity to the L1 (English) lexicon. The former is calculated by averaging the positional frequency of all bigrams (two-letter strings) in a given word (Medler & Binder, 2005). Swahili words with high bigram frequency are orthographically similar to English words and thus presumably easier for English speakers to acquire. Orthographic neighbourhood size refers to the number of words that have the same number of letters as a given word and are identical to it except for one letter (Jalbert et al., 2011). For instance, the Swahili word leso has three orthographic neighbours in English: less, lest, and peso. L2 words with many orthographic neighbours are similar to English words and may be easy to acquire (Roodenrys & Hinton, 2002). Both the average bigram frequency and orthographic neighbourhood size were calculated using MCWord: An Orthographic Wordform Database (Medler & Binder, 2005).
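The definition of orthographic neighbourhood size can be illustrated with a minimal Python sketch. It is not the MCWord implementation: the function name and the toy lexicon are assumptions made for illustration, and a real count would require a full English word list.

```python
def orthographic_neighbours(word, lexicon):
    """Return lexicon words with the same length as `word` that differ
    from it in exactly one letter position."""
    word = word.lower()
    neighbours = set()
    for candidate in lexicon:
        candidate = candidate.lower()
        if len(candidate) != len(word) or candidate == word:
            continue
        # Count the letter positions at which the two words differ.
        mismatches = sum(a != b for a, b in zip(word, candidate))
        if mismatches == 1:
            neighbours.add(candidate)
    return sorted(neighbours)


# Hypothetical toy lexicon; only less, lest, and peso are one letter away from leso.
toy_lexicon = ["less", "lest", "peso", "lens", "loss", "best", "lesson"]
print(orthographic_neighbours("leso", toy_lexicon))  # ['less', 'lest', 'peso']
```

The neighbourhood size is then simply the length of the returned list (three for leso with this toy lexicon).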
The 60 word pairs were divided into Sets A to D so that they would be matched for the above four variables. Appendix D presents Swahili-English word pairs in the four sets. Table 17 summarises the variables that may affect the difficulty of the target word pairs. The Kruskal-Wallis nonparametric tests showed that no statistically significant difference existed among the four sets in any of the variables, H (3) < 3.93, p > .270. Although the lack of statistical significance does not necessarily guarantee that the four sets are completely equivalent in their difficulty, it was judged that a possible difference, if any, might not have a major effect on the results of the present study because effects of target items would be counterbalanced across participants (Figure 9).
Table 17
Variables Related to Difficulty of Target Word Pairs

               Nelson & Dunlosky's         L2 word length        Orthographic similarity          Pronounceability
               difficulty norms                                  to L1 lexicon
Item           Test 1  Test 2  Test 3      Letters  Syllables    Average bigram  Neighbourhood    Wordlikeness
set                                                              frequency       size             rating
A      M        0.11    0.39    0.62        5.47      2.47          344.52          0.73             2.60
       SD       0.05    0.12    0.11        0.99      0.52          322.65          1.39             0.74
B      M        0.12    0.39    0.62        5.53      2.60          418.67          0.67             2.53
       SD       0.06    0.11    0.08        1.13      0.51          341.58          1.18             0.64
C      M        0.14    0.40    0.62        5.47      2.53          567.22          0.73             2.53
       SD       0.09    0.15    0.13        1.36      0.74          329.55          1.58             0.64
D      M        0.12    0.40    0.60        5.40      2.40          441.37          0.73             2.47
       SD       0.10    0.14    0.11        1.30      0.74          353.53          1.10             0.64

Note. n = 15 for each item set.
3.4.6 Dependent measures
Four types of posttests were given: productive recall, productive recognition, receptive recall, and receptive recognition (Laufer et al., 2004; Laufer & Goldstein, 2004). All 60 target words were tested, and each posttest consisted of 60 questions. Unlike the treatment, feedback was not provided in the posttest. Other than that, the posttests were exactly the same as the corresponding retrieval formats in the treatment (Figure 11). Four posttests were given to avoid favouring one learning condition over others. For instance, if only the productive recall test is given as the posttest, the results might be biased in favour of the highest difficulty condition, where only productive recall is used.
The four tests were administered in the following order: productive recall, productive recognition, receptive recall, and receptive recognition. A productive recall test was followed by a productive recognition test because correct responses in the former test were given as multiple-choice options in the latter, and administering the recognition test first may affect performance on the recall test. A receptive recall test was followed by a receptive recognition test for the same reason. The productive tests were given prior to the receptive tests because administering the receptive tests earlier may have a larger effect on test scores than doing it the other way around (see 2.2.10). In order to reduce effects of the productive posttests on the receptive tests, the productive recognition test was followed by a 3-minute filler task (two-digit additions). The filler task was exactly the same as the one given immediately after the treatment (see 3.4.3) except that it lasted for 3 minutes instead of 5. The order of items in the posttests was different from that in the treatment in order to make sure that learners would not use the item order as an aid for remembering. There were eight posttests in total (productive recall, productive recognition, receptive recall, and receptive recognition tests for the immediate and delayed posttests), and a different randomised order of items was used within each posttest.
The immediate posttest was given on the same day as the treatment. In order to measure retention, the delayed posttest was administered 1 week after the treatment. The RI (retention interval) of 1 week was chosen because previous studies suggest that scores on a 1-week delayed posttest may be a good indication of retention over time (see 2.2.10). The delayed posttest differed from the immediate posttest in two respects. First, as described earlier, the order of items differed from that in the immediate posttest. Second, although the same multiple-choice options were used for a given item in the immediate and delayed posttests, the computer program changed the order of multiple-choice options for the two posttests. For instance, in the immediate receptive recognition test, for the Swahili target word hadithi (story), the multiple-choice options were given in the following order: story, trench, poison, and rope. In the delayed posttest, the order was changed to poison, story, rope, and trench so that the position of the correct answer would not offer inappropriate help. Other than this, the delayed posttest was exactly the same as the immediate posttest.
In the recognition posttests, three distractors were used per question as in the treatment. In order to discourage the participants from randomly guessing, the I DON’T KNOW option was also used. The distractors were fixed for all participants so that the difficulty of the posttests would be equivalent. The distractors were determined randomly by the computer program according to four rules. First, distractors were chosen from the correct answers for other target items. This is because participants took the recall posttest prior to the recognition posttest. If words other than the correct responses for other target items are used as distractors, participants can arrive at the correct answer simply by eliminating words that they did not see on the recall test. For instance, suppose in the receptive recognition posttest, king, prince, and princess are used as distractors for malkia (queen). Note that none of these distractors is the correct English translation for any of the target words. Participants can guess that these three distractors cannot be the correct response because none of them appeared as a cue in the productive recall test. Therefore, participants can infer that only queen can be the correct response without even looking at the cue, malkia. For this reason, distractors were chosen from the correct answers for other target items. Second, each correct response was used as a distractor three times throughout each recognition posttest to control the exposure frequency. Third, a word pair that was used as the cue, correct response, or distractor for a given item was not used as a distractor for at least the following seven items. For instance, as jibini (cheese) was used as the correct response for the first item on the immediate productive recognition test, it was not used as a distractor until the ninth item on the same test. This was done because if a given item is used as a distractor immediately after it is used as the correct response, it may affect posttest performance. Fourth, the distractors used in the receptive recognition test were English translations of the distractors used in the productive recognition test (Laufer et al., 2004; Laufer & Goldstein, 2004). The position of the correct response was chosen randomly by the computer program so that the correct responses would be distributed evenly across the four options. That is, throughout each posttest, correct responses appeared in each of the four options 15 times (25% of the 60 questions). See Appendix E for examples of the posttests.
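The even distribution of correct-answer positions described above can be sketched in Python. This is a minimal illustration, not the program used in the study: the function name, the shuffle-based approach, and the seed parameter are assumptions.

```python
import random


def balanced_answer_positions(n_questions=60, n_options=4, seed=None):
    """Return one correct-answer position (0-based) per question so that
    every position is used equally often, in random order."""
    if n_questions % n_options != 0:
        raise ValueError("n_questions must be a multiple of n_options")
    per_position = n_questions // n_options
    # Build a pool with each position repeated equally often, then shuffle it.
    positions = [pos for pos in range(n_options) for _ in range(per_position)]
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    rng.shuffle(positions)
    return positions


positions = balanced_answer_positions(seed=1)
# Each of the four positions is used 15 times (25% of the 60 questions).
print([positions.count(p) for p in range(4)])  # [15, 15, 15, 15]
```

Shuffling a pre-balanced pool guarantees the 15-per-position constraint, whereas drawing each position independently at random would only approximate it.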
3.4.7 Scoring
Responses on the recognition posttests were scored as either correct or incorrect. Responses on the productive recall posttest were scored using the same procedures as in Study 1: strict and sensitive (see 2.2.11 for more details and the justification of this procedure). Responses on the receptive recall posttest were scored based on four rules. First, responses with spelling mistakes were marked as correct (e.g., cinammon for cinnamon). Second, plural forms of the target word were marked as correct (e.g., ornaments for ornament). Third, homonyms of the target word were scored as incorrect (e.g., yolk for yoke). Fourth, responses were marked as incorrect if they were of a different part of speech from the translation given during the treatment (e.g., scientific for science). The responses that fell into the above four categories were rare (0.89% and 0.84% of all responses in the immediate and delayed receptive recall posttests, respectively) and did not have a large effect on the overall results.
3.5 Results
Study time
First, let us examine whether the study time was comparable among the four learning conditions. Table 18 summarises the study time as a function of the condition and feedback pacing. For instance, the table shows that in the computer-paced group, the average study time in the recognition condition was 10.66 minutes. This consisted of the time for the initial presentation (7 seconds x 15 items = 1.75 minutes), feedback (5 seconds x 15 items x 4 retrievals = 5.00 minutes), and four retrieval trials (10.66 minutes - 1.75 minutes - 5.00 minutes = 3.91 minutes) for 15 items.
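The decomposition of study time in the computer-paced group can be checked with a short Python sketch. The function and parameter names are illustrative assumptions; the durations are those stated above.

```python
def decompose_study_time(total_min, n_items=15, presentation_s=7,
                         feedback_s=5, n_retrievals=4):
    """Split total study time (minutes) into initial presentation,
    feedback, and retrieval components for one learning condition."""
    presentation_min = presentation_s * n_items / 60
    feedback_min = feedback_s * n_items * n_retrievals / 60
    # Whatever remains after presentation and feedback is retrieval time.
    retrieval_min = total_min - presentation_min - feedback_min
    return presentation_min, feedback_min, retrieval_min


presentation, feedback, retrieval = decompose_study_time(10.66)
print(presentation, feedback, round(retrieval, 2))  # 1.75 5.0 3.91
```

With 15 items and four retrievals each (60 retrieval trials), 3.91 minutes of retrieval time corresponds to about 3.91 seconds per retrieval attempt.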
In order to test whether any statistically significant difference existed among the four conditions, the study time was analysed by a two-way mixed design 4 (condition: recognition / recall / combined / highest difficulty) x 2 (feedback pacing: self-paced / computer-paced) ANOVA. The ANOVA showed significant main effects of condition, F (2, 101) = 59.85, p < .001, partial η2 = .49, and feedback pacing, F (1, 62) = 45.08, p < .001, partial η2 = .42. The interaction between the learning condition and feedback pacing was not significant, F (2, 101) = 2.12, p = .135, partial η2 = .03. These results indicate the following two things. First, the significant main effect of condition suggests that a statistically significant difference existed in study time among the four conditions. Contrasts were performed to investigate where the significant differences lay (Field, 2009). It was revealed that when collapsed across the two feedback pacing groups, all four conditions were significantly different from each other regarding their study time (p < .008; recall > highest difficulty > combined > recognition), producing medium to large effect sizes (.33 < r < .83). The study time in the recall and highest difficulty conditions was longer than in the other two probably because the former consisted only of recall questions, and typing the answer in a recall format took more time than selecting the answer in a recognition format. The recognition condition required the least study time probably because it involved only recognition formats, which tend to take less time than recall formats. Second, the significant main effect of feedback pacing indicates that the computer-paced group used significantly more time than the self-paced group. Computer-paced feedback increased the total study time by 12.86 minutes (48.79 - 35.93), which corresponds to 12.86 seconds per target word (12.86 minutes x 60 seconds / 60 target words = 12.86 seconds).
Table 18
Study Time (Minutes) by Condition and Feedback Pacing

                                            Learning conditions
                    Recognition       Recall          Combined      Highest difficulty      Total
Feedback             M      SD       M      SD       M      SD         M      SD          M      SD
Computer-paced     10.66   0.96    13.46   2.12    12.04   1.11      12.63   1.60       48.79   4.94
Self-paced          7.61   2.02     9.66   2.99     8.94   2.30       9.72   3.02       35.93   9.64
Total               9.14   2.20    11.56   3.20    10.49   2.38      11.18   2.81       42.36   9.99

Note. n = 32 for each of the computer- and self-paced groups.
Pilot studies suggested that when feedback is self-paced, learners may spend more time on feedback after recall questions than after recognition questions (see 3.3). The feedback duration in the self-paced group was analysed to examine whether similar results were obtained in this experiment. In the self-paced group, the average feedback duration (SDs in parentheses) was 2.10 (1.37) and 2.42 (1.29) seconds per response for the recognition and recall formats, respectively. The difference was statistically significant, t (31) = -4.14, p < .001, and a large effect size was observed (r = .60). The results seem to support the prediction that recall questions may require a longer feedback duration than recognition. The feedback duration in the four learning conditions was also compared. In the self-paced group, the average feedback duration was 2.01 (1.37), 2.41 (1.29), 2.21 (1.34), and 2.57 (1.30) seconds per response in the recognition, recall, combined, and highest difficulty conditions, respectively. According to a one-way repeated measures ANOVA, a statistically significant difference existed among the four conditions, F (1.36, 42.31) = 28.12, p < .001, η2 = .48 (Footnote 7). Contrasts revealed that all four conditions were significantly different from each other (p < .001), producing large effect sizes for all comparisons (.54 < r < .73). When collapsed across the four conditions, the average feedback duration was 2.30 (1.32) seconds per response in the self-paced group. This means that the self-paced group spent 2.70 seconds less (5.00 - 2.30 = 2.70) on feedback per response compared with the computer-paced group, where the feedback duration was fixed to 5 seconds.
Footnote 7. As Mauchly's test showed that sphericity assumptions were violated, the Greenhouse-Geisser correction was used (Field, 2009). As a result, the dfs for the ANOVA contain decimal values.
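The effect size r reported for the t test above is consistent with the conventional conversion r = sqrt(t^2 / (t^2 + df)). The following Python sketch assumes that this conversion was used; the function name is illustrative.

```python
import math


def r_from_t(t, df):
    """Convert a t statistic and its degrees of freedom to the effect size r."""
    return math.sqrt(t ** 2 / (t ** 2 + df))


# t (31) = -4.14 for the recognition vs. recall feedback durations.
print(round(r_from_t(-4.14, 31), 2))  # 0.6
```

Because t is squared, the sign of the statistic does not affect r.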
Learning phase performance
There were four retrieval attempts for each target word during the treatment (see 3.4.4). Table 19 summarises the number of correct responses for the four retrieval attempts. For instance, the table shows that in the computer-paced group, the average number of correct responses in the recognition condition was 10.53, 11.53, 13.19, and 14.19 out of 15 for the first, second, third, and fourth retrieval attempts, respectively. In order to determine whether any significant difference existed among the four conditions, the average number of correct responses across the four retrieval attempts (Average in Table 19) was submitted to a two-way mixed design 4 (learning conditions) x 2 (feedback pacing) ANOVA. The ANOVA detected significant main effects of condition, F (1.85, 114.73) = 363.59, p < .001, partial η2 = .85, and feedback pacing, F (1, 62) = 5.57, p = .021, partial η2 = .08. The interaction between the two variables was also significant, F (1.85, 114.73) = 6.48, p = .003, partial η2 = .09. The significant main effect of condition indicates that a statistically significant difference existed somewhere between the four conditions. The significant main effect of feedback pacing suggests that the computer-paced group significantly outperformed the self-paced group.
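The partial η2 values reported for this ANOVA can be recovered from the F ratios and their degrees of freedom with the standard identity partial η2 = (F x df_effect) / (F x df_effect + df_error). The Python sketch below is only a consistency check; the function name is an illustrative assumption.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Learning phase performance: condition, feedback pacing, and their interaction.
print(round(partial_eta_squared(363.59, 1.85, 114.73), 2))  # 0.85
print(round(partial_eta_squared(5.57, 1, 62), 2))           # 0.08
print(round(partial_eta_squared(6.48, 1.85, 114.73), 2))    # 0.09
```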
Table 19
Number of Correct Responses During the Learning Phase

                                                    Retrieval attempts
                                     1              2              3              4           Average
Feedback          Conditions      M     SD       M     SD       M     SD       M     SD       M     SD
Computer-paced    Recognition   10.53  2.95    11.53  2.82    13.19  2.19    14.19  1.47    12.36  2.00
(n = 32)          Recall         2.22  2.11     5.59  3.21     5.09  3.68     7.44  3.71     5.09  2.80
                  Combined       9.53  3.28    11.78  2.38     7.81  3.62     6.59  3.28     8.93  2.70
                  Difficulty     0.97  2.15     3.16  2.91     5.78  3.92     8.16  3.66     4.52  2.88
Self-paced        Recognition   10.44  2.88    11.56  2.76    12.53  2.70    13.34  2.31    11.97  2.38
(n = 32)          Recall         2.28  2.29     4.69  3.53     4.09  3.43     6.19  3.71     4.31  2.99
                  Combined      10.44  2.78    11.19  2.81     6.38  3.72     5.13  3.38     8.28  2.80
                  Difficulty     0.97  1.06     2.78  2.93     4.88  3.82     5.81  4.12     3.61  2.84
Total             Recognition   10.48  2.89    11.55  2.77    12.86  2.46    13.77  1.97    12.16  2.19
(n = 64)          Recall         2.25  2.18     5.14  3.38     4.59  3.56     6.81  3.73     4.70  2.90
                  Combined       9.98  3.05    11.48  2.60     7.09  3.71     5.86  3.38     8.61  2.75
                  Difficulty     0.97  1.68     2.97  2.90     5.33  3.87     6.98  4.05     4.06  2.87

Note. The maximum score is 15 for each cell. Responses in productive recall questions were scored with the strict scoring procedure (see 3.4.7). Difficulty = highest difficulty condition.
Due to the significant main effect of condition, contrasts were used to investigate where the significant differences lay. The contrasts showed that the recognition condition produced the largest number of correct responses followed by the combined, recall, and highest difficulty conditions. All four conditions were significantly different from each other (p < .001), producing medium to large effect sizes (.39 < r < .97). Although the interaction between the condition and feedback was also significant, it will not be discussed here because the effect size was relatively small (partial η2 = .09) compared with the main effect of condition (partial η2 = .85).
Posttest performance
Table 20 provides the immediate and delayed posttest results for the four conditions. The productive and receptive recall test scores were analysed by a three-way mixed design 4 (learning conditions) x 2 (feedback pacing) x 2 (RI: immediate / 1-week delayed) ANOVA. For the recognition posttests, a non-parametric Friedman test was used instead of ANOVA. This is because the coefficients of skewness or kurtosis for the recognition test scores were greater than 2 or smaller than -2, indicating that the distributions of these scores were significantly different from the normal distribution and that the assumption of normality was not met.
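The normality screen described above can be sketched in Python. This is a minimal illustration under stated assumptions: the thesis does not say which skewness and kurtosis estimators were used, so the moment-based (population) coefficients below, the function names, and the sample scores are all hypothetical.

```python
def skewness_and_excess_kurtosis(scores):
    """Return moment-based skewness and excess kurtosis of a list of scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("at least two scores are required")
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    if m2 == 0:
        raise ValueError("scores have zero variance")
    m3 = sum((x - mean) ** 3 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3


def looks_non_normal(scores, threshold=2.0):
    """Flag a distribution whose skewness or kurtosis lies outside +/- threshold."""
    skew, kurt = skewness_and_excess_kurtosis(scores)
    return abs(skew) > threshold or abs(kurt) > threshold


# Hypothetical recognition scores piled up at the maximum of 15.
ceiling_scores = [15] * 12 + [14] * 3 + [5]
print(looks_non_normal(ceiling_scores))   # True
print(looks_non_normal([1, 2, 3, 4, 5]))  # False
```

Scores clustered at the maximum with a few low outliers produce strong negative skew, which is the pattern that would push a recognition posttest outside the stated range.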
Table 20
Number of Correct Responses on the Posttests

Immediate posttest
                                         Recognition      Recall         Combined       Difficulty
Feedback          Posttests               M     SD       M     SD       M     SD       M     SD
Computer-paced    P recall (strict)      7.41  3.18     9.22  3.92     8.78  3.45     9.66  3.72
(n = 32)          P recall (sensitive)  11.88  2.94    12.03  3.02    11.53  3.11    11.50  3.54
                  P recognition         14.28  1.46    14.22  1.54    14.44  1.24    14.09  2.41
                  R recall              11.84  2.45    12.22  2.97    12.16  2.65    12.47  2.93
                  R recognition         14.25  1.74    14.22  1.64    14.38  1.41    14.19  1.96
Self-paced        P recall (strict)      6.06  3.56     7.44  4.27     6.66  3.98     7.25  4.28
(n = 32)          P recall (sensitive)   9.59  4.34     9.97  4.45     9.84  4.51    10.00  4.36
                  P recognition         13.22  2.70    13.63  2.11    13.56  2.30    13.09  2.88
                  R recall              10.25  3.72    10.28  4.07    10.59  3.76    10.34  4.12
                  R recognition         13.44  2.50    13.31  3.09    13.16  2.69    13.31  3.11
Total             P recall (strict)      6.73  3.42     8.33  4.16     7.72  3.85     8.45  4.16
(n = 64)          P recall (sensitive)  10.73  3.85    11.00  3.92    10.69  3.94    10.75  4.01
                  P recognition         13.75  2.22    13.92  1.85    14.00  1.89    13.59  2.68
                  R recall              11.05  3.23    11.25  3.67    11.38  3.32    11.41  3.71
                  R recognition         13.84  2.18    13.77  2.50    13.77  2.22    13.75  2.61

Delayed posttest
                                         Recognition      Recall         Combined       Difficulty
Feedback          Posttests               M     SD       M     SD       M     SD       M     SD
Computer-paced    P recall (strict)      3.41  2.71     4.22  2.88     3.63  3.51     4.53  3.52
(n = 32)          P recall (sensitive)   5.97  3.70     6.97  3.43     6.28  3.85     6.56  4.00
                  P recognition         13.47  2.21    13.47  2.37    13.59  1.88    13.59  2.15
                  R recall              11.34  2.68    11.94  2.94    11.38  2.83    11.88  3.20
                  R recognition         14.06  1.76    13.97  2.51    14.13  1.77    13.88  2.37
Self-paced        P recall (strict)      2.47  2.31     3.53  3.74     2.88  2.66     3.97  3.10
(n = 32)          P recall (sensitive)   5.09  3.41     5.81  4.52     5.31  3.78     6.16  4.24
                  P recognition         12.28  3.25    12.31  3.13    12.44  3.12    12.41  3.43
                  R recall              10.06  3.46     9.97  4.23    10.25  3.85     9.88  3.92
                  R recognition         12.78  3.10    13.03  2.76    13.03  2.69    13.06  3.09
Total             P recall (strict)      2.94  2.54     3.88  3.33     3.25  3.11     4.25  3.30
(n = 64)          P recall (sensitive)   5.53  3.56     6.39  4.02     5.80  3.82     6.36  4.09
                  P recognition         12.88  2.82    12.89  2.81    13.02  2.62    13.00  2.91
                  R recall              10.70  3.14    10.95  3.75    10.81  3.40    10.88  3.69
                  R recognition         13.42  2.58    13.50  2.66    13.58  2.33    13.47  2.76

Collapsed across the RIs
                                         Recognition      Recall         Combined       Difficulty
Feedback          Posttests               M     SD       M     SD       M     SD       M     SD
Computer-paced    P recall (strict)      5.41  2.65     6.72  3.02     6.20  3.18     7.09  3.29
(n = 32)          P recall (sensitive)   8.92  2.99     9.50  2.90     8.91  3.19     9.03  3.38
                  P recognition         13.88  1.75    13.84  1.89    14.02  1.45    13.84  2.22
                  R recall              11.59  2.42    12.08  2.88    11.77  2.62    12.17  2.92
                  R recognition         14.16  1.72    14.09  2.03    14.25  1.55    14.03  2.11
Self-paced        P recall (strict)      4.27  2.72     5.48  3.72     4.77  3.03     5.61  3.44
(n = 32)          P recall (sensitive)   7.34  3.58     7.89  4.20     7.58  3.80     8.08  4.01
                  P recognition         12.75  2.90    12.97  2.53    13.00  2.63    12.75  3.09
                  R recall              10.16  3.53    10.13  4.12    10.42  3.75    10.11  3.97
                  R recognition         13.11  2.72    13.17  2.84    13.09  2.63    13.19  3.02
Total             P recall (strict)      4.84  2.73     6.10  3.42     5.48  3.16     6.35  3.42
(n = 64)          P recall (sensitive)   8.13  3.36     8.70  3.67     8.24  3.54     8.55  3.71
                  P recognition         13.31  2.45    13.41  2.26    13.51  2.17    13.30  2.72
                  R recall              10.88  3.09    11.10  3.66    11.09  3.28    11.14  3.61
                  R recognition         13.63  2.32    13.63  2.49    13.67  2.22    13.61  2.62

Note. The maximum score is 15 for each cell. Difficulty = highest difficulty condition. P = productive; R = receptive.
Table 21
Results of Three-Way ANOVAs for the Recall Posttest Scores

Productive recall: Strict
Effects                       df       F        p     partial η2
Feedback                      1        3.23    .077     .05
RI                            1      190.63    .000     .76
Condition                     3       17.98    .000     .23
RI X Feedback                 1        3.71    .059     .06
Condition X Feedback          3        0.26    .853     .00
RI X Condition                2.60     1.88    .143     .03
RI X Condition X Feedback     2.60     1.73    .170     .03

Productive recall: Sensitive
Effects                       df       F        p     partial η2
Feedback                      1        2.65    .109     .04
RI                            1      212.45    .000     .77
Condition                     2.75     2.74    .050     .04
RI X Feedback                 1        2.48    .120     .04
Condition X Feedback          2.75     0.91    .429     .01
RI X Condition                2.63     2.12    .107     .03
RI X Condition X Feedback     2.63     0.37    .750     .01

Receptive recall
Effects                       df       F        p     partial η2
Feedback                      1        4.70    .034     .07
RI                            1       13.36    .001     .18
Condition                     2.54     0.56    .614     .01
RI X Feedback                 1        0.79    .377     .01
Condition X Feedback          2.54     1.26    .290     .02
RI X Condition                3        0.63    .597     .01
RI X Condition X Feedback     3        0.38    .768     .01

Note. Some dfs contain decimal values due to the Greenhouse-Geisser correction (Field, 2009).
Table 21 shows the results of the ANOVAs conducted for the recall posttest scores. The table indicates the following four things. First, the ANOVAs showed a significant main effect of RI on both productive and receptive recall posttests. In other words, the delayed posttest scores were significantly lower than the immediate posttest scores on both recall tests. Second, the main effect of feedback pacing proved significant on the receptive recall test (p = .034), but fell short of significance on the productive recall test (p = .077 with strict and p = .109 with sensitive scoring). This means that the computer-paced group significantly outperformed the self-paced group on the receptive recall test, but not on the productive recall. Third, the main effect of condition was significant (p < .001) with strict scoring, and approached significance (p = .050) with sensitive scoring on the productive recall test, but was not significant on the receptive recall test (p = .614). Lastly, none of the interactions was significant on either the productive or receptive recall posttest.
As the main effect of condition proved significant with strict scoring, and approached significance with sensitive scoring on the productive recall test, contrasts were performed to investigate where the significant differences lay. Table 22 presents the F values, p values, and effect sizes r for the pair-wise contrasts. For instance, the table shows that with strict scoring, the difference between the recall and recognition conditions was statistically significant, F (1, 62) = 23.00, p < .001, r = .52. Contrasts were not performed for the receptive recall test results because the main effect of condition was not significant on this test (Table 21).
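For contrasts with one numerator degree of freedom, the effect size r can be obtained from F with the conventional conversion r = sqrt(F / (F + df_error)). The Python sketch below assumes that this conversion underlies the values in Table 22; the function name is illustrative.

```python
import math


def r_from_contrast_f(f_value, df_error):
    """Convert an F ratio with 1 numerator df to the effect size r."""
    return math.sqrt(f_value / (f_value + df_error))


# Recall vs. recognition with strict scoring: F (1, 62) = 23.00.
print(round(r_from_contrast_f(23.00, 62), 2))  # 0.52
```

The same call with F = 42.28 gives .64, matching the largest contrast reported with strict scoring.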
Table 22
Results of Pair-Wise Contrasts on the Productive Recall Posttest

                                      Recognition              Recall                 Combined
Scoring     Conditions              F       p      r         F       p      r        F       p      r
Strict      Recall                23.00   .000   .52
            Combined               9.73   .003   .37        7.31   .009   .32
            Highest difficulty    42.28   .000   .64        1.33   .253   .14      18.58   .000   .48
Sensitive   Recall                 5.18   .026   .28
            Combined               0.28   .596   .07        3.86   .054   .24
            Highest difficulty     2.97   .090   .21        0.50   .483   .09       2.10   .152   .18

Note. df = (1, 62). Effect sizes (r) of .10, .30, and .50 are indicative of small, medium, and large effects, respectively (Cohen, 1988).
First, let us consider the results with the strict scoring method. With strict scoring, the recall and highest difficulty conditions significantly outperformed the recognition and combined conditions, and medium to large effect sizes were observed (.32 < r < .64). The difference between the recall and highest difficulty conditions was rather small as indicated by the lack of statistical significance (p = .253) as well as the small effect size (r = .14). The contrasts also showed that the combined condition was significantly more effective than the recognition condition, producing a medium sized effect (r = .37). Taken together, the findings suggest the following order with strict scoring on the productive recall posttest: recall = highest difficulty > combined > recognition.
When the sensitive scoring procedure was used, the difference among the four conditions was smaller. With sensitive scoring, the recall and highest difficulty conditions fared better than the recognition and combined conditions as well. Yet, a statistically significant difference existed only between the recall and recognition conditions. Furthermore, only small effect sizes (.18 < r < .28) were found between the recall and highest difficulty conditions on one hand and the recognition and combined conditions on the other. These results indicate that although the recall and highest difficulty conditions resulted in higher scores than the recognition and combined conditions with sensitive scoring, the advantage was smaller compared with when the strict method was used. No significant difference existed between the recall and highest difficulty conditions (p = .483, r = .09) or between the recognition and combined conditions either (p = .596, r = .07), producing very small effect sizes.
Let us now examine the recognition posttest results. As noted earlier, for the recognition tests, a non-parametric Friedman test was used instead of ANOVA because the assumption of the normal distribution was violated. Table 23 summarises the results of the Friedman tests. The Friedman tests detected no statistically significant difference among the four conditions regardless of the feedback pacing or RI. Next, in order to investigate the effects of feedback pacing, the recognition posttest scores of the self- and computer-paced groups were compared with a non-parametric Mann-Whitney test. According to the Mann-Whitney tests, when collapsed across the four conditions, the difference between the self- and computer-paced groups fell short of statistical significance on all recognition tests, producing small effect sizes: the immediate productive, U = 394.00, p = .106, r = .20, immediate receptive, U = 384.00, p = .076, r = .22, delayed productive, U = 389.50, p = .098, r = .21, and delayed receptive, U = 376.50, p = .065, r = .23.
Table 23
Results of Friedman Tests for Recognition Posttest Scores

                      Immediate posttest               Delayed posttest
                    Productive     Receptive         Productive     Receptive
Feedback             χ2     p      χ2     p           χ2     p      χ2     p
Computer-paced      1.76  .624    0.74  .864         0.39  .943    1.34  .721
Self-paced          6.87  .076    4.17  .244         0.70  .874    1.58  .665
Total               4.49  .213    1.10  .777         0.82  .844    1.31  .727

Note. df = 3. n = 32 for each of the computer- and self-paced groups.
The lack of statistical significance on the recognition tests may be partly ascribed to a possible ceiling effect. As Table 20 demonstrates, the average scores were relatively high on all recognition posttests. In order to examine the possibility of a ceiling effect, the frequency of full marks (15) was tabulated in Table 24. The table gives the number of participants who obtained full marks on a given posttest for the items studied in a given condition. For instance, Table 24 shows that on the immediate productive recognition test, 21 (65.63%) out of the 32 participants in the computer-paced group obtained full marks for the items studied in the recognition condition. The table reveals that when collapsed across the learning conditions and the feedback pacing groups, 59.77%, 59.38%, 39.84%, and 51.95% of the scores were full marks on the immediate productive, immediate receptive, delayed productive, and delayed receptive recognition tests, respectively. The large number of full marks might have caused a ceiling effect, contributing to the lack of statistical significance on the recognition tests.
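The collapsed percentages quoted above follow from the frequencies in Table 24: each posttest yields 64 participants x 4 conditions = 256 scores. The Python sketch below reproduces that arithmetic; the function name is an illustrative assumption.

```python
def percent_full_marks(n_full_marks, n_participants=64, n_conditions=4):
    """Percentage of scores that were full marks on one posttest,
    collapsed across learning conditions and feedback pacing groups."""
    return 100 * n_full_marks / (n_participants * n_conditions)


# Total frequencies of full marks from Table 24, one per recognition posttest.
totals = {
    "immediate productive": 153,
    "immediate receptive": 152,
    "delayed productive": 102,
    "delayed receptive": 133,
}
for test, freq in totals.items():
    print(f"{test}: {percent_full_marks(freq):.2f}%")
```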
Table 24
Frequency of Full Marks (15) on the Recognition Posttests

Immediate posttest
                    Immediate productive recognition                    Immediate receptive recognition
Conditions       PC-paced        Self-paced      Total               PC-paced        Self-paced      Total
                 Freq    %       Freq    %       Freq    %           Freq    %       Freq    %       Freq    %
Recognition       21   65.63%     16   50.00%     37   57.81%         22   68.75%     17   53.13%     39   60.94%
Recall            19   59.38%     17   53.13%     36   56.25%         21   65.63%     17   53.13%     38   59.38%
Combined          23   71.88%     18   56.25%     41   64.06%         23   71.88%     13   40.63%     36   56.25%
Difficulty        24   75.00%     15   46.88%     39   60.94%         23   71.88%     16   50.00%     39   60.94%
Total             87   67.97%     66   51.56%    153   59.77%         89   69.53%     63   49.22%    152   59.38%

Delayed posttest
                    Delayed productive recognition                      Delayed receptive recognition
Conditions       PC-paced        Self-paced      Total               PC-paced        Self-paced      Total
                 Freq    %       Freq    %       Freq    %           Freq    %       Freq    %       Freq    %
Recognition       13   40.63%     11   34.38%     24   37.50%         19   59.38%     13   40.63%     32   50.00%
Recall            15   46.88%     12   37.50%     27   42.19%         21   65.63%     12   37.50%     33   51.56%
Combined          16   50.00%      9   28.13%     25   39.06%         21   65.63%     15   46.88%     36   56.25%
Difficulty        13   40.63%     13   40.63%     26   40.63%         18   56.25%     14   43.75%     32   50.00%
Total             57   44.53%     45   35.16%    102   39.84%         79   61.72%     54   42.19%    133   51.95%

Note. n = 32 for each of the computer-paced and self-paced groups. The maximum frequency is 32 for Recognition, Recall, Combined, and Difficulty and 128 for Total. Freq = number of participants who obtained full marks on a given posttest for the items studied in a given condition. % = proportion of participants who obtained full marks on a given posttest for the items studied in a given condition.
Efficiency
As there was a statistically significant difference in study time among the four conditions (Table 18), the efficiency of the four conditions was also compared. Efficiency was defined as the number of words acquired per minute and calculated by dividing the posttest score by the study time (e.g., Kornell, 2009; Mondria, 2003; Pyc & Rawson, 2007). Table 25 provides the efficiency scores in the four conditions. In order to test whether any significant difference existed among the four conditions, the efficiency scores were entered into a three-way mixed design 4 (learning conditions) x 2 (feedback pacing) x 2 (RI) ANOVA. Table 26 shows the results of the ANOVAs.
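The efficiency measure can be illustrated with a short Python sketch. The learner data are hypothetical, the function name is an assumption, and computing efficiency per learner before averaging is inferred from the fact that Table 25 reports SDs rather than stated explicitly in the text.

```python
from statistics import mean


def efficiency(posttest_score, study_minutes):
    """Words acquired per minute: posttest score divided by study time."""
    if study_minutes <= 0:
        raise ValueError("study time must be positive")
    return posttest_score / study_minutes


# Hypothetical (posttest score, study time in minutes) pairs for three learners.
learners = [(9, 12.0), (6, 8.0), (12, 15.0)]
per_learner = [efficiency(score, minutes) for score, minutes in learners]

mean_of_ratios = mean(per_learner)
ratio_of_means = mean(s for s, _ in learners) / mean(m for _, m in learners)
print(round(mean_of_ratios, 3), round(ratio_of_means, 3))  # 0.767 0.771
```

The two summaries differ slightly, which may explain why dividing a mean posttest score in Table 20 by a mean study time in Table 18 does not exactly reproduce the corresponding cell of Table 25.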
Table 25
Efficiency Scores in the Four Conditions

Immediate posttest
                                         Recognition      Recall         Combined       Difficulty
Feedback          Posttests               M     SD       M     SD       M     SD       M     SD
Computer-paced    P recall (strict)      0.71  0.34     0.69  0.31     0.74  0.31     0.76  0.30
(n = 32)          P recall (sensitive)   1.13  0.31     0.90  0.25     0.96  0.27     0.91  0.27
                  P recognition          1.35  0.19     1.07  0.16     1.21  0.15     1.12  0.21
                  R recall               1.13  0.28     0.91  0.25     1.01  0.23     0.99  0.24
                  R recognition          1.35  0.22     1.07  0.16     1.20  0.15     1.13  0.18
Self-paced        P recall (strict)      0.87  0.57     0.79  0.44     0.79  0.51     0.75  0.40
(n = 32)          P recall (sensitive)   1.36  0.72     1.05  0.45     1.14  0.57     1.02  0.39
                  P recognition          1.83  0.56     1.51  0.39     1.59  0.41     1.42  0.42
                  R recall               1.45  0.68     1.08  0.41     1.24  0.52     1.08  0.44
                  R recognition          1.87  0.55     1.45  0.46     1.54  0.44     1.44  0.46
Total             P recall (strict)      0.79  0.47     0.74  0.38     0.77  0.42     0.76  0.35
(n = 64)          P recall (sensitive)   1.24  0.56     0.98  0.37     1.05  0.46     0.96  0.34
                  P recognition          1.59  0.48     1.29  0.37     1.40  0.36     1.27  0.36
                  R recall               1.29  0.54     1.00  0.35     1.13  0.42     1.03  0.35
                  R recognition          1.61  0.49     1.26  0.39     1.37  0.37     1.29  0.38

Delayed posttest
                                         Recognition      Recall         Combined       Difficulty
Feedback          Posttests               M     SD       M     SD       M     SD       M     SD
Computer-paced    P recall (strict)      0.33  0.28     0.32  0.23     0.31  0.30     0.36  0.28
(n = 32)          P recall (sensitive)   0.58  0.38     0.52  0.27     0.53  0.33     0.51  0.31
                  P recognition          1.28  0.26     1.01  0.20     1.14  0.19     1.08  0.17
                  R recall               1.08  0.31     0.89  0.23     0.95  0.26     0.94  0.27
                  R recognition          1.33  0.22     1.05  0.22     1.18  0.18     1.10  0.20
Self-paced        P recall (strict)      0.35  0.37     0.36  0.38     0.34  0.34     0.41  0.32
(n = 32)          P recall (sensitive)   0.71  0.52     0.59  0.44     0.62  0.46     0.63  0.41
                  P recognition          1.70  0.60     1.33  0.37     1.44  0.44     1.33  0.47
                  R recall               1.42  0.63     1.05  0.42     1.20  0.51     1.04  0.45
                  R recognition          1.77  0.58     1.41  0.37     1.51  0.40     1.41  0.44
Total             P recall (strict)      0.34  0.32     0.34  0.31     0.32  0.32     0.38  0.30
(n = 64)          P recall (sensitive)   0.64  0.46     0.55  0.36     0.57  0.40     0.57  0.37
                  P recognition          1.49  0.50     1.17  0.34     1.29  0.37     1.21  0.37
                  R recall               1.25  0.52     0.97  0.35     1.07  0.42     0.99  0.37
                  R recognition          1.55  0.49     1.23  0.36     1.35  0.35     1.25  0.37

Note. P = productive; R = receptive.
160
Table 26 Results of Three-Way ANOVAs for the Efficiency Scores Productive recall: Strict df
F
p
Feedback
1
0.47
.497
RI
1
166.57
2.35
Productive recall: Sensitive
partial
df
F
p
.36
1
2.29
.135
.000
.73
1
173.79
0.79
.476
.01
2.14
1
0.36
.551
.01
Condition X Feedback
2.35
0.80
.470
RI X Condition
2.48
2.28
2.48
3.02
Effects
Condition RI X Feedback
RI X Condition X Feedback
partial
Productive recognition df
F
p
.04
1
22.42
.000
.000
.74
1
42.83
18.84
.000
.23
2.12
1
0.90
.347
.01
.01
2.14
0.75
.481
.093
.04
2.51
13.11
.041
.05
2.51
0.81
2
η
Receptive recall
partial
df
F
p
.27
1
5.00
.029
.000
.41
1
12.57
38.68
.000
.38
2.39
1
5.83
.019
.09
.01
2.12
2.55
.079
.000
.17
2.57
2.25
.473
.01
2.57
0.89
2
η
Receptive recognition partial
partial
df
F
p
.07
1
22.20
.000
.26
.001
.17
1
14.91
.000
.19
30.20
.000
.33
1.82
48.20
.000
.44
1
0.11
.745
.00
1
1.83
.181
.03
.04
2.39
4.71
.007
.07
1.82
2.90
.064
.04
.094
.03
3
0.35
.791
.01
2.37
0.70
.521
.01
.433
.01
3
0.15
.933
.00
2.37
1.08
.351
.02
2
η
2
η
η2
Table 26 indicates the following four things. First, the main effect of RI was significant on all posttests. In other words, the efficiency scores on the delayed posttests were lower than those on the immediate posttests. Second, the ANOVAs detected a significant main effect of feedback pacing on the productive recognition, receptive recall, and receptive recognition tests. The results suggest that self-paced feedback was significantly more efficient than computer-paced feedback on these three posttests. Third, the main effect of condition was significant on all posttests except with strict scoring on the productive recall test. Fourth, the ANOVAs found some significant interactions (Table 26). However, as the effect sizes for the interactions were relatively small (.05 < partial η2 < .17), they will not be discussed here. Due to the significant main effect of condition, contrasts were used to investigate where the significant differences lay. Results of the contrasts are reported in Table 27. The table presents the F values, p values, and effect sizes r for the pair-wise contrasts. Contrasts were not performed for the productive recall test with strict scoring because the main effect of condition was not significant (Table 26).
Table 27 Results of Pair-Wise Contrasts for Efficiency Scores

                               Recognition             Recall                  Combined
Test a / Conditions            F       p      r        F       p      r        F       p      r
P recall test (sensitive)
  Recall                       25.99   .000   .54
  Combined                     26.47   .000   .55      2.99    .089   .21
  Highest difficulty           31.09   .000   .58      0.01    .905   .02      3.83    .055   .24
P recognition test
  Recall                       53.12   .000   .68
  Combined                     42.01   .000   .64      23.04   .000   .52
  Highest difficulty           59.91   .000   .70      0.13    .721   .05      12.28   .001   .41
R recall test
  Recall                       50.89   .000   .67
  Combined                     25.76   .000   .54      16.69   .000   .46
  Highest difficulty           48.85   .000   .66      1.55    .217   .16      7.19    .009   .32
R recognition test
  Recall                       73.27   .000   .74
  Combined                     77.98   .000   .75      22.54   .000   .52
  Highest difficulty           58.36   .000   .70      1.28    .262   .14      8.57    .005   .35

Note. Each row condition is contrasted with the column condition. df = (1, 62). a P = productive; R = receptive.
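The effect sizes r in Table 27 can be related to the F values. The following is a minimal sketch, assuming that r was obtained with the common conversion for single-df contrasts, r = sqrt(F / (F + df_error)); the formula itself is not stated in this section, and the function name is hypothetical. The tabulated values are consistent with this conversion when df_error = 62.

```python
import math


def contrast_effect_size_r(f_value: float, df_error: float) -> float:
    """Convert the F of a single-df contrast to the effect size r.

    Assumed conversion (not stated in the thesis section itself):
        r = sqrt(F / (F + df_error))
    """
    if f_value < 0 or df_error <= 0:
        raise ValueError("F must be non-negative and df_error positive")
    return math.sqrt(f_value / (f_value + df_error))


if __name__ == "__main__":
    # A few F values taken from Table 27, where df = (1, 62).
    examples = {
        "P recall (sensitive): recognition vs. recall": 25.99,
        "P recall (sensitive): combined vs. highest difficulty": 3.83,
        "R recognition: recognition vs. combined": 77.98,
    }
    for label, f_value in examples.items():
        print(f"{label}: r = {contrast_effect_size_r(f_value, 62):.2f}")
```

Rounded to two decimals, the three examples agree with the r values reported in the table (.54, .24, and .75).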
First, let us examine the efficiency scores on the productive recall test with sensitive scoring. On this test, the recognition condition significantly outperformed the other three conditions (Table 27). Furthermore, large effect sizes (.54 < r < .58) were found between the recognition and the other three conditions. The results indicate that the recognition condition was the most efficient among the four. No statistically significant difference was detected among the recall, combined, and highest difficulty conditions. The results are supported by the effect sizes (.02 < r < .24), which are regarded as having no more than small effects (Cohen, 1988). These results suggest the following order with sensitive scoring on the productive recall test: recognition > recall = combined = highest difficulty.
Next, let us consider the efficiency scores on the productive recognition, receptive recall, and receptive recognition tests. The contrasts (Table 27) revealed that on these three posttests, (a) the recognition condition was significantly more efficient than the other three conditions, (b) the combined condition significantly outperformed the recall and highest difficulty conditions, (c) no statistically significant difference existed between the recall and highest difficulty conditions, and (d) medium to large effect sizes (.32 < r < .75) were found for all comparisons except between the recall and highest difficulty conditions (.05 < r < .16). These results suggest the following order on the productive recognition, receptive recall, and receptive recognition tests: recognition > combined > recall = highest difficulty. One caveat, though, is that the efficiency scores on the recognition tests might have been affected by a potential ceiling effect on the recognition test scores (Table 24). The efficiency scores on the recognition tests, hence, may need to be interpreted with caution.
Questionnaire
In the questionnaire given after the immediate posttest, participants were asked to evaluate the effectiveness of the four kinds of retrieval formats (receptive recognition, productive recognition, receptive recall, and productive recall) on a 7-point scale, where 1 means Not helpful at all and 7 means Very helpful (see 3.4.3). Table 28 summarises the participants’ responses. The responses were analysed by a two-way mixed design 4 (retrieval format: receptive recognition / productive recognition / receptive recall / productive recall) x 2 (feedback pacing) ANOVA. The ANOVA showed a significant main effect of condition, F (1.80, 109.97) = 16.96, p < .001, partial η2 = .22. Neither the main effect of feedback pacing, F (1, 61) = 1.86, p = .138, partial η2 = .03, nor the interaction between the learning condition and feedback pacing, F (1.80, 109.97) = 1.86, p = .164, partial η2 = .03, was significant.
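The partial η2 values reported for this ANOVA can be recovered from the F ratios and their degrees of freedom. The sketch below assumes the standard identity partial η2 = (F x df_effect) / (F x df_effect + df_error); the function name is hypothetical, and the thesis does not show its own computation in this section.

```python
def partial_eta_squared(f_value: float, df_effect: float, df_error: float) -> float:
    """Recover partial eta squared from an F ratio and its degrees of freedom.

    Standard identity: eta_p^2 = (F * df_effect) / (F * df_effect + df_error).
    Fractional df (e.g. after a sphericity correction) are accepted.
    """
    if f_value < 0 or df_effect <= 0 or df_error <= 0:
        raise ValueError("F must be non-negative and both df positive")
    explained = f_value * df_effect
    return explained / (explained + df_error)


if __name__ == "__main__":
    # F ratios and df as reported for the questionnaire ANOVA.
    reported = [
        ("retrieval format", 16.96, 1.80, 109.97),
        ("feedback pacing", 1.86, 1.0, 61.0),
        ("format x pacing", 1.86, 1.80, 109.97),
    ]
    for label, f_value, df_effect, df_error in reported:
        eta = partial_eta_squared(f_value, df_effect, df_error)
        print(f"{label}: partial eta squared = {eta:.2f}")
```

Rounded to two decimals, these reproduce the reported values of .22, .03, and .03.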
Table 28 Participants’ Evaluation of Four Retrieval Formats

                            Retrieval format a
                            R recognition    P recognition    R recall         P recall
Feedback                    M      SD        M      SD        M      SD        M      SD
Computer-paced (n = 31)     5.65   1.47      5.87   1.43      4.77   1.38      4.71   1.74
Self-paced (n = 32)         5.59   1.07      5.69   1.31      4.78   1.45      3.66   1.94
Total (n = 63)              5.62   1.28      5.78   1.36      4.78   1.41      4.17   1.91

Note. The maximum score is 7, where 1 means Not helpful at all and 7 means Very helpful. One participant in the computer-paced group did not provide responses. a P = productive; R = receptive.
Due to the significant main effect of condition, contrasts were performed to investigate where the significant differences lay when collapsed across the two feedback pacing groups. The contrasts showed no statistically significant difference between receptive recognition and productive recognition, F (1, 61) = 0.96, p = .331, yielding a small effect size (r = .12). The differences were statistically significant for all other comparisons, and medium to large effect sizes were found, F (1, 61) > 9.85, p < .003, .37 < r < .56. These results suggest that the participants perceived the four types of retrieval formats to be effective in the following order: receptive recognition = productive recognition > receptive recall > productive recall.
3.6 Discussion
Effects of recall and recognition
The first research question in this study asked whether recall increases L2 vocabulary learning more than recognition. The current study demonstrated only a limited advantage of recall. The recall condition led to significantly higher scores than the recognition condition on the productive recall test. However, no significant difference was detected between recognition and recall on the other three posttests. Why did the present study find only a limited superiority of recall? First, let us consider the scores on the receptive recall posttest. On this test, no statistically significant difference existed between the recall and recognition conditions. Although the result is consistent with Van Bussel (1994) and Clariana and Lee (2001), it is at odds with Bjork and Whitten (1974) and Carpenter and DeLosh (2006), who found the benefits of recall over recognition on a recall posttest. The incongruent results may stem from two differences in the experimental procedure. First, the studies differ in the retrieval frequency. In Bjork and Whitten (1974) and Carpenter and DeLosh (2006), the target items were practised in a recall or recognition format only once. In Van Bussel (1994) and the present study, in contrast, the retrieval frequency was greater than one. As the earlier studies demonstrate (Bjork & Whitten, 1974; Carpenter & DeLosh, 2006), recall may be more effective than recognition after one retrieval attempt possibly due to the positive effects of desirable difficulties, retrieval effort, and generation (see 3.1.2). Yet, the superiority of recall might diminish and eventually disappear as the number of retrievals increases. This could be partially responsible for the inconsistent results.
Second, the conflicting results might also be due in part to the absence or presence of feedback. In Van Bussel (1994), Clariana and Lee (2001), and the current study, feedback was given during retrieval practice, whereas Bjork and Whitten (1974) and Carpenter and DeLosh (2006) did not provide feedback. The results suggest that although recall may outperform recognition when feedback is not given (Bjork & Whitten, 1974; Carpenter & DeLosh, 2006), the advantage of recall may not be found when the treatment involves feedback (Clariana & Lee, 2001; Van Bussel, 1994). This may be because the beneficial effects of recall might be offset by the effects of feedback, overshadowing differences in the retrieval format. One previous study on reading (Kang et al., 2007) also demonstrated that the effects of recall and recognition might be conditional upon the absence or presence of feedback. Kang et al.’s finding may apply not only to the retention of reading materials but also to the learning of vocabulary.
Next, let us turn our attention to the productive recall posttest scores. The recall condition resulted in significantly higher scores than the recognition condition on this test. Why was the advantage of recall found not on the receptive but on the productive recall posttest? One possible explanation may be that the productive recall test required precise knowledge of orthography, unlike the receptive recall test. TAP theory (Bransford et al., 1979; Morris et al., 1977) predicts that the best way to learn the spelling of L2 words would be to practise retrieval of the word forms. As a result, on the productive recall test, the recall condition that involved productive recall might have fared better than recognition. As noted above, due to the differences in the experimental procedure (i.e., repeated retrieval and provision of feedback), the advantage of recall was probably smaller in this study compared with the previous vocabulary studies (Bjork & Whitten, 1974; Carpenter & DeLosh, 2006). However, because of TAP, the superiority of recall on the productive recall test might not have diminished to the extent that it would completely disappear. This may be in part the reason why recall was significantly more effective than recognition on the productive recall test, but not on the receptive recall test. Alternatively, the results may be partially due to the test order. At each test administration, the productive recall and recognition posttests were given prior to the receptive recall test (see 3.4.6). As correct responses in the latter were used as cues in the former, the productive tests might have affected performance on the receptive test, possibly diminishing a potential difference between the recall and recognition conditions on the receptive recall posttest.
Let us now consider the results on the recognition posttests. The recall condition failed to outperform the recognition condition on either the receptive or productive recognition test. These results are at odds with the existing studies on reading (Foos & Fisher, 1988; Glover, 1989; Kang et al., 2007; McDaniel et al., 2007), but are congruent with Carpenter and DeLosh (2006) and Van Bussel (1994), who looked into the learning of L1 and L2 vocabulary, respectively. The findings seem to suggest that when learning is measured by a recognition test, recall may enhance the retention of reading materials, but not of vocabulary. Another explanation may be a possible ceiling effect on the recognition tests. On the recognition tests in the present study, nearly or more than half of the scores were full marks (Table 24). The large number of full marks might have been partly responsible for the lack of statistical significance between recall and recognition.
Effects of the combined treatment
The second research question in this study asked whether a combination of recognition and recall (the combined treatment) increases learning more than either one alone. The retrieval practice effect (e.g., Baddeley, 1997, p. 112; Ellis, 1995) and retrieval effort hypothesis (e.g., Pyc & Rawson, 2009) predict that the combined treatment may maximise learning (see 3.1.2). The present study, however, demonstrated only limited benefits of the combined treatment. Although the combined condition was significantly more effective than the recognition condition with strict scoring on the productive recall test, there was no difference between it and the recognition or recall conditions on the other posttests. In addition, the combined condition was less effective than the recall condition on the productive recall test. These results contradict the assumption that gradually increasing retrieval effort may enhance learning (Finley et al., 2011; Karpicke & Roediger, 2010; Logan & Balota, 2008; Pyc & Rawson, 2009). The findings are also inconsistent with Clariana and Lee (2001), who found that a combined treatment fared better than recognition.
Why was the combined treatment not very effective in this study? One possible cause may be the relatively low retrieval success during the learning phase. As described earlier (see 3.1.2), an advantage of the combined treatment may lie in its ability to increase retrieval effort while maintaining retrieval success. In this study, however, the combined treatment did not necessarily produce high retrieval success. For instance, the number of correct responses for the productive recall format during learning was 5.86 out of 15 in the combined condition, whereas it was 6.81 and 6.98 in the recall and highest difficulty conditions, respectively (Table 19, fourth retrieval). Because the rate of retrieval success was not particularly high, participants perhaps could not benefit from the positive effects of retrieval success in the combined condition. The combined treatment in this study, thus, might not have proved very effective.
Compared with Clariana and Lee (2001), the current study found only limited benefits of the combined treatment. There may be two explanations for the contradictory results. First, in Clariana and Lee, the recognition and combined treatments were not controlled for the retrieval frequency: While target items were practised twice in the combined group, there was only one retrieval attempt in the recognition group (see 3.1.2). In this study, in contrast, all four learning conditions had the same number of retrieval attempts (four). This may partially be the reason why the combined treatment in Clariana and Lee fared better than in this study. Second, the contradictory results may be ascribed in part to a difference in learning phase performance. The combined treatment in Clariana and Lee led to a relatively high retrieval success rate (95%) during learning. As a result, in their study, the combined treatment might have been able to increase retrieval effort while facilitating successful retrieval. In the present study, in contrast, the combined treatment did not necessarily produce high retrieval success (Table 19), and the participants might not have been able to benefit from the positive effects of retrieval success. This may also be responsible for the incongruent results between Clariana and Lee and this study.
In the current study, the combined condition was less effective than the recall condition on the productive recall test (p = .009, r = .32 with strict and p = .054, r = .24 with sensitive scoring). This is probably because the recall condition involved more productive recall questions per target word (two) than the combined condition (one). As productive recall is rather demanding (Laufer et al., 2004; Laufer & Goldstein, 2004), practising L2 words in a productive recall format once perhaps did not guarantee successful performance on the productive recall posttest. Consequently, the combined treatment might have proved less effective than recall on the productive recall test.
Effects of the highest difficulty treatment
The third research question in this study asked whether the highest difficulty treatment, which consists solely of productive recall, is effective. There exist conflicting views about the effectiveness of the highest difficulty treatment. First, as productive recall tends to produce a low rate of retrieval success, the highest difficulty treatment may not be effective based on the retrieval practice effect (e.g., Baddeley, 1997, p. 112; Ellis, 1995). Second, TAP theory (Bransford et al., 1979; Morris et al., 1977) predicts that the highest difficulty treatment may not enhance the acquisition of receptive vocabulary knowledge because it consists only of productive retrieval. In contrast, the desirable difficulty framework (Bjork, 1994, 1999; Schmidt & Bjork, 1992) suggests that the highest difficulty treatment may be effective (see 3.1.2). The present study indicated that the highest difficulty treatment may enhance learning. The highest difficulty condition resulted in significantly higher scores than the recognition and combined conditions with strict scoring on the productive recall test. The condition was also as effective as the other three on all other posttests including receptive tests. These findings are consistent with the desirable difficulty framework, but contradict the predictions of the retrieval practice effect and TAP theory. Why did the highest difficulty treatment turn out to be more effective than some theories had predicted?
First, the retrieval practice effect suggests that the highest difficulty treatment may not be very effective because it may produce low retrieval success, and learners may not be able to benefit from the positive effects of retrieval success (see 3.1.2). This prediction was not supported as the highest difficulty treatment proved effective despite the largest number of retrieval failures during the learning phase (Table 19). The results may be partially accounted for by the desirable difficulty framework, which states that a treatment that induces a large number of errors during learning can be beneficial over time. The highest difficulty treatment proved effective possibly because the positive effects of desirable difficulties outweighed the negative effects of retrieval failures (e.g., Bjork, 1994, 1999; Karpicke & Bauernschmidt, 2011; Pashler et al., 2003; Schmidt & Bjork, 1992; Schneider et al., 2002).
Second, according to TAP theory, the highest difficulty treatment may not enhance the acquisition of receptive vocabulary knowledge because it consists only of productive retrieval, whereas the other three treatments involve both receptive and productive retrieval (see 3.1.2). Previous studies on receptive and productive learning (Griffin & Harley, 1996; Mondria & Wiersma, 2004; Steinel et al., 2007; Stoddard, 1929; Waring, 1997a; S. A. Webb, 2009a, 2009b) may also support this prediction. The highest difficulty condition, nonetheless, turned out to be as effective as the other three conditions even on the receptive posttests. The results may also be in part accounted for by the desirable difficulty framework, according to which a difficult learning condition facilitates transfer (see 3.1.1). Because the highest difficulty condition was the most demanding among the four conditions, it might have resulted in a kind of knowledge that was transferable to novel environments. The highest difficulty treatment, thus, might have enhanced the acquisition of not only productive but also receptive knowledge. Pedagogically, the results contradict the view that L2 words should be practised both receptively and productively (Griffin & Harley, 1996; Mondria & Wiersma, 2004; Nation & Webb, 2011, p. 41; Nation, 2001, p. 306; S. A. Webb, 2002, 2009b) and imply that productive retrieval may be sufficient as long as the treatment introduces difficulty for the learner. Alternatively, the results may be partially due to the test order. Because the productive recall and recognition posttests were given prior to the receptive recall and recognition tests (see 2.2.9), the former might have affected performance on the latter, possibly diminishing a potential difference in the receptive test scores. Future research may administer fewer posttests per participant in order to reduce potential learning effects from posttests.
Effects of feedback pacing
Two types of feedback were used in this study: computer-paced and self-paced. In the self-paced group, learners spent significantly more time on feedback after recall questions than after recognition questions. This is probably because recall produced more retrieval failures than recognition (Table 19), encouraging participants to study feedback more carefully. In the self-paced group, a statistically significant difference was also detected among the four learning conditions in the feedback duration: The highest difficulty condition led to the longest feedback duration followed by the recall, combined, and recognition conditions. Despite the difference in the feedback duration, the present study did not find any significant interaction between learning condition and feedback pacing on the posttest scores (Table 21). In other words, while some conditions resulted in a longer feedback duration than others, the effects of feedback pacing on the posttest scores were constant across all four conditions.
Although it was not the original purpose, the current study also allowed us to compare the effectiveness of computer- and self-paced feedback. The results showed that (a) computer-paced feedback increased the feedback duration by 2.70 seconds per response compared with self-paced feedback, and (b) computer-paced feedback may improve both learning phase and posttest performance. The findings may indicate that when feedback is self-paced, learners tend to close the feedback window too hastily, leading to under-learning. The results seem congruent with the finding that learners may sometimes overestimate their memory abilities and stop studying before lexical items are actually learnt (Karpicke, 2009; Kornell & Bjork, 2008).
The advantage of computer-paced feedback was smaller on the recognition posttests compared with the recall posttests probably because the scores were near the ceiling on the recognition tests. When collapsed across all four conditions, in the self-paced group, 51.56%, 49.22%, 35.16%, and 42.19% of the scores were full marks on the immediate productive, immediate receptive, delayed productive, and delayed receptive recognition tests, respectively (Table 24). A large number of full marks suggests that there was perhaps only little room for the computer-paced group to outperform the self-paced group. Consequently, the difference between the two types of feedback might not have reached significance on either the productive (p = .106 for the immediate and p = .098 for the delayed tests) or receptive recognition test (p = .076 for the immediate and p = .065 for the delayed tests).
Pedagogical implications
On the productive recall posttest, the recall and highest difficulty conditions fared better than the recognition and combined conditions. The former were effective on the productive recall test probably because they involved more productive recall questions (two in the recall and four in the highest difficulty conditions per target word) than the latter (zero in the recognition and one in the combined conditions). A pedagogical implication of these results may be that for the acquisition of knowledge of orthography, L2 words may need to be practised in a productive recall format at least twice. Of course, the retrieval frequency needed for acquisition to occur may be affected by a number of factors such as the difficulty of target words or learners’ memory capacity, which awaits future research. The recall and highest difficulty conditions were found to be effective although they decreased learning phase performance. Based on the findings, it may be useful to raise awareness that making mistakes during learning is not necessarily a sign of ineffective learning (e.g., Bjork, 1994, 1999; Ellis, 1995; Pashler et al., 2003; see 2.3.7).
This study also indicated that for the acquisition of form-meaning connections, recognition formats may be more desirable than recall for at least three reasons. First, the recognition condition was as effective as the other three conditions on all posttests except the productive recall test. The results suggest that recognition formats alone may be sufficient for learning the form-meaning connection. Second, the recognition condition required the least study time and was the most efficient on all posttests except with strict scoring on the productive recall test. Recognition, hence, may be more effective than recall in terms of efficiency. Third, the recognition condition produced significantly more correct responses during retrieval practice than the other three conditions. As incorrect responses during learning may potentially demotivate learners (Fritz, 2010; Logan & Balota, 2008; Mondria & Mondria-de Vries, 1994), recognition formats may be more motivating. The results of the questionnaire, which showed that participants considered recognition to be more effective than recall, may also indicate that recognition may have a positive effect on learners’ motivation. These three findings suggest that if knowledge of spelling is not required, recognition may be more desirable than recall.
Although the present study demonstrated the value of the recognition formats for flashcard learning, existing flashcard programs seem to offer limited capabilities regarding recognition. Among nine programs examined by Nakata (2011), for instance, only five fully support recognition formats. The limited support for the recognition formats in existing flashcard programs may possibly stem from the fact that recall formats mirror paper-based flashcard learning more than recognition formats. Alternatively, it may be partly ascribed to apparent misconceptions about the value of recognition for learning. For instance, the developers of LearnThatWord, a web-based flashcard program, argue that their software does not offer recognition questions because learners sometimes confuse a distractor with the correct response and may learn incorrect information from recognition questions (eSpindle Learning, 2013). However, this argument may not necessarily have empirical support because research shows that the risk of learning incorrect information from multiple-choice questions is not very high (Marsh et al., 2007, 2009; Roediger et al., 2010, 2011). The developers of LearnThatWord go on to claim that ‘Multiple-choice questions are not a teaching tool at all’ and ‘To use multiple-choice questions in instructional materials is problematic!’ (eSpindle Learning, 2013). The present study, however, did not support such observations. Based on the results of the present and earlier studies (Marsh et al., 2007, 2009; Roediger et al., 2010, 2011), it may be useful to reconsider the value of recognition for vocabulary learning when designing flashcard software or vocabulary exercises.
The present study also suggested that computer-paced feedback might be more effective than self-paced feedback although it increased the feedback duration by 2.70 seconds per response. Based on the results of this study, it may be useful to reconsider the value of computer-paced feedback for flashcard learning. The results may have important pedagogical implications because self-paced feedback seems to be a common feature among existing flashcard programs (Nakata, 2011; see 4.2.3). As the current study suggested that learners may overestimate their memory capacities and close the feedback window too hastily, self-paced feedback may not be advisable. Instead, it might be valuable if flashcard software encouraged learners to spend more time on feedback by presenting feedback for a minimum of 5 seconds, for instance. However, as only small effect sizes were observed for the feedback pacing, further empirical studies are warranted.
Next, let us consider the implications of this study for paper-based flashcard learning. First, the present study showed that (a) recall may be more effective than recognition for the acquisition of knowledge of orthography, and (b) recall may be as effective as recognition for the acquisition of form-meaning connections. The findings may translate well to paper-based flashcard learning, where recall formats may be easier to implement than recognition. Second, the comparison of computer- and self-paced feedback suggested that increasing the feedback duration by 2.70 seconds per response may increase learning. The result may have important pedagogical implications for paper-based flashcard learning, where feedback is usually paced by learners. Based on the results of this study, it may be useful to raise awareness that spending several extra seconds viewing the correct response may potentially increase learning.
3.7 Limitations
One limitation of the present study may be a possible ceiling effect on the recognition posttests. On the recognition tests, nearly or more than half of the scores were full marks (Table 24). The large number of full marks may be partly responsible for the lack of statistical significance on the recognition tests. A potential ceiling effect might have also affected the efficiency scores on the recognition posttests. Based on the results of the pilot studies, several changes were made to the methodology in order to reduce the probability of a ceiling effect (see 3.3). Yet, apparently, these changes proved insufficient. A ceiling effect might have arisen because four kinds of posttests (productive recall, productive recognition, receptive recall, and receptive recognition) were given to each participant, potentially causing earlier tests to affect performance on later tests. Future research may administer fewer posttests per participant in order to reduce potential learning effects from posttests.
Another limitation may be the lack of prior knowledge of the participants. The participants in this study had no prior knowledge of Swahili, the target language. The findings of this study, therefore, may not necessarily be applicable to more advanced learners. It may be useful to replicate this study with higher proficiency learners. A third limitation may be the rather short duration of the treatment. In the present study, participants studied 60 word pairs in less than 45 minutes. Furthermore, the study opportunities were massed into a single treatment session. In a real-life study situation, however, study opportunities tend to be distributed over multiple sessions (Cepeda et al., 2008). It may be valuable to investigate the effects of recall and recognition over a longer period of time.
Chapter 4. STUDY 3: ABSOLUTE AND RELATIVE SPACING
Research suggests that the distribution of practice may have a large effect on learning (e.g., Cepeda et al., 2006, 2009; Dempster, 1989; Janiszewski et al., 2003). Given that introducing spacing between encounters increases learning (spacing effect; see 4.1.1), how should we space learning opportunities in order to optimise flashcard learning? The above question may be rephrased as follows: What kinds of relative and absolute spacing should be used to maximise learning? Relative spacing refers to how study opportunities are distributed relative to one another (Karpicke & Bauernschmidt, 2011). Examples of relative spacing schedules include equal, expanding, and contracting spacing (see below for the definitions). Absolute spacing refers to the total amount of spacing that separates all repetitions of a given item (Karpicke & Bauernschmidt, 2011). For instance, if a given item is encountered four times, and there is an ISI (inter-stimulus interval; see 2.1.2) of 2 minutes between each encounter, absolute spacing is 6 minutes (2 minutes x 3).
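The arithmetic behind absolute spacing can be written out as a small sketch. It assumes only the definition given above (absolute spacing is the sum of the ISIs separating the repetitions of an item); the function name is hypothetical.

```python
from typing import Sequence


def absolute_spacing(isis: Sequence[float]) -> float:
    """Return the total spacing separating all repetitions of one item.

    `isis` holds the inter-stimulus intervals between consecutive
    encounters, so an item seen n times has n - 1 intervals.
    """
    if any(isi < 0 for isi in isis):
        raise ValueError("intervals must be non-negative")
    return sum(isis)


if __name__ == "__main__":
    # Example from the text: four encounters with an ISI of 2 minutes.
    encounters = 4
    isi_minutes = 2
    intervals = [isi_minutes] * (encounters - 1)
    print(absolute_spacing(intervals))  # 2 minutes x 3 = 6 minutes
```

Note that four encounters yield three intervals, which is why the total in the example is 6 minutes rather than 8.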
Previous studies have looked into the effects of three types of relative spacing: equal, expanding, and contracting (e.g., Gay, 1973; Gerbier & Koenig, 2012; Karpicke & Bauernschmidt, 2011; Landauer & Bjork, 1978; Tsai, 1927). In equal spacing, the intervals between encounters of a given item are held constant. In expanding spacing (also known as expanding or expanded rehearsal; Ellis, 1995), the intervals between encounters are gradually increased. In contracting spacing, the intervals between encounters are gradually decreased. Examples of equal spacing include the 3-3-3,
4-4-4, and 5-5-5 schedules. Numbers indicate an ISI between repetitions of a given item (Figure 13). For instance, in the 3-3-3 schedule, encounters of a given item are always separated by three spacing units (trial or time; see below for details). Examples of expanding spacing include the 1-3-5, 1-3-8, and 1-5-9 schedules. In the 1-3-5 schedule, the first and second encounters of a given item are separated by one spacing unit, the second and third encounters are separated by three units, and the third and fourth encounters are separated by five units (Figure 13). Examples of contracting spacing include the 5-3-1, 8-3-1, and 9-5-1 schedules.
Note. A denotes a given target item, and _ denotes an intervening spacing unit (trial or time). Figure 13. Examples of equal, expanding, and contracting spacing.
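The schedule notation described above can be made concrete with a minimal sketch. The helper name `render_schedule` is hypothetical and is not taken from the thesis; it simply follows the convention of the note to Figure 13, where A denotes a target item and _ denotes one intervening spacing unit.

```python
def render_schedule(intervals, item="A", gap="_"):
    """Draw one encounter per item symbol and one gap symbol per spacing unit.

    `intervals` lists the ISIs between successive encounters,
    e.g. (1, 3, 5) for the 1-3-5 expanding schedule.
    """
    parts = [item]
    for interval in intervals:
        parts.append(gap * interval)  # intervening trials or time units
        parts.append(item)            # next encounter of the same item
    return "".join(parts)


# Example schedules named in the text (illustrative rendering only).
schedules = {
    "equal 3-3-3": (3, 3, 3),
    "expanding 1-3-5": (1, 3, 5),
    "contracting 5-3-1": (5, 3, 1),
}

for name, intervals in schedules.items():
    print(f"{name:<18} {render_schedule(intervals)}")
```

Each schedule with three intervals yields four encounters, which matches the description of the 1-3-5 schedule above.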
Most existing studies use the number of intervening trials as a unit of spacing (e.g., Balota et al., 2006; Carpenter & DeLosh, 2005; Cull et al., 1996, Experiments 1-3; Karpicke & Bauernschmidt, 2011; Karpicke & Roediger, 2007a; Landauer & Bjork, 1978; Logan & Balota, 2008; Pyc & Rawson, 2007). In these studies, the 3-3-3 schedule refers to a schedule where encounters of a given item are always separated by three trials for other items. Other studies use time as a spacing unit (e.g., Cull et al., 1996, Experiments 4 & 5; Dobson, 2011; Kang, Lindsey, Mozer, & Pashler, 2013; Storm et al., 2010). In the 3-3-3 schedule in these studies, encounters of a given item
are separated by 3 minutes or days.
Although many applied linguists as well as psychologists regard expanding spacing as the most effective relative spacing schedule (Baddeley, 1997, pp. 112–114; Balota, Duchek, & Logan, 2007; Bjork, 1994; Cepeda et al., 2006; Ellis, 1995; Hulstijn, 2001; Karpicke & Bauernschmidt, 2011; Karpicke & Roediger, 2007a; Mondria & Mondria-de Vries, 1994; Nation, 2001, pp. 76–77; Pimsleur, 1967; Roediger & Karpicke, 2010; Schmitt & Schmitt, 1995; Schmitt, 2000, 2007), empirical studies have yielded mixed results regarding its effects. While some studies failed to find any advantage of expanding over equal or contracting spacing (e.g., Carpenter & DeLosh, 2005; Cull, 2000; Karpicke & Bauernschmidt, 2011; Karpicke & Roediger, 2010; Pyc & Rawson, 2007), other studies suggest that expanding spacing may be more effective when (a) the task difficulty is high (Logan & Balota, 2008; Storm et al., 2010), (b) feedback is not provided (e.g., Balota et al., 2007; Cepeda et al., 2006; Cull et al., 1996; Storm et al., 2010), (c) the treatment involves long absolute spacing (Dobson, 2012; Maddox, Balota, Coane, & Duchek, 2011), and / or (d) the RI (retention interval) is shorter than 24 hours (e.g., Balota et al., 2007; Karpicke & Roediger, 2007a; Logan & Balota, 2008; Roediger & Karpicke, 2010; Storm et al., 2010; see 4.2.2). Due to the inconsistent findings, however, it is not clear whether expanding spacing increases L2 vocabulary learning. As for absolute spacing, previous cognitive psychology studies suggest that the optimal ISI may be determined relative to the RI (e.g., Bird, 2010; Cepeda et al., 2006, 2008, 2009; Pashler et al., 2007; Rohrer &
Pashler, 2007; see 4.1.2). Yet, a possible relationship between the ISI, RI, and retention has not been explored thoroughly in the L2 vocabulary acquisition literature. Thus, it is unclear whether the optimal ISI may be determined as a function of the RI for L2 vocabulary learning.
Study 3 attempted to identify the optimal absolute and relative spacing schedules for L2 vocabulary learning. Participants were assigned to one of the four absolute spacing groups: massed, short, medium, and long spacing. In the massed group, target items were repeated without any spacing. The short, medium, and long spacing groups used absolute spacing of 15, 30, and 90 intervening trials, respectively. Relative spacing was manipulated within participants, and equal and expanding spacing were compared. Findings of this study may allow us to determine what kinds of absolute and relative spacing schedules should be used in order to optimise flashcard learning.
It may be useful here to point out a distinction between two types of spacing: between-session spacing and within-session spacing (e.g., Kang et al., 2013; Kornell, 2009). The former refers to the amount of spacing between separate study sessions. For instance, suppose learners decided to study a given set of L2 words on three separate occasions. Would it be more effective to study them every 3 days or 10 days? This question is concerned with the optimal between-session spacing schedule. Within-session spacing, in contrast, refers to the amount of spacing between encounters of a given item in a single study session. For instance, suppose learners
decided to study a given set of L2 words in three study sessions. In each session, should encounters of a given item be separated by 3 minutes or 10 minutes? This question is concerned with the optimal within-session spacing schedule. The present study examines within-session spacing rather than between-session spacing.
4.1 Effects of Absolute Spacing
4.1.1 The spacing effect and the lag effect
There are two robust phenomena regarding the effects of absolute spacing. First, research shows that spaced learning, which involves spacing between repetitions of a given item, yields superior retention to massed learning, which does not involve any spacing (see Cepeda et al., 2006; Dempster, 1989, 1996; Janiszewski et al., 2003, for meta-analyses). This phenomenon is known as the spacing effect. Second, it has also been demonstrated that larger absolute spacing (e.g., an ISI of 10 days) generally leads to better long-term retention than shorter absolute spacing (e.g., an ISI of 1 day). This phenomenon is referred to as the lag effect (e.g., Bird, 2010; Cepeda et al., 2006, 2008, 2009; Pashler et al., 2007; Rohrer & Pashler, 2007). Both the spacing effect and the lag effect, which are collectively referred to as the distributed practice effect (Cepeda et al., 2006), have been found to affect L2 vocabulary acquisition (e.g., Bahrick et al., 1993; Bahrick & Phelps, 1987; Bloom & Shuell, 1981; Karpicke & Bauernschmidt, 2011; Pashler et al., 2003).
Although the spacing effect has been observed regardless of the RI (Cepeda et al., 2006), the lag effect is found to be sensitive to changes in the RI. That is, although a long ISI tends to be effective at a long RI, a short ISI is found to be effective at a short RI (e.g., Balota et al., 1989; Bird, 2010; Cepeda et al., 2006, 2008, 2009; Glenberg, 1976; Glenberg & Lehmann, 1980; Pashler et al., 2007; Rohrer & Pashler, 2007). This phenomenon is known as the ISI-RI interaction. One implication of the ISI-RI interaction may be that for a given RI, although increasing absolute spacing initially increases retention, it may decrease retention when the ISI becomes too long. (We will consider when the ISI becomes too long later in this section.) The ISI-RI interaction has led researchers to examine what kind of relationship exists between the ISI and memory performance. Previous studies suggest an inverted U-shaped relation between the ISI and retention, which may be summarised by the following three findings (e.g., Bird, 2010; Cepeda et al., 2006, 2008, 2009; Pashler et al., 2007; Rohrer & Pashler, 2007):
(1) Learning is optimal when the ISI falls on a particular point (hereafter, referred to as the optimal ISI).
(2) When the ISIs are shorter than the optimal ISI, increasing spacing increases retention sharply.
(3) When the ISIs are longer than the optimal ISI, increasing spacing decreases retention gradually.
Figure 14 is intended to illustrate a possible relationship between the ISI and retention
following the above three findings. The data presented here are adapted from Cepeda et al. (2009), but are based on a hypothetical experiment for illustrative purposes.
[Figure 14: plot of retention (vertical axis, 0% to 90%) against ISI in days (horizontal axis, 5 to 30), with three annotations. Finding 1: When the ISI falls on the optimal ISI, retention is highest. Finding 2: When the ISIs are shorter than the optimal ISI, increasing spacing increases retention sharply. Finding 3: When the ISIs are longer than the optimal ISI, increasing spacing decreases retention gradually.]
Figure 14. Inverted U-shaped relationship between the ISI and retention in a hypothetical experiment (adapted from Cepeda et al., 2009, p. 242).
Suppose that in this hypothetical experiment, the optimal ISI is 15 days. Hence, in Figure 14, retention is highest for the 15-day ISI (Finding 1). When the ISIs are shorter than the optimal ISI, increasing spacing may increase retention (Finding 2). An ISI of 10 days, therefore, may lead to superior retention to an ISI of 5 days as illustrated in Figure 14. In contrast, when the ISIs are longer than the optimal ISI, increasing spacing tends to decrease retention (Finding 3). Thus, a 30-day ISI may be less effective than a 20-day ISI. Note that the left-hand portion of the curve in Figure
14 is steeper than the right-hand portion. This is because although retention may increase sharply before the ISI reaches the optimal ISI (Finding 2), it is found to decrease only gradually beyond the optimal ISI (Finding 3). In other words, although a 10-day ISI may be significantly more effective than a 5-day ISI, for instance, there may be only a small difference between the effects of 20- and 30-day ISIs.
4.1.2 Empirical evidence: Relationship among the ISI, RI, and retention
One may wonder how we can identify the optimal ISI (Figure 14, Finding 1). Some studies hypothesise that the optimal ISI may be determined as a function of the RI. Cepeda et al. (2006), for instance, conducted a meta-analysis of 317 experiments and estimated the optimal ISI to be around 15% of the RI. In other words, if the posttest is given 20 days after the treatment, a 3-day ISI (20 days x 15%) may yield the highest retention, whereas for a 200-day RI, a 30-day ISI (200 days x 15%) may be the most effective. Note that these predictions are congruent with the ISI-RI interaction, according to which short spacing tends to be effective at a short RI, whereas long spacing tends to be effective at a long RI (see 4.1.1). In Bird (2010), 38 Malay learners of English studied English tenses (simple past, present perfect, and past perfect) under 3- and 14-day ISI conditions. Two posttests were administered: 7 and 60 days after the treatment. This study suggested that the optimal ISI might fall around 23% of the RI (14-day ISI / 60-day RI). In Cepeda et al. (2009, Experiment 2), 233 American undergraduate students studied 23 trivia facts (e.g., ‘Who invented snow golf?’ Ans. ‘Rudyard Kipling,’ p. 241) and 23 object names. Their study
indicated that the optimal ISI might fall around 16.7% of the RI (28-day ISI / 168-day RI). In Cepeda et al. (2008), 1,350 participants studied 32 trivia facts. Based on their findings, Cepeda et al. estimate that the ISIs of 3, 8, 12, and 27 days, which correspond to 42.9%, 22.9%, 17.1%, and 7.7% of the RIs, may be the most effective for 7-, 35-, 70-, and 350-day RIs, respectively. Their results seem to suggest an inverse relationship between the optimal ISI/RI ratio and RI. In other words, for a short RI (7 days), the optimal ISI/RI ratio may be high (42.9%), whereas for a long RI (350 days), the ratio may be low (7.7%). Although the optimal ISI/RI ratio may depend on various factors such as the RI (Cepeda et al., 2008), task complexity (Bird, 2010; Donovan & Radosevich, 1999), or type of posttest (Cepeda et al., 2008), previous studies suggest that it may fall between 10% and 30% of the RI (Bird, 2010; Cepeda et al., 2006, 2008, 2009; Pashler et al., 2007; Rohrer & Pashler, 2007). Hereafter, 10-30% of the RI is referred to as the optimal ISI range.
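The ISI/RI percentages quoted above follow from simple division, which the short sketch below reproduces. The function name `isi_ri_ratio` and the constant `CEPEDA_2008` are hypothetical labels introduced only for illustration; the ISI and RI values are those reported in the text.

```python
def isi_ri_ratio(isi_days: float, ri_days: float) -> float:
    """Return the ISI expressed as a percentage of the RI."""
    return 100.0 * isi_days / ri_days


# (RI in days, estimated optimal ISI in days), as cited above
# for Cepeda et al. (2008).
CEPEDA_2008 = [(7, 3), (35, 8), (70, 12), (350, 27)]

for ri, isi in CEPEDA_2008:
    ratio = isi_ri_ratio(isi, ri)
    print(f"RI {ri:>3} days, optimal ISI {isi:>2} days: {ratio:.1f}% of the RI")

# Single-ratio cases mentioned in the text.
print(f"Bird (2010), 14-day ISI / 60-day RI: {isi_ri_ratio(14, 60):.1f}%")
print(f"Cepeda et al. (2009, Exp. 2), 28-day ISI / 168-day RI: "
      f"{isi_ri_ratio(28, 168):.1f}%")
```

The falling percentages across the four RIs illustrate the inverse relationship between the optimal ISI/RI ratio and the RI noted above.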
While numerous attempts have been made to identify the optimal ISI/RI ratio in the field of cognitive psychology (see above), a possible relationship between the ISI/RI ratio and retention has not been examined thoroughly in the L2 vocabulary acquisition literature. Although some studies compared the effects of short and long ISIs on L2 vocabulary learning, they may not allow us to fully examine a possible relationship between the ISI/RI ratio and retention for two reasons. First, some studies do not enable us to identify the optimal ISI because the ISI/RI ratios are neither reported nor can they be estimated (e.g., Karpicke & Bauernschmidt, 2011; Kornell, 2009; Pashler
et al., 2003; Pyc & Rawson, 2007, Experiment 1, 2009). Second, in other studies where the ISI/RI ratios are reported or can be estimated, only a limited range of ISI/RI ratios were used (e.g., Bahrick et al., 1993; Bahrick & Phelps, 1987; Crothers & Suppes, 1967; Pyc & Rawson, 2007, Experiment 2). Specifically, in order to obtain a comprehensive picture regarding the relationship between the ISI/RI ratio and retention, the following three kinds of ISIs may need to be used: an ISI that is shorter than, falls within, and is longer than the optimal range. None of the existing L2 vocabulary studies, however, examined the effects of the above three kinds of ISIs. For instance, all ISIs used by Bahrick et al. (1993), Bahrick and Phelps (1987), and Crothers and Suppes (1967, Experiment 11) were shorter than the optimal range. Pyc and Rawson (2007, Experiment 2) did not use an ISI that was longer than the optimal range. As a result, these studies may not allow us to fully examine a possible relationship between the ISI/RI ratio and L2 vocabulary learning.
Cepeda et al.'s (2009) Experiment 1 constitutes the only exception. In their experiment, 215 American undergraduate students studied 40 Swahili-English word pairs. Between-session spacing was manipulated, and the participants were assigned to one of the following six groups: 5-minute, 1-day, 2-day, 4-day, 7-day, and 14-day ISI groups. The treatments were conducted over two sessions, and the interval between the sessions varied according to the group. For instance, in the 5-minute ISI group, Session 2 took place 5 minutes after Session 1, whereas in the 14-day ISI group, there was an interval of 14 days between the two sessions. Learning was
measured by a receptive recall posttest administered 10 days after Session 2. Note that the ISI/RI ratio was 0.03%, 10%, 20%, 40%, 70%, and 140% for the 5-minute, 1-day, 2-day, 4-day, 7-day, and 14-day ISI groups, respectively. Cepeda et al. (2009) found that (a) the 5-minute ISI group was significantly less effective than all other groups, and (b) the 1-day ISI group led to a higher posttest score than the 2-, 4-, 7-, and 14-day ISI groups although the differences were not statistically significant. The results may partially support the hypothesis that the optimal ISI may fall between 10% and 30% of the RI because the 5-minute ISI group, who had an ISI that was shorter than the optimal range (0.03%), turned out to be the least effective. The 1- and 2-day ISI groups, for whom the ISI fell within the optimal range (10% and 20% of the RI, respectively), did not significantly outperform the 4-, 7-, and 14-day ISI groups, whose ISI was longer than the optimal range (40%, 70%, and 140% of the RI, respectively). This may be partly because retention tends to decrease only gradually beyond the optimal range (Finding 3 in Figure 14). Expanding upon Cepeda et al. (2009, Experiment 1), the present study also tested whether the optimal ISI may be determined as a function of the RI for L2 vocabulary learning. Unlike Cepeda et al. (2009), who examined between-session spacing, this study manipulated within-session spacing. Findings of this study may be useful because they may allow us to determine how much within-session absolute spacing should be used in order to optimise flashcard learning.
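As a rough check on the grouping described above, the following sketch classifies each ISI group in Cepeda et al. (2009, Experiment 1) against the 10-30% optimal ISI range, assuming the 10-day RI stated in the text. The names `classify`, `GROUP_ISI_DAYS`, and `RI_DAYS` are hypothetical and serve only this illustration; treating the range boundaries as inclusive is also an assumption.

```python
RI_DAYS = 10.0             # posttest given 10 days after Session 2
MINUTES_PER_DAY = 60 * 24

# ISI of each group, converted to days.
GROUP_ISI_DAYS = {
    "5-minute": 5 / MINUTES_PER_DAY,
    "1-day": 1.0,
    "2-day": 2.0,
    "4-day": 4.0,
    "7-day": 7.0,
    "14-day": 14.0,
}


def classify(ratio_percent, low=10.0, high=30.0):
    """Place an ISI/RI percentage relative to the optimal ISI range.

    The boundaries are treated as inclusive (an assumption).
    """
    if ratio_percent < low:
        return "shorter than the optimal range"
    if ratio_percent > high:
        return "longer than the optimal range"
    return "within the optimal range"


for group, isi_days in GROUP_ISI_DAYS.items():
    ratio = 100.0 * isi_days / RI_DAYS
    print(f"{group:>8} ISI: {ratio:7.2f}% of the RI, {classify(ratio)}")
```

Under these assumptions, only the 1-day and 2-day groups fall within the range, which agrees with the description above.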
4.2 Effects of Relative Spacing
4.2.1 Theoretical background: Effects of equal and expanding spacing
Not only absolute but also relative spacing may affect retention. Relative spacing refers to how study opportunities are distributed relative to one another (Karpicke & Bauernschmidt, 2011). Previous studies have looked into the effects of three types of relative spacing: equal, expanding, and contracting. Expanding spacing is often regarded as the most effective relative spacing schedule (e.g., Baddeley, 1997, pp. 112–114; Ellis, 1995; Hulstijn, 2001; Nation, 2001, pp. 76–77; Roediger & Karpicke, 2010; Schmitt & Schmitt, 1995; Schmitt, 2000, 2007). The superiority of expanding spacing over other forms of relative spacing is referred to as the expanding (Cull et al., 1996) or expanded retrieval effect (Balota et al., 2006, 2007; Carpenter & DeLosh, 2005; Logan & Balota, 2008; Maddox et al., 2011). The expanding retrieval effect may be accounted for by the retrieval practice effect and the retrieval effort hypothesis (Ellis, 1995; Finley et al., 2011; Karpicke & Bauernschmidt, 2011; Karpicke & Roediger, 2010; Logan & Balota, 2008; Storm et al., 2010). The retrieval practice effect (Baddeley, 1997, p. 112; Ellis, 1995) refers to the phenomenon where a successful retrieval from memory yields superior long-term retention to an unsuccessful retrieval (see 2.1). The retrieval effort hypothesis (Pyc & Rawson, 2009) states that the degree to which a successful retrieval enhances memory increases with the difficulty of the retrieval practice (see 3.1.2).
Taken together, the retrieval practice effect and retrieval effort hypothesis imply that
when a new lexical item is introduced, it should be practised after a relatively short ISI. Otherwise, retrieval may be unsuccessful, and learners may not be able to benefit from the positive effects of retrieval success. The second retrieval may occur after a longer interval than the first one because success on the initial retrieval may facilitate subsequent retrieval, allowing the target word to be successfully retrieved after a longer ISI. The third retrieval is likely to be successful after an even longer interval because success on the second retrieval may further strengthen the memory. In this way, gradually increasing the intervals between retrieval attempts may increase retrieval effort while, at the same time, facilitating successful retrieval. The above discussion suggests that expanding spacing may be the most effective relative spacing schedule because it may allow learners to benefit from the positive effects of both retrieval success and retrieval effort (Finley et al., 2011; Karpicke & Roediger, 2010; Logan & Balota, 2008).
The retrieval practice effect and retrieval effort hypothesis predict that equal spacing may be less effective than expanding spacing. In equal spacing, the first retrieval opportunity occurs after a longer delay than in expanding spacing (e.g., 5-5-5 rather than 1-5-9). Due to greater spacing, equal spacing may produce a lower rate of initial retrieval success than expanding spacing. When the initial retrieval is unsuccessful, the subsequent retrievals are also likely to be unsuccessful because only a successful retrieval facilitates subsequent retrieval (e.g., Ellis, 1995; Kornell & Bjork, 2008; Modigliani, 1976). Equal spacing, hence, may lead to lower retrieval success
throughout the treatment than expanding spacing (Balota et al., 2007; Cepeda et al., 2006; Cull et al., 1996; Karpicke & Bauernschmidt, 2011; Karpicke & Roediger, 2007a; Storm et al., 2010). Note that this argument is based on the assumption that the treatment does not involve feedback. If the treatment involves feedback, equal spacing may result in levels of learning phase recall similar to those of expanding spacing because feedback may allow learners to correct errors. As a result, an expanding retrieval effect may not be observed (see 4.2.2 for a related discussion). Unlike expanding spacing, equal spacing may not increase retrieval effort either because the intervals between encounters are held constant. Because it may neither facilitate successful retrieval nor increase retrieval effort, equal spacing may be less effective than expanding spacing. Contracting spacing, where the initial retrieval attempt is further delayed than in equal spacing (e.g., 9-5-1 rather than 5-5-5), may be the least effective because it may not only produce lower retrieval success than expanding and equal spacing but also decrease retrieval effort as learning proceeds. Because theoretical as well as empirical work suggests that contracting spacing may not be very effective (Gay, 1973; Gerbier & Koenig, 2012; Karpicke & Bauernschmidt, 2011; Landauer & Bjork, 1978; Tsai, 1927), the following discussion on relative spacing will mainly be concerned with equal and expanding spacing.
Note that empirical studies have yielded mixed results regarding the retrieval practice effect, one of the theoretical underpinnings of the expanding retrieval effect. While some studies have supported the retrieval practice effect (e.g., Karpicke & Roediger,
2007a; Landauer & Bjork, 1978; Logan & Balota, 2008; Maddox et al., 2011; Storm et al., 2010), other studies have suggested that a treatment that induces unsuccessful retrievals during learning could be effective over time and that learning phase performance may not necessarily be a good index of long-term retention (e.g., Bjork, 1994, 1999; Karpicke & Bauernschmidt, 2011; Pashler et al., 2003; Schmidt & Bjork, 1992; Schneider et al., 2002). However, in all previous studies showing the expanding retrieval effect, expanding spacing produced higher learning phase recall than equal spacing (Dobson, 2011; Karpicke & Roediger, 2007a, Experiment 1; Landauer & Bjork, 1978; Logan & Balota, 2008; Maddox et al., 2011, Experiment 2; Storm et al., 2010, Experiments 2 & 3; see footnote 8). At least in the studies on expanding spacing, therefore, retrieval success during learning seems to be associated with superior retention.
Footnote 8. Cull et al. (1996) also found the benefits of expanding over equal spacing on the posttest. A relationship between learning phase recall and the posttest performance in their study is not clear because they do not report learning phase recall data.
4.2.2 Empirical evidence: Effects of equal and expanding spacing
Empirical efforts have been made to compare the effects of equal and expanding spacing. Note that when comparing the effects of different relative spacing schedules, they need to have equivalent absolute spacing (e.g., Balota et al., 2007; Carpenter & DeLosh, 2005; Cull et al., 1996; Cull, 2000; Karpicke & Roediger, 2007a; Logan & Balota, 2008; Shaughnessy & Zechmeister, 1992). Comparing the 1-3-5 expanding and 5-5-5 equal schedules, for instance, may not be valid because they are not matched in absolute spacing: The former has absolute spacing of 9 units (1 + 3 + 5 = 9) while the latter has absolute spacing of 15 units (5 x 3 = 15). As a result, any potential difference between the two schedules may be partly due to the difference in absolute spacing rather than in relative spacing per se. In contrast, comparing the 1-3-5 and 3-3-3 schedules, for instance, is considered valid because the two schedules have equivalent absolute spacing (9 units). Previous studies that failed to control for absolute spacing (e.g., Dobson, 2012; Gay, 1973; Schuetze & Weimer-Stuckmann, 2010, 2011) were excluded from the following literature review because relative and absolute spacing were confounded in these studies. In addition, studies that do not involve retrieval during learning (Gerbier & Koenig, 2012; Tsai, 1927) were also excluded from the following literature review. Since retrieval increases learning (see 1.1), flashcard learning should involve retrieval in the form of recall or recognition (see 3.1.2). At the same time, previous studies indicate that the effects of equal and expanding spacing may be conditional upon the absence or presence of retrieval (Gerbier & Koenig, 2012; Landauer & Bjork, 1978). Thus, the findings of the earlier studies that do not include a retrieval component may not necessarily be applicable to flashcard learning, which should involve retrieval. For this reason, studies that do not involve retrieval practice were excluded.
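The matching requirement described above amounts to comparing the sums of the intervals in each schedule. The sketch below illustrates this; the helper names `absolute_spacing` and `is_matched` are hypothetical and are not drawn from any of the cited studies.

```python
def absolute_spacing(intervals):
    """Total spacing separating all repetitions of an item (sum of ISIs)."""
    return sum(intervals)


def is_matched(schedule_a, schedule_b):
    """True when two relative spacing schedules share the same absolute spacing."""
    return absolute_spacing(schedule_a) == absolute_spacing(schedule_b)


expanding = (1, 3, 5)
equal_short = (3, 3, 3)
equal_long = (5, 5, 5)

print(absolute_spacing(expanding))        # 1 + 3 + 5
print(absolute_spacing(equal_long))       # 5 x 3
print(is_matched(expanding, equal_long))  # confounded comparison
print(is_matched(expanding, equal_short)) # valid comparison
```

The same sum gives the 6-minute figure for four encounters separated by 2-minute ISIs: `absolute_spacing((2, 2, 2))`.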
Although expanding spacing is often regarded as the most effective relative spacing schedule (e.g., Baddeley, 1997, pp. 112–114; Ellis, 1995; Hulstijn, 2001; Nation, 2001, pp. 76–77; Roediger & Karpicke, 2010; Schmitt, 2000, 2007; Schmitt & Schmitt, 1995; see 4.2.1), studies comparing equal and expanding spacing have
yielded mixed results. Some studies failed to find any advantage of expanding over equal spacing in their posttest scores (Balota et al., 2006; Carpenter & DeLosh, 2005, Experiment 3; Cull et al., 1996, Experiments 2, 3, & 5; Cull, 2000, Experiments 1-4; Kang et al., 2013; Karpicke & Bauernschmidt, 2011; Karpicke & Roediger, 2010; Maddox et al., 2011, Experiment 1; Pyc & Rawson, 2007; Shaughnessy & Zechmeister, 1992; Storm et al., 2010, Experiment 1). Other studies suggest that the effects of equal and expanding spacing may be conditional upon several factors and found the superiority of expanding spacing under limited conditions: when (a) the task difficulty is high, (b) feedback is not provided, (c) the treatment involves long absolute spacing, and/or (d) the RI is shorter than 24 hours (see below).
First, research suggests that expanding spacing may be more effective than equal spacing when the task difficulty is high (Logan & Balota, 2008; Storm et al., 2010). As noted in 4.2.1, the expanding retrieval effect is based on the assumption that expanding spacing may produce higher success on the initial retrieval than other relative spacing schedules, which in turn may facilitate subsequent retrieval and lead to better long-term retention. If difficulty is high and memory decays rapidly, the initial retrieval needs to take place immediately after the initial encounter to facilitate successful retrieval. As a result, when the task difficulty is high, expanding spacing, which is associated with an immediate initial retrieval attempt (e.g., 1-5-9 rather than 5-5-5), may lead to higher levels of initial retrieval success than equal spacing, and a significant expanding retrieval effect may be observed. In contrast, if difficulty is low
and forgetting occurs slowly, the initial retrieval may be successful even in equal spacing, which is associated with a delayed initial retrieval attempt (e.g., 5-5-5 rather than 1-5-9). Because the assumption on which the expanding retrieval effect is based (i.e., expanding spacing produces higher success on the initial retrieval than equal spacing; see 4.2.1) will not be met, an expanding retrieval effect may not be found when difficulty is low and forgetting occurs slowly.
Logan and Balota (2008) and Storm et al. (2010) empirically demonstrated that the expanding retrieval effect may interact with task difficulty. Logan and Balota (2008) found that although expanding spacing was more effective than equal spacing for older adults (average age = 75.7 years), no significant difference existed between the two types of spacing for younger adults (average age = 19.8 years). Logan and Balota speculate that the expanding retrieval effect was observed only for older adults possibly because they tend to have a smaller memory capacity, and difficulty was higher for them. Storm et al. (2010) manipulated task difficulty by causing interference and also found the benefits of expanding spacing only when difficulty was high. A possible interaction between the expanding retrieval effect and task difficulty may partially account for the mixed results of previous studies.
Second, the effects of expanding spacing may also be conditional upon the absence or presence of feedback (Balota et al., 2007; Cepeda et al., 2006; Cull et al., 1996; Storm et al., 2010), which is referred to as the feedback hypothesis (Cepeda et al., 2006). A
possible interaction between feedback and relative spacing may also be caused by the differential positions of the initial retrieval between equal and expanding spacing. Because expanding spacing is often associated with immediate initial retrieval (see above), equal spacing may produce a lower rate of initial retrieval success than expanding spacing. When the initial retrieval is unsuccessful and the treatment does not involve feedback, the subsequent retrievals are also likely to be unsuccessful because learners do not have opportunities to correct errors. Without feedback, therefore, equal spacing may lead to lower retrieval success throughout the treatment than expanding spacing, which in turn may result in poorer long-term retention (Balota et al., 2007; Cepeda et al., 2006; Cull et al., 1996; Karpicke & Bauernschmidt, 2011; Karpicke & Roediger, 2007a; Storm et al., 2010). In contrast, as feedback may allow learners to correct errors on subsequent retrievals, it may eliminate a possible difference in learning phase performances between equal and expanding spacing. As a result, an expanding retrieval effect may not be observed when the treatment involves feedback.
Previous studies seem to support the feedback hypothesis. Feedback was not given in all previous experiments that found the superiority of expanding over equal spacing in their posttest scores (Cull et al., 1996, Experiments 1 & 4; Dobson, 2011; Karpicke & Roediger, 2007a, Experiment 1; Landauer & Bjork, 1978; Logan & Balota, 2008; Maddox et al., 2011, Experiment 2; Storm et al., 2010, Experiments 2 & 3), whereas studies providing feedback failed to find any advantage of expanding spacing (Balota
et al., 2006, Experiments 2 & 3; Cull et al., 1996, Experiment 5; Cull, 2000, Experiments 1-4; Kang et al., 2013; Karpicke & Roediger, 2007a, Experiments 2 & 3; Karpicke & Roediger, 2010, Experiment 2; Pyc & Rawson, 2007). The inconsistent findings of previous studies may also be attributed in part to a possible interaction between the expanding retrieval effect and feedback.
Third, studies suggest that relative spacing may also interact with absolute spacing (Dobson, 2012; Maddox et al., 2011). Maddox et al. (2011) observe that experiments that found the advantage of expanding over equal spacing used longer absolute spacing (4.78 spacing units on average) than those that did not (3.66 spacing units). Based on their analysis, Maddox et al. argue that the expanding retrieval effect tends to be observed when absolute spacing is relatively long. Absolute spacing may interact with relative spacing presumably because long spacing may increase task difficulty (Dobson, 2012). As discussed above, expanding spacing may be particularly effective when difficulty is high. Since long absolute spacing may increase difficulty, a significant expanding retrieval effect may also be found when the treatment involves long absolute spacing. A possible interaction between the expanding retrieval effect and absolute spacing could partially be responsible for the inconsistent results of previous studies as well.
Lastly, research also suggests that relative spacing may interact with the RI and that the advantage of expanding spacing may diminish as the RI increases (Balota et al.,
2007; Karpicke & Roediger, 2007a; Logan & Balota, 2008; Roediger & Karpicke, 2010; Storm et al., 2010). This interaction may also stem from the differential positions of the initial retrieval between equal and expanding spacing. Because the first retrieval opportunity usually occurs earlier in expanding spacing than in equal spacing (see above), learners may retrieve items from primary memory in expanding spacing, which may facilitate short-term, but not necessarily long-term memory (Karpicke & Roediger, 2007a; Roediger & Karpicke, 2010; Storm et al., 2010). In contrast, since equal spacing is associated with a delayed initial retrieval attempt, learners may be less likely to retrieve items from primary memory in equal spacing, which may facilitate long-term, but not necessarily short-term memory. As a result, although expanding spacing may be more effective than equal spacing at a short RI, its advantage may disappear over time.
The results of the previous studies seem consistent with the view that expanding spacing may be particularly effective at a short RI with several exceptions (Dobson, 2011; Storm et al., 2010, Experiments 2 & 3). In six experiments that found the benefits of expanding over equal spacing, the posttest was administered on the same day as the treatment (Cull et al., 1996, Experiments 1 & 4; Karpicke & Roediger, 2007a, Experiment 1; Landauer & Bjork, 1978; Logan & Balota, 2008; Maddox et al., 2011, Experiment 2). Most studies giving a posttest after delays greater than 24 hours failed to find any superiority of expanding spacing (Cull, 2000, Experiments 3 & 4; Kang et al., 2013; Karpicke & Bauernschmidt, 2011; Karpicke & Roediger, 2007a,
2010; Logan & Balota, 2008; Storm et al., 2010, Experiment 1). Storm et al. (2010, Experiments 2 & 3) and Dobson (2011) constitute exceptions because they found the superiority of expanding spacing at an RI greater than 24 hours. Their experiments may have diverged from the others because their conditions favoured expanding spacing: (a) the task difficulty was high (Storm et al., 2010, Experiments 2 & 3), (b) feedback was not provided (Dobson, 2011; Storm et al., 2010, Experiments 2 & 3), and (c) the treatment involved long absolute spacing (3 days: Dobson, 2011), which may have resulted in a larger expanding retrieval effect than in other studies. As the above review of literature reveals, the expanding retrieval effect may be sensitive to changes in the experimental procedures such as task difficulty, feedback, absolute spacing, and the RI. This may be partly the reason why empirical studies have yielded mixed results regarding the effects of equal and expanding spacing.
4.2.3 Effects of equal and expanding spacing on L2 vocabulary learning
Most previous studies comparing equal and expanding spacing have investigated the learning of materials other than L2 vocabulary such as pairs of high-frequency L1 words (Balota et al., 2006; Logan & Balota, 2008; Maddox et al., 2011), low-frequency L1 vocabulary (Karpicke & Roediger, 2007a), name pairs (Landauer & Bjork, 1978, Experiment 1), face-name pairs (Carpenter & DeLosh, 2005, Experiments 2 & 3; Cull et al., 1996, Experiments 1 & 2; Landauer & Bjork, 1978, Experiment 2), general facts (Cull et al., 1996, Experiments 3-5; Shaughnessy &
Zechmeister, 1992), and text materials (Karpicke & Roediger, 2010; Storm et al., 2010). Three studies constitute exceptions in that they investigated the effects of expanding and equal spacing on L2 vocabulary learning (Kang et al., 2013; Karpicke & Bauernschmidt, 2011; Pyc & Rawson, 2007). None of the three found any advantage of expanding over equal spacing in posttest scores.
In Pyc and Rawson (2007), 161 American undergraduate students studied 48 Swahili-English word pairs. Target items were assigned to one of the following two schedules: 1-5-9 expanding and 5-5-5 equal spacing. Within-session spacing was manipulated, and the number of intervening trials was used as a unit of spacing. No statistically significant difference existed between the equal and expanding spacing conditions on the posttest conducted 40 minutes after the treatment. In Kang et al. (2013), 37 participants studied 60 Japanese-English word pairs under two conditions: equal and expanding spacing. The treatment was conducted over a period of 4 weeks, and between-session spacing was manipulated. In the equal spacing condition, the target words were studied on Days 1, 10, 19, and 28 (9-9-9 schedule). In the expanding condition, the target items were studied on Days 1, 3, 9, and 28 (2-6-19 schedule). Expanding spacing produced higher learning phase recall than equal spacing. However, on the posttest administered on Day 83, no statistically significant difference was found between the two schedules (equal = 46%, expanding = 49%).
In Karpicke and Bauernschmidt (2011), 96 American undergraduate students studied
24 Swahili-English word pairs. As in Pyc and Rawson (2007), within-session spacing was manipulated, and the number of intervening trials was used as a unit of spacing. In Karpicke and Bauernschmidt, three relative spacing conditions (equal, expanding, and contracting) were factorially crossed with three absolute spacing conditions (15, 30, and 90 intervening trials), resulting in the following nine schedules: 5-5-5 (short equal), 10-10-10 (medium equal), 30-30-30 (long equal), 1-5-9 (short expanding), 5-10-15 (medium expanding), 15-30-45 (long expanding), 9-5-1 (short contracting), 15-10-5 (medium contracting), and 45-30-15 (long contracting). One goal of their study was to compare the effects of the three types of relative spacing while ensuring that they would be equivalent in the levels of retrieval success during the treatment. More specifically, in previous studies, expanding spacing has often produced higher learning phase recall than equal spacing (see 4.2.1) presumably because the first retrieval opportunity occurs earlier in expanding spacing (see 4.2.2). In order to separate the effects of relative spacing from those of learning phase recall, in Karpicke and Bauernschmidt (2011), target items were studied to the criterion of one correct retrieval before being assigned to one of the three relative spacing schedules (equal, expanding, and contracting). This manipulation inhibited forgetting during learning and ensured that the three relative spacing schedules would result in similar levels of learning phase recall (equal: 94.4%, expanding: 95.8%, contracting: 93.8%). Although feedback was provided until a given item reached the criterion of one correct retrieval, no feedback was given for previously correctly retrieved items. On the posttest conducted 1 week after the treatment, the main effect of absolute spacing was
significant (M = 49%, 64%, and 75% for short, medium, and long spacing, respectively). Yet, neither the main effect of relative spacing nor the interaction between absolute and relative spacing reached significance. The results suggest that there was no statistically significant difference among equal, expanding, and contracting spacing regardless of absolute spacing.
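The factorial structure of the nine schedules described above can be checked with a short script. The following is only an illustrative sketch: the dictionary layout and the function names are assumptions made for this example and are not taken from Karpicke and Bauernschmidt (2011). It confirms that every schedule at a given absolute spacing level sums to the same number of intervening trials (15, 30, or 90) and that each schedule has the relative spacing pattern its label implies.

```python
# Nine schedules as listed in the text; each tuple gives the number of
# intervening trials between successive encounters of an item.
SCHEDULES = {
    ("short", "equal"): (5, 5, 5),
    ("medium", "equal"): (10, 10, 10),
    ("long", "equal"): (30, 30, 30),
    ("short", "expanding"): (1, 5, 9),
    ("medium", "expanding"): (5, 10, 15),
    ("long", "expanding"): (15, 30, 45),
    ("short", "contracting"): (9, 5, 1),
    ("medium", "contracting"): (15, 10, 5),
    ("long", "contracting"): (45, 30, 15),
}

# Total intervening trials reported for each absolute spacing level.
EXPECTED_TOTAL = {"short": 15, "medium": 30, "long": 90}


def absolute_spacing(gaps):
    """Absolute spacing: total number of intervening trials in a schedule."""
    return sum(gaps)


def relative_spacing(gaps):
    """Classify a schedule by how its gaps change across encounters."""
    pairs = list(zip(gaps, gaps[1:]))
    if all(a == b for a, b in pairs):
        return "equal"
    if all(a < b for a, b in pairs):
        return "expanding"
    if all(a > b for a, b in pairs):
        return "contracting"
    raise ValueError(f"mixed pattern: {gaps}")


def design_is_consistent():
    """True if every schedule matches its absolute and relative labels."""
    return all(
        absolute_spacing(gaps) == EXPECTED_TOTAL[level]
        and relative_spacing(gaps) == relative
        for (level, relative), gaps in SCHEDULES.items()
    )


print(design_is_consistent())
```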
Contrary to the view that expanding spacing may be particularly effective when the treatment involves long absolute spacing (Dobson, 2012; Maddox et al., 2011), Karpicke and Bauernschmidt (2011) did not find any significant interaction between relative and absolute spacing. This null result may have arisen because their experimental design was somewhat biased against expanding spacing. As described above, in Karpicke and Bauernschmidt (2011), target items were studied to the criterion of one correct retrieval before being assigned to one of the three relative spacing schedules to ensure similar levels of learning phase performance. This manipulation might have favoured equal and contracting spacing because one benefit of expanding spacing may be its ability to produce more successful retrievals during learning than other forms of relative spacing (see 4.2.1). Therefore, by guaranteeing comparable levels of learning phase recall among the three relative spacing schedules, Karpicke and Bauernschmidt (2011) might have diluted a possible expanding retrieval effect. This may be partly the reason why Karpicke and Bauernschmidt failed to observe any superiority of expanding spacing irrespective of absolute spacing.
Limitations of previous L2 vocabulary studies
Even though the findings of the three L2 vocabulary studies are very valuable, they may be limited in that they have not rigorously controlled variables that may influence the effects of equal and expanding spacing. As described in 4.2.2, earlier studies suggest that expanding spacing may be more effective than equal spacing when (a) the treatment involves long absolute spacing, (b) the RI is shorter than 24 hours, (c) feedback is not provided, and / or (d) the task difficulty is high. These findings suggest that when comparing the effects of equal and expanding spacing, it may be useful to consider the effects of at least the following four variables: absolute spacing, the RI, feedback, and task difficulty.
Let us consider the implications of possible interactions between the expanding retrieval effect and the above four variables. First, prior studies suggest that the effects of relative spacing may depend on absolute spacing (Dobson, 2012; Maddox et al., 2011). The finding suggests that even if one particular expanding schedule, 1-5-9, for instance, fails to outperform an equal interval schedule, it may not be valid to generalise the finding to other types of expanding schedules such as 5-10-15 or 15-30-45. In order to obtain a comprehensive picture regarding the effects of equal and expanding spacing, it may be useful to use several expanding spacing schedules that differ in absolute spacing. Manipulating relative and absolute spacing simultaneously may also have pedagogical value because it may allow us to determine the relative importance of the two types of spacing on vocabulary learning (Karpicke
& Bauernschmidt, 2011). Among the three L2 vocabulary studies, only Karpicke and Bauernschmidt (2011) manipulated absolute spacing. Their experimental design, however, might have been a little biased against expanding spacing (see above), which may partially account for their failure to observe any significant interaction between relative and absolute spacing. The present study attempted to investigate the interaction between relative and absolute spacing in a more rigorous manner than Karpicke and Bauernschmidt.
Second, existing studies also suggest that the expanding retrieval effect tends to be observed when the RI is shorter than 24 hours (e.g., Balota et al., 2006; Karpicke & Roediger, 2007a; Logan & Balota, 2008). Due to the possible interaction between relative spacing and the RI, the results could be misleading if a posttest is given at only one RI. Yet, all three L2 vocabulary studies administered a posttest at only one RI: 40 minutes (Pyc & Rawson, 2007), 7 days (Karpicke & Bauernschmidt, 2011), and 55 days (Kang et al., 2013). With this limitation in mind, the present study administered posttests at two different RIs: immediately and 1 week after the treatment. Some may argue that an immediate posttest may not be necessary because from a pedagogical perspective, scores on a delayed posttest may be more important. Although this may be true, a treatment that enhances short-term learning could be useful as long as it does not inhibit long-term retention because in a real-life situation, learners may sometimes require the knowledge of L2 words that they have just learnt. Hence, this study administered delayed as well as immediate posttests in order to test
the possible interaction between relative spacing and the RI.
Third, existing studies suggest that expanding spacing may be particularly effective when the treatment does not involve feedback (e.g., Balota et al., 2007; Cepeda et al., 2006; Cull et al., 1996; Storm et al., 2010). Although this interaction may be of theoretical importance, manipulating the presence or absence of feedback may have limited pedagogical value because not only does feedback enhance learning (e.g., Metcalfe & Kornell, 2007) but also paper-based flashcard learning typically involves feedback (e.g., Karpicke & Bauernschmidt, 2011; Karpicke & Roediger, 2007a; Karpicke, Smith, & Grimaldi, 2009). Feedback also appears to be a common feature among existing flashcard software. In all nine flashcard programs surveyed by Nakata (2011), feedback is given after each response. In order to increase learning and ecological validity, therefore, it may be useful to always provide feedback after retrievals when examining the effects of expanding spacing on vocabulary learning. Although feedback was provided in Pyc and Rawson (2007) and Kang et al. (2013), it was not given for previously correctly retrieved items in Karpicke and Bauernschmidt (2011). Thus, it is unclear to what extent their findings may be applicable to authentic flashcard learning, which typically involves feedback. While the provision of feedback may favour equal spacing (see 4.2.2), it was decided to give feedback after retrievals in the current study in order to increase learning and ecological validity.
Fourth, prior studies suggest that a significant expanding retrieval effect may be
observed when the task difficulty is high. This finding may be particularly pertinent to the retrieval format during learning (receptive recognition, productive recognition, receptive recall, or productive recall) and pacing of the treatment (self-paced or computer-paced). As for the retrieval format, research suggests that productive retrieval is more demanding than receptive retrieval and that recall is more difficult than recognition (see 3.1.2). Incidentally, all three previous L2 vocabulary studies used a receptive recall format during the treatment. From a pedagogical perspective, productive retrieval may be more desirable because it may result in adequate gains in receptive knowledge as well as large gains in productive knowledge (see 3.1.1). Thus, although the use of a productive format may increase task difficulty and favour expanding spacing, it was decided to use productive recall in the present study in order to increase learning. Recall, rather than recognition, formats were used because recall may enhance learning more than recognition (see 3.1.2). As for the pacing of the treatment, imposing a time limit for retrieval practice may increase task difficulty (Nation, 1982, 2001, p. 305). In all three L2 vocabulary studies, retrieval practice was paced by the computer, and participants were required to type a response within 6 (Kang et al., 2013) or 8 seconds (Karpicke & Bauernschmidt, 2011; Pyc & Rawson, 2007). Pedagogically, a self-paced treatment might be more desirable in terms of effectiveness and ecological validity (see 2.2.7). Hence, in this study, retrieval practice was self-paced, and no time limit was imposed although a self-paced treatment may favour equal spacing by decreasing task difficulty.
In summary, earlier studies suggest that when comparing the effects of equal and expanding spacing on L2 vocabulary learning, it may be useful to meet the following five conditions: (a) manipulate absolute as well as relative spacing, (b) give a posttest at more than one RI, (c) provide feedback after retrievals, (d) use productive recall during learning, and (e) do not impose a time limit for retrieval practice. None of the previous L2 vocabulary studies, however, meets all of these five conditions. Pyc and Rawson (2007) and Kang et al. (2013) satisfy only (c). Karpicke and Bauernschmidt’s (2011) study meets (a) fully but (c) only partially, because feedback was withheld for previously correctly retrieved items. With the limitations of the existing L2 vocabulary studies in mind, the present study compared the effects of equal and expanding spacing while meeting all of the above five conditions. By controlling variables that may influence the effects of equal and expanding spacing, the current study may allow us to determine how relative spacing may influence flashcard learning in a more rigorous manner than earlier studies.
4.3 The Present Study
The purpose of the present study was to identify the optimal within-session absolute and relative spacing schedules for L2 vocabulary learning. Regarding absolute spacing, previous studies indicate that the optimal ISI may be determined relative to the RI (see 4.1.2). However, a possible relationship between the ISI/RI ratio and retention has not been examined thoroughly in L2 vocabulary acquisition studies. An exception is Cepeda et al. (2009, Experiment 1), who manipulated between-session spacing. Expanding on their experiment, this study tested whether the optimal ISI may
be determined as a function of the RI while manipulating within-session spacing. Findings of this study may be useful because they may allow us to determine how much within-session absolute spacing should be used in order to optimise flashcard learning.
As for relative spacing, prior studies suggest that the effects of equal and expanding spacing may be conditional upon factors such as absolute spacing, the RI, feedback, and task difficulty (see 4.2.2). These findings suggest that when comparing the effects of equal and expanding spacing on L2 vocabulary learning, it may be useful to meet the following five conditions: (a) manipulate absolute as well as relative spacing, (b) give a posttest at more than one RI, (c) provide feedback after retrievals, (d) use productive recall during learning, and (e) do not impose a time limit for retrieval practice. However, none of the previous L2 vocabulary studies meets all these five conditions (Kang et al., 2013; Karpicke & Bauernschmidt, 2011; Pyc & Rawson, 2007). With the limitations of the existing studies in mind, the present study compared the effects of equal and expanding spacing while satisfying the above five conditions. Findings of this study may allow us to determine which kind of relative spacing schedule should be used in order to optimise flashcard learning.
Research questions
The following two research questions were addressed in this study:
1. Can the optimal within-session ISI be determined relative to the RI for L2 vocabulary learning?
2. Is expanding spacing more effective than equal spacing for L2 vocabulary learning when (a) the treatment involves long absolute spacing, (b) the RI is shorter than 24 hours, (c) feedback is provided after retrievals, (d) the treatment involves productive recall, and (e) no time limit is imposed for retrieval practice?
4.4 Pilot Studies
Pilot studies were conducted with eight Japanese learners of English in New Zealand to identify any potential problems with the methodology of this experiment. No major problem was found in the pilot studies.
4.5 Method
4.5.1 Participants
The original pool of participants consisted of 167 Japanese learners of English. Forty-eight of them were first-year law majors at the same university as in Experiments 1A and 1B (hereafter, OU). Their average score on the first to the sixth 1,000-word frequency levels of the Vocabulary Size Test (VST: Nation & Beglar, 2007) was 30.91 (SD = 7.03) out of 60. The other 119 participants were first-year engineering students at a technical college in Gifu, Japan (hereafter, GTC). No data were available regarding their English proficiency. Out of the 167 participants, 35 were dropped because they were absent on the day of the experiment, chose not to
participate, or their data were lost due to technical problems, leaving 132 students. The participants were randomly assigned to the four absolute spacing groups (massed, short, medium, and long). A further four students were excluded from analysis in order to (a) counterbalance the effects of target items across participants and (b) ensure that the four absolute spacing groups would consist of an equal number of participants from OU and GTC (see 4.6), leaving 128 students. Each of the massed, short, medium, and long groups consisted of 32 participants (24 GTC and 8 OU students). The average VST score of the OU students (SDs in parentheses) was 30.75 (6.71), 27.50 (6.95), 32.75 (5.99), and 32.63 (8.30) out of 60 for the massed, short, medium, and long groups, respectively. There was no significant difference in the VST scores among the four groups, H (3) = 2.75, p = .432. None of the participants exhibited prior knowledge of any of the target words on a productive pretest regardless of the scoring procedure (see 4.5.7 for details about the pretest). It is assumed, therefore, that the four absolute spacing groups did not differ from one another in terms of their productive knowledge of the target words studied under the equal and expanding conditions at the outset of the experiment. The results of the receptive pretest will be discussed in 4.6.
4.5.2 Experimental design
There were three independent variables in the current study. The first independent variable was absolute spacing: massed, short, medium, and long. The second independent variable was relative spacing: equal and expanding. The third
independent variable was the retention interval (RI): immediate and 1-week delayed posttests. The present experiment employed a mixed design. Absolute spacing was a between-participant variable, and relative spacing and the RI were within-participant variables. The dependent variables were effectiveness and efficiency of the absolute and relative spacing conditions. Effectiveness was measured by the number of correct responses on the posttest. Efficiency was defined as the number of words acquired per minute and calculated by dividing the posttest score by the study time (e.g., Kornell, 2009; Mondria, 2003; Pyc & Rawson, 2007).
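The efficiency measure defined above amounts to a single division. The sketch below is offered only as an illustration: the function name, the choice of seconds as the unit for recorded study time, and the example figures are assumptions made here and are not data from the study.

```python
def efficiency(posttest_score, study_time_seconds):
    """Words acquired per minute: posttest score divided by study time in minutes."""
    if study_time_seconds <= 0:
        raise ValueError("study time must be positive")
    study_time_minutes = study_time_seconds / 60
    return posttest_score / study_time_minutes


# Hypothetical learner: 12 correct posttest responses after 1,800 seconds of study.
print(f"{efficiency(12, 1800):.2f} words per minute")
```

Under this definition, two conditions with the same posttest score can differ in efficiency whenever their study times differ, which is why effectiveness and efficiency are reported separately.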
Figure 15 summarises the design of the current study. As described in 4.5.1, the participants were randomly assigned to one of the four absolute spacing groups: massed, short, medium, and long. Each of the four groups consisted of 32 participants. Hereafter, the massed group will be referred to as the control group, and the short, medium, and long spacing groups will be collectively referred to as the experimental groups. The participants in the three experimental groups were randomly divided into two subgroups of 16 participants: Subgroups X and Y. Twenty target word pairs were also divided into two sets of 10 items: Sets A and B. The two subgroups of participants in each experimental group studied both sets of word pairs under different relative spacing conditions (equal or expanding), thus counterbalancing the effects of target items (see Figure 15). In other words, Subgroup X in the short spacing group studied Set A under the equal interval condition (5-5-5) and Set B under the expanding condition (1-5-9). Subgroup Y in the short spacing group studied Set B
under the equal interval condition and Set A under the expanding condition. The same was true for the medium and long spacing groups. In the control group, both Sets A and B were studied in the massed schedule (0-0-0).
Figure 15. Design of Study 3. Note. Each item set consisted of 10 items.
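The counterbalancing scheme summarised in Figure 15 can be expressed as a small lookup. This is a minimal sketch under the assumption that the design is fully described by the group, the subgroup, and the item set; the function and variable names are hypothetical and do not come from the flashcard program used in the study.

```python
# Equal and expanding schedules for each experimental group, as given in the text.
SCHEDULES = {
    "short": {"equal": (5, 5, 5), "expanding": (1, 5, 9)},
    "medium": {"equal": (10, 10, 10), "expanding": (5, 10, 15)},
    "long": {"equal": (30, 30, 30), "expanding": (15, 30, 45)},
}
MASSED = (0, 0, 0)


def assign_schedules(group, subgroup):
    """Return the schedule used for item Sets A and B by one subgroup."""
    if group == "massed":
        # Control group: both sets are studied in the massed schedule.
        return {"A": MASSED, "B": MASSED}
    if subgroup == "X":
        equal_set, expanding_set = "A", "B"
    elif subgroup == "Y":
        equal_set, expanding_set = "B", "A"
    else:
        raise ValueError(f"unknown subgroup: {subgroup}")
    return {
        equal_set: SCHEDULES[group]["equal"],
        expanding_set: SCHEDULES[group]["expanding"],
    }


for subgroup in ("X", "Y"):
    print("short", subgroup, assign_schedules("short", subgroup))
```

Because the two subgroups receive mirrored assignments, each item set is studied under both relative spacing conditions within every experimental group.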
As shown in Figure 15, the following six spacing conditions were used in this study: short equal (5-5-5), short expanding (1-5-9), medium equal (10-10-10), medium expanding (5-10-15), long equal (30-30-30), and long expanding (15-30-45). See 4.5.4 for the justification for using these six schedules. As in Study 1, the number of trials, not time, was used as an index of spacing (see 2.2.8), and the numbers in Figure 15 indicate the number of intervening trials in each schedule. For instance, in the 5-5-5 schedule, encounters of a given item were always separated by five trials. In the massed schedule (0-0-0), all target items were studied four times in a row. In order to isolate the effects of relative and absolute spacing, the equal and expanding schedules in each experimental group were matched in the total number of intervening trials (see
4.2.2). For instance, in the short spacing group, both equal (5-5-5) and expanding (1-5-9) schedules had absolute spacing of 15 intervening trials (5 x 3 = 15; 1 + 5 + 9 = 15).
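To make the trial-based spacing unit concrete, the sketch below computes the trial positions of the four encounters with an item under a given schedule. It is an illustration only: the function name and the assumption that the first encounter falls on trial 1 are made for this example, and actual positions in the treatment depended on the item order.

```python
def encounter_positions(first_trial, gaps):
    """Trial numbers of each encounter, given the intervening trials between them."""
    positions = [first_trial]
    for gap in gaps:
        # The next encounter follows the previous one after `gap` intervening trials.
        positions.append(positions[-1] + gap + 1)
    return positions


print(encounter_positions(1, (5, 5, 5)))  # [1, 7, 13, 19]
print(encounter_positions(1, (1, 5, 9)))  # [1, 3, 9, 19]
print(encounter_positions(1, (0, 0, 0)))  # [1, 2, 3, 4]
```

Because the 5-5-5 and 1-5-9 schedules share the same total of 15 intervening trials, the first and last encounters fall on the same trials in both schedules; only the positions of the second and third encounters differ.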
As noted above, the number of trials was used as a unit of spacing in this study. At the same time, the average amount of time between repetitions was analysed after the experiment to investigate whether absolute spacing in the equal and expanding conditions was equivalent when time is used as an index of spacing. The analysis indicated that the two relative spacing conditions could be regarded as having roughly equivalent absolute spacing whether trial or time is used as a spacing unit (see 4.6.2).
4.5.3 Procedure
The procedure in this study was exactly the same as in Experiment 1B except for the treatment (see 4.5.6). As in Experiment 1B, the present experiment consisted of three sessions. In Session 1, participants received explanations about the study. Session 2 comprised the practice period, pretest, treatment, filler task, immediate posttest, and questionnaire. In Session 3, the delayed posttest was administered. There was an interval of 1 week between Sessions 1 and 2, and between Sessions 2 and 3.
4.5.4 Absolute and relative spacing schedules
The present study compared the same six equal and expanding schedules (5-5-5, 1-5-9, 10-10-10, 5-10-15, 30-30-30, and 15-30-45; Figure 15) as in Karpicke and
Bauernschmidt (2011). These six conditions were chosen for two reasons. First, the use of the above six conditions may allow us to test the view that expanding spacing may be more effective than equal spacing when the treatment involves long absolute spacing (see 4.2.2). Specifically, pilot studies with eight Japanese learners of English suggested that when the six spacing schedules in Figure 15 were used, the mean ISI would be 60.56, 147.22, and 359.26 seconds in the short, medium, and long spacing groups, respectively. Because all previous experiments on paired-associate learning that have found an advantage of expanding spacing used average within-session spacing of 60 seconds or shorter (5 - 60 seconds; Cull et al., 1996; Karpicke & Roediger, 2007a; Landauer & Bjork, 1978; Logan & Balota, 2008; Maddox et al., 2011), it was judged that using the spacing schedules in Figure 15 may allow us to investigate whether expanding spacing may be particularly effective when the treatment involves long absolute spacing. Second, the use of the six spacing conditions in Figure 15 may also allow us to examine whether the optimal ISI might be determined relative to the RI (see 4.3). Pilot studies suggested that when the six spacing schedules in Figure 15 were used, the mean ISI/RI ratio would be 9.2%, 21.0%, and 86.5% on the immediate posttest for the short, medium, and long spacing groups, respectively. (The mean ISI/RI ratio would be 0% in the massed group because there was no spacing.) If we assume that the optimal ISI falls within 10 - 30% of the RI (see 4.1.2), the ISIs in the three experimental groups would correspond to one of the following: an ISI that is shorter than (short spacing: 9.2%), falls within (medium spacing: 21.0%), and is longer than the optimal range (long spacing: 86.5%).
The use of the six spacing schedules in Figure 15, therefore, may enable us to examine a possible relationship between the ISI/RI ratio and memory performance. Based on the above two reasons, it was decided to use the same six schedules as in Karpicke and Bauernschmidt (2011).
Note that on the delayed posttest, the ISIs of all three experimental groups would likely be much shorter than the optimal range (10 - 30% of the RI). In the pilot studies, the mean ISI/RI ratio on the delayed posttest was 0.01%, 0.02%, and 0.04% for the short, medium, and long spacing groups, respectively. As a result, none of the three experimental treatments may be very effective on the 1-week delayed posttest. Nonetheless, it was decided to use the six spacing schedules in Figure 15 because in a study manipulating within-session spacing, it would not be feasible to use an ISI that falls within the optimal range at a relatively long RI. For instance, for a 1-week RI, an ISI of 16.8 (24 hours x 7 days x 10%) to 50.4 hours (24 hours x 7 days x 30%) may fall within the optimal range. If we assume that each target word is encountered four times during the treatment, in order to use a within-session ISI of 16.8 to 50.4 hours, a single study session needs to last for at least 50.4 (16.8 hours x 3) to 151.2 hours (50.4 hours x 3), which is not practical. As a result, the six spacing schedules in Figure 15 were used although they might not necessarily be optimal on the delayed posttest.
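The feasibility argument above rests on two multiplications, which the following sketch reproduces. The function names and default arguments are assumptions introduced for this example; the 10 - 30% window and the four encounters per item are taken from the text.

```python
HOURS_PER_WEEK = 24 * 7  # a 1-week retention interval (RI), in hours


def optimal_isi_window(ri_hours, low=0.10, high=0.30):
    """ISI range (in hours) that falls within 10 - 30% of the RI."""
    return ri_hours * low, ri_hours * high


def session_length(isi_hours, encounters=4):
    """Minimum session length if all encounters are separated by the same ISI."""
    return isi_hours * (encounters - 1)


low_isi, high_isi = optimal_isi_window(HOURS_PER_WEEK)
print(f"Optimal ISI: {low_isi:.1f} to {high_isi:.1f} hours")
print(f"Session length: {session_length(low_isi):.1f} to {session_length(high_isi):.1f} hours")
```

With four encounters there are three intervals per item, so even the lower bound of the window implies a single session of roughly 50 hours.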
4.5.5 Target and filler words
The same 20 low-frequency English words from Study 1 were used as target items (see 2.2.6). They were chosen because Study 1 indicated that Japanese university students had little prior knowledge of these words. The 20 word pairs were divided into Sets A and B so that the learning difficulty would be distributed as evenly as possible (Figure 16). Learning difficulty was operationally defined as learners’ performance in Experiment 1A. Specifically, the two sets of items were matched for the following five variables: learners’ scores on the receptive pretest, immediate receptive posttest, delayed receptive posttest, immediate productive posttest, and delayed productive posttest in Experiment 1A (see Table 29). Each item set consisted of six nouns and four verbs (Figure 16). It was not possible to use the data from Experiment 1B because Study 3 was conducted prior to Experiment 1B. Mann-Whitney nonparametric tests showed that no statistically significant difference existed between the two sets in any of the variables, U < 50.00, p > .545, r < .19. Although the lack of statistical significance does not necessarily guarantee that the two sets are completely equivalent in their difficulty, it was judged that a possible difference, if any, might not have a major effect on the results of the present study because effects of target items would be counterbalanced across participants (see Figure 15).
Set  Target words
A    apparition, billow, citadel, fracas, gouge, levee, nadir, pique, rue, warble
B    cadge, dally, fawn, grig, loach, mane, mirth, quail, scowl, toupee
Figure 16. Target items by item set. Underlined words are verbs while others are nouns.
Table 29
Pretest and Posttest Scores in Experiment 1A
Test                            Scoring     Set A M   Set A SD   Set B M   Set B SD
Receptive pretest               Strict      0.42%     0.70%      0.53%     0.71%
Receptive pretest               Sensitive   0.42%     0.70%      0.53%     0.71%
Receptive immediate posttest    Strict      71.3%     14.1%      71.3%     15.6%
Receptive immediate posttest    Sensitive   74.7%     11.8%      71.5%     15.7%
Receptive delayed posttest      Strict      52.0%     15.9%      52.1%     18.8%
Receptive delayed posttest      Sensitive   53.8%     14.3%      52.5%     19.0%
Productive immediate posttest   Strict      59.3%     16.7%      62.2%     16.3%
Productive immediate posttest   Sensitive   69.1%     15.0%      69.3%     14.9%
Productive delayed posttest     Strict      13.7%     10.1%      16.8%     12.5%
Productive delayed posttest     Sensitive   25.5%     10.9%      25.2%     14.4%
Note. n = 95 for the pretest. n = 91 for the immediate and delayed posttests because four students who demonstrated prior knowledge of one or more target words on the receptive pretest were excluded (see 2.2.3). The productive pretest was not administered in Experiment 1A.
Thirteen additional items were used as filler items: abet, banter, exalt, husk, jibe, polemic, promontory, smudge, tyro, urn, usurp, valor, and vestige. They were chosen based on the same criteria as the target items (see 2.2.6). Filler items were used to manipulate spacing. They were also used as primacy and recency buffers (see 4.5.6 for further details). Filler items were treated in exactly the same way as target items during the treatment, and participants were not informed that filler items would be used. The number of filler items was set to 13 based on the following calculations: As each target word was encountered four times during learning (see 4.5.6), each filler item was encountered no more than four times so that participants might not differentiate between filler and target items. Because there were 50 trials for filler items (see 4.5.6), and each filler item was encountered up to four times, at least 12.5
(50 / 4 = 12.5) filler items were needed. Therefore, the number of filler items was set to 13.
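The calculation above is a ceiling division: the number of filler trials divided by the maximum number of encounters per filler item, rounded up to the next whole item. A minimal sketch follows, with a function name chosen here purely for illustration.

```python
import math


def fillers_needed(filler_trials, max_encounters_per_item):
    """Smallest whole number of filler items that can cover all filler trials."""
    return math.ceil(filler_trials / max_encounters_per_item)


# 50 filler trials, each filler item encountered at most four times.
print(fillers_needed(50, 4))  # 13
```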
4.5.6 Treatment
In the treatment, 33 English-Japanese word pairs (including 13 filler items) were studied with a flashcard program. Each target item was encountered four times throughout the treatment in all four groups (see below for the justification for setting the number of encounters to four). As in Experiment 1B, in the first encounter with each item, the English and Japanese words were presented simultaneously for 8 seconds per word pair (initial presentation). In the remaining encounters, items were practised in a productive recall format (see 2.2.7). After each response, feedback was shown for 5 seconds per response.
The treatments in this study were different from those in Experiment 1B in three respects: the independent variable, number of encounters with target words, and feedback. First, whereas block size was manipulated during the treatment in Experiment 1B, absolute and relative spacing were manipulated in this study. Second, the number of encounters with target words during learning was reduced from five to four. The number of encounters was set to four for three reasons. First, Crothers and Suppes (1967, Experiments 8 & 9) suggest that 85 - 88% of the items in their study were acquired after six (Experiment 9) or seven encounters (Experiment 8). Considering that the current study involved fewer items (33 including fillers) than
Crothers and Suppes (108 in Experiment 8 and 216 in Experiment 9), six encounters may lead to a ceiling effect, reducing the potential of showing a difference between conditions. Second, the results of the pilot studies suggested that neither a ceiling nor floor effect was observed on the posttests with four encounters. Third, in most previous studies on equal and expanding spacing, there are four encounters for each target item (e.g., Balota et al., 2006; Carpenter & DeLosh, 2005; Cull et al., 1996; Cull, 2000; Karpicke & Roediger, 2007a; Landauer & Bjork, 1978, Experiment 1; Logan & Balota, 2008; Pyc & Rawson, 2007; Shaughnessy & Zechmeister, 1992). By using the same number of encounters as existing studies, we may be able to compare the results of the present and previous studies. With these considerations in mind, the number of encounters with target words was set to four.
Lastly, the present study and Experiment 1B also differed in feedback after retrievals. In Experiment 1B, there were two types of feedback: one for a correct response (Figure 4, top) and the other for an incorrect response (Figure 4, bottom). In this study, a third type of feedback (Figure 17) was displayed when the response was judged to be partially correct by the computer program. Partially correct responses were defined as those that would be awarded 0.75 using LPSP (see 2.2.11). This new type of feedback was added because it might help motivate participants: Since incorrect responses during learning may potentially demotivate learners (Fritz, 2010; Logan & Balota, 2008; Mondria & Mondria-de Vries, 1994), it was judged that acknowledging partially correct responses may have a positive effect on learners’ motivation. Other
than the independent variable, number of encounters with target words, and feedback, the treatments in this study were exactly the same as in Experiment 1B.
Figure 17. Feedback for a partially correct response. English translations are provided on the right.
Order of items The order of items during the treatment was determined based on six principles. First, there were three primacy and recency buffers at the beginning and end of the treatment in all four absolute spacing groups to lessen the influence of serial position effects (Karpicke & Roediger, 2007a; see 2.2.6). Second, the item order was determined so that the trials for the seven schedules (0-0-0, 5-5-5, 1-5-9, 10-10-10, 5-10-15, 30-30-30, and 15-30-45) would be distributed roughly equally across the treatment. In other words, items assigned to a particular review schedule should not be clustered at the beginning or end of the treatment. This was done because the positions of the trials might affect learning (e.g., Delaney et al., 2010; Karpicke &
Roediger, 2007a). The average positions of the trials (SDs in parentheses) were 56.90 (38.31), 56.80 (36.69), 56.80 (35.60), 56.30 (34.23), 56.30 (32.99), 58.00 (36.53), and 57.50 (35.06) for the 0-0-0, 5-5-5, 1-5-9, 10-10-10, 5-10-15, 30-30-30, and 15-30-45 schedules, respectively. The trials for the seven schedules, therefore, were judged to be distributed roughly equally across the treatment. Third, in order to manipulate spacing, filler items were studied in a position that was not occupied by any of the target items (e.g., Cull et al., 1996; Karpicke & Bauernschmidt, 2011; Karpicke & Roediger, 2007a; Logan & Balota, 2008; Maddox et al., 2011; Pyc & Rawson, 2007). Fourth, the order of items was determined so that all four absolute spacing groups would have an equal number of filler trials. This is because the number of filler trials may affect learning. The number of filler trials was set to 50 because the long spacing group, which required the largest number of trials for filler items, needed 50 filler trials to complete the treatment. Fifth, the order of items was determined so that the filler items would have roughly the same amount of absolute spacing as the target items in each group. This was done so that participants might not differentiate between filler and target items. The mean ISI of the filler items was 0.00 (0.00), 5.70 (4.07), 13.57 (7.64), and 25.59 (14.31) trials in the massed, short, medium, and long spacing groups, respectively, whereas that of the target items was 0.00 (0.00), 5.00 (2.31), 10.00 (2.89), and 30.00 (8.66) trials in the massed, short, medium, and long spacing groups. The filler and target items in each group, therefore, were regarded as having roughly equivalent absolute spacing. Sixth, the item order was randomised anew for each participant to minimise the potential of an order effect.
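The mean ISIs reported for the target items can be recovered from the seven review schedules. The sketch below assumes that the equal and expanding schedules in each group carry equal weight and that the SDs in parentheses are population SDs; neither assumption is stated in the text, and the names `SCHEDULES` and `isi_summary` are illustrative only.

```python
from statistics import mean, pstdev

# Review schedules per group as listed in the text (ISIs in trials).
SCHEDULES = {
    "massed": [(0, 0, 0)],
    "short": [(5, 5, 5), (1, 5, 9)],
    "medium": [(10, 10, 10), (5, 10, 15)],
    "long": [(30, 30, 30), (15, 30, 45)],
}

def isi_summary(schedules):
    """Mean and population SD of all ISIs in a group's schedules."""
    isis = [isi for schedule in schedules for isi in schedule]
    return round(mean(isis), 2), round(pstdev(isis), 2)

for group, schedules in SCHEDULES.items():
    print(group, isi_summary(schedules))
```

Under these assumptions the output agrees with the target-item values given above (for example, a mean of 5 trials with an SD of 2.31 in the short spacing group).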
4.5.7 Pretest Immediately before the treatment, productive and receptive recall tests were given in that order as the pretest. The formats of the pretests were exactly the same as in Experiment 1B (see 2.3.4). In addition to the 20 target words, 13 filler items and three practice words (apple, orange, and banana) were tested. There was one question for each word, and each pretest consisted of 36 questions. At the beginning of the pretest, the three practice words were tested in order to familiarise participants with the test format. Responses for filler and practice items were not included in the results.
As in Experiment 1B, in order to prevent learners from providing synonyms in the productive pretest, one letter in the target word and the number of letters in the word (hereafter, retrieval cue) were given together with the Japanese translation (e.g., _ _ n _ for mane). The retrieval cues in the present study were determined based on the same four rules as in Experiment 1B (see 2.3.4). Appendix F summarises the retrieval cues used in this study. See Appendix A for examples of the productive and receptive pretests. Responses on the pretest were scored using the same procedures as in Study 1: strict and sensitive (see 2.2.11).
4.5.8 Dependent measures Immediate and delayed posttests were administered in the present study. Two tests of form and meaning were given at each test administration: productive recall and
receptive recall in that order. The formats of the posttests were exactly the same as in Study 1 (see 2.2.10). The immediate posttest was given on the same day as the treatment. The delayed posttest was administered 1 week after the treatment. The interval of 1 week was chosen for two reasons. First, since studies have shown that most forgetting occurs immediately after learning, scores on a 1-week delayed posttest may be a good indication of retention over time (see 2.2.10). Second, in pilot studies, no floor effect was observed on the 1-week delayed posttest scores. For these two reasons, an RI of 1 week was chosen. The delayed posttest was administered without prior notice so that participants would not review the target words during the period between the treatment and delayed posttest.
Although all 13 filler items were tested in the pretest, only three filler items (husk, polemic, and smudge) appeared in the posttest. This is because the posttest was administered after the treatment, and whether or not filler items were tested in the posttest would not have affected how much attention learners had paid to fillers during the treatment. There was one question for each word, and each posttest consisted of 23 questions. The three filler items were tested at the beginning of each posttest in order to familiarise participants with the test format. See Appendix B for examples of the productive and receptive posttests. Responses on the posttest were scored using the same procedures as in Study 1: strict and sensitive (see 2.2.11). Responses for filler items were not included in the results.
4.5.9 Questionnaire After the immediate posttest, a questionnaire was administered in Japanese to examine participants’ perceptions about the effect of review schedules on learning. The three experimental groups were asked to evaluate the usefulness of the equal and expanding schedules for learning on a 7-point scale, where 1 means Not helpful at all and 7 means Very helpful. Participants in the control group were asked to evaluate the usefulness of the massed schedule for learning on the same 7-point scale.
4.6 Results As described in 4.5.2, the participants were randomly assigned to one of the four absolute spacing groups: massed (control), short, medium, and long. The participants in the three experimental groups were also randomly divided into Subgroups X and Y. Table 30 summarises the distribution of the participants in each subgroup. For instance, the table shows that the short spacing group consisted of 11 Subgroup X and 13 Subgroup Y participants from GTC and five Subgroup X and four Subgroup Y participants from OU. In order to counterbalance the effects of target items across participants, each of the three experimental groups needs to consist of an equal number of participants from Subgroups X and Y (see 4.5.2). Table 30, however, shows that there was an imbalance in the short spacing group: 16 from Subgroup X and 17 from Subgroup Y. One Subgroup Y participant needs to be excluded from analysis so that the short spacing group will have an equal number of participants (16) from both subgroups.
Table 30 Number of Participants in Study 3

                                  Absolute spacing groups
              Short              Medium               Long               Massed
          X    Y   Total     X    Y   Total     X    Y   Total     X    Y   Total
GTC      11   13    24      12   12    24      13   11    24       -    -    24
OU        5    4     9       5    5    10       3    5     8       -    -     9
Total    16   17    33      17   17    34      16   16    32       -    -    33
Note. X = Subgroup X; Y = Subgroup Y (see 4.5.2).
Because each of the four absolute spacing groups consisted of an equal number of participants from GTC (24), and dropping a GTC student from the short spacing group will cause an imbalance in the number of GTC participants, one OU participant will be excluded from Subgroup Y of the short spacing group. Ideally, a participant needs to be dropped while minimising effects on the mean posttest scores of this subgroup. In order to determine which participant should be excluded, the productive posttest scores of the four OU students in Subgroup Y of the short spacing group are tabulated in Table 31. For instance, the table shows that Participant 1 in this subgroup obtained 9, 12, 0, and 0, respectively, for each of the four productive posttest measures (i.e., strict scoring of the immediate, sensitive scoring of the immediate, strict scoring of the delayed, and sensitive scoring of the delayed productive posttests). In this study, both productive and receptive posttests were conducted (see 4.5.8). Only the productive posttest scores will be considered here because scores on the productive posttest are the main dependent variables in this study: As target words were practised only in productive recall in the current experiment (see 4.5.6), scores
on the productive posttest, which used exactly the same format as the productive recall format during learning, might be a more direct and reliable measure of learning outcomes than those on the receptive posttest (see 2.3.4).
Table 31 Productive Posttest Scores of OU Students in Subgroup Y of the Short Spacing Group

                     Immediate              Delayed
                 Strict   Sensitive     Strict   Sensitive
Participant 1     9        12            0        0
Participant 2    17        18            4        6
Participant 3     4         8            0        1
Participant 4     9        16            0        3
M                 9.75     13.50         1.00     2.50
SD                4.66      3.84         1.73     2.29
Table 32 summarises how dropping a particular student may affect the average posttest scores of Subgroup Y of the short spacing group, OU. In Table 32, the columns with the heading Scores after exclusion give the M and SD of the productive posttest scores when a given participant is excluded from this subgroup. For instance, the table shows that when Participant 1 is dropped, the average productive posttest scores in this subgroup will be 10.00, 14.00, 1.33, and 3.33 for the four measures. The columns with the heading Absolute values of differences before and after exclusion give the absolute values of the changes in the mean posttest scores after a particular participant is dropped. For instance, Table 32 shows that excluding Participant 1 will change the average score of this subgroup by 0.25 (10.00 - 9.75 = 0.25), 0.50 (14.00 - 13.50 = 0.50), 0.33 (1.33 - 1.00 = 0.33), and 0.83 (3.33 - 2.50 = 0.83), respectively, for each of the four productive posttest measures. Average in Table 32 gives the
change in the means when collapsed across the four measures. For instance, 0.48 is given for Average for Participant 1. This means that when collapsed across all four measures, dropping Participant 1 will change the mean score of this subgroup by 0.48 ([0.25 + 0.50 + 0.33 + 0.83] / 4 = 0.48). Table 32 suggests that excluding Participant 1 may have the smallest effect on the mean posttest scores of this subgroup. As a result, it was decided to exclude Participant 1 from the analysis.
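The worked example for Participant 1 can be reproduced with a short script. This is a sketch of the procedure as described in the text, using the scores in Table 31; the names `SCORES` and `mean_change_after_exclusion` are illustrative and do not come from the original analysis.

```python
from statistics import mean

# Productive posttest scores from Table 31, in the order:
# immediate strict, immediate sensitive, delayed strict, delayed sensitive.
SCORES = {
    1: (9, 12, 0, 0),
    2: (17, 18, 4, 6),
    3: (4, 8, 0, 1),
    4: (9, 16, 0, 3),
}

def mean_change_after_exclusion(scores, excluded):
    """Absolute change in each measure's mean when one participant is dropped,
    plus the average of those changes across the measures."""
    before = [mean(col) for col in zip(*scores.values())]
    kept = [row for pid, row in scores.items() if pid != excluded]
    after = [mean(col) for col in zip(*kept)]
    changes = [abs(a - b) for a, b in zip(after, before)]
    return changes, mean(changes)

changes, average = mean_change_after_exclusion(SCORES, excluded=1)
print([round(c, 2) for c in changes], round(average, 2))
```

For Participant 1, the script returns changes of 0.25, 0.50, 0.33, and 0.83, with an average of 0.48, matching the values given in the text.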
Table 32 Productive Posttest Scores of OU Students in Subgroup Y of the Short Spacing Group (After One Participant Is Excluded)

                                                              Absolute values of differences
                          Scores after exclusion              before and after exclusion
Participant               Immediate         Delayed           Immediate         Delayed
to exclude             Strict Sensitive  Strict Sensitive  Strict Sensitive  Strict Sensitive  Average
Participant 1    M     10.00   14.00      1.33    3.33      0.25    0.50      0.33    0.83      0.48
                 SD     5.35    4.32      1.89    2.05
Participant 2    M      7.33   12.00      0.00    1.33      2.67    2.00      1.33    2.00      2.00
                 SD     2.36    3.27      0.00    1.25
Participant 3    M     11.67   15.33      1.33    3.00      4.33    3.33      1.33    1.67      2.67
                 SD     3.77    2.49      1.89    2.45
Participant 4    M     10.00   12.67      1.33    2.33      1.67    2.67      0.00    0.67      1.25
                 SD     5.35    4.11      1.89    2.62
When Participant 1 was dropped, there were eight, 10, eight, and nine OU students in the short, medium, long, and massed groups, respectively, whereas each of the four groups consisted of an equal number of GTC participants (24). Because OU and GTC students might differ in variables that may influence the ability to learn from flashcards (e.g., English proficiency, computer skills, or memory capacity), the four absolute spacing groups should consist of an equal number of OU and GTC students.
In order to redress the imbalance in the number of OU students, two and one OU students were dropped from the medium and massed groups, respectively, using the same procedure described above. After three OU students were excluded, each of the short, medium, long, and massed groups consisted of 24 GTC and 8 OU students, leaving 128 students in total (32 x 4 = 128). Each of the three experimental groups also had an equal number of participants (16) from Subgroups X and Y. The data from these 128 students were analysed in this study.
Pretest None of the participants exhibited prior knowledge of any of the target words on the productive pretest. Participants did not provide synonyms for a target word on the productive pretest either, possibly due in part to the provision of retrieval cues (e.g., _ _ n _ for mane). Table 33 summarises the results of the receptive pretest. Because correct responses in the receptive pretest were used as cues in the productive pretest, administering the latter prior to the former might have affected performance on the receptive pretest. Consequently, some participants might have been able to answer correctly in the receptive pretest without having any prior knowledge (see 2.3.6). As the correct responses on the receptive pretest may not necessarily indicate their prior knowledge, participants who answered correctly on the receptive pretest were not excluded from analysis. In order to correct for differences in the pretest scores, gains from the pretest to the posttest were analysed when examining the receptive test results (see 4.6.1 and 4.6.2).
Table 33 Number of Correct Responses on the Receptive Pretest

                            Strict scoring               Sensitive scoring
Absolute spacing        Equal  Expanding  Total      Equal  Expanding  Total
Massed (a)     M                          0.25                         0.31
               SD                         0.57                         0.59
Short          M        0.13   0.06       0.19       0.16   0.13       0.28
               SD       0.34   0.25       0.47       0.37   0.34       0.52
Medium         M        0.16   0.19       0.34       0.19   0.22       0.41
               SD       0.37   0.64       0.79       0.40   0.66       0.80
Long           M        0.16   0.22       0.38       0.19   0.22       0.41
               SD       0.37   0.42       0.66       0.40   0.42       0.67

Note. n = 32 for each absolute spacing group. The maximum score is 10 for Equal and Expanding and 20 for Total.
(a) There is no score for equal and expanding spacing in the massed group because the massed schedule involved neither equal nor expanding spacing (see 4.5.2).
4.6.1 Effects of absolute spacing Study time First, we will consider the effects of absolute spacing. For the three experimental groups, the data obtained in the equal and expanding conditions were added together when examining the effects of absolute spacing. For instance, if a participant scored 7 under the equal condition and 8 under the expanding condition on a given posttest, 15 was used as his or her posttest score because both of the scores had the same absolute spacing. Let us examine at the outset whether the study time was comparable among the four absolute spacing groups. The average study time (SDs in parentheses) was 12.72 (1.90), 15.37 (2.45), 15.43 (2.32), and 15.90 (1.87) minutes in the massed, short, medium, and long spacing groups, respectively. A one-way ANOVA found a statistically significant difference among the four groups, F (3, 127) = 14.36, p < .001,
η² = .07. The Bonferroni method of multiple comparisons showed that (a) all three experimental groups used significantly more time than the control group (p < .001), producing large effect sizes (1.21 < d < 1.69), and (b) no statistically significant difference existed among the three experimental groups in study time (p = 1.000), and no more than small effect sizes were found (0.03 < d < 0.24). The average response latency during retrieval practice (the time required for participants to type a response for each retrieval attempt) was 5.06 (1.90), 7.70 (2.45), 7.76 (2.32), and 8.23 (1.87) seconds in the massed, short, medium, and long spacing groups, respectively. This means that although retrieval practice was self-paced in this study, the response latency was not considerably longer compared with the previous L2 vocabulary studies, where retrieval practice was paced by the computer (6 seconds: Kang et al., 2013; 8 seconds: Karpicke & Bauernschmidt, 2011; Pyc & Rawson, 2007; see 4.2.3).
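The effect sizes for study time are consistent with Cohen's d computed from the group means and SDs using a pooled standard deviation. The sketch below assumes that formula for equal-sized groups; the text does not state the formula at this point, and the function name `cohens_d` is illustrative.

```python
import math

def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d for two equal-sized groups, using the pooled SD."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return abs(mean_a - mean_b) / pooled_sd

# Study time in minutes: (M, SD) as reported in the text.
study_time = {
    "massed": (12.72, 1.90),
    "short": (15.37, 2.45),
    "medium": (15.43, 2.32),
    "long": (15.90, 1.87),
}

for group in ("short", "medium", "long"):
    d = cohens_d(*study_time[group], *study_time["massed"])
    print(group, round(d, 2))
```

Under this assumption, the comparisons with the massed group give d = 1.21 (short) and d = 1.69 (long), the two endpoints of the range reported above.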
Next, let us investigate how much time intervened between the repetitions of target words in the three experimental groups. Repetitions of a given item were separated by 58.97 (9.61), 117.00 (24.51), and 355.91 (44.92) seconds on average in the short, medium, and long spacing groups, respectively. The difference in the time between the repetitions reflects the difference in the number of intervening trials (short: 15, medium: 30, long: 90; Figure 15). There was no spacing in the massed group because all target items were studied four times in a row. At the time of the immediate posttest, the average ISI/RI ratio was 8.9% (0.7%), 21.8% (11.9%), and 91.3% (15.9%) in the short, medium, and long spacing groups, respectively. The results indicate that the
average ISI/RI ratios in this experiment were similar to those in the pilot studies (see 4.4) and that the ISIs in the experimental groups corresponded to one of the following three target ISI/RI ratios as intended: an ISI that is shorter than (short spacing: 8.9%), falls within (medium spacing: 21.8%), and is longer than the optimal range (long spacing: 91.3%). On the delayed posttest, the average ISI was far shorter than the optimal range as expected: 0.01% (0.00%), 0.02% (0.00%), and 0.06% (0.01%) of the RI in the short, medium, and long spacing groups, respectively.
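The delayed-posttest ratios can be approximated from the group-level figures. The sketch below assumes an RI of exactly seven days and uses the mean time between repetitions in each group; the original analysis presumably averaged ratios computed per participant, so this is an approximation rather than the author's procedure.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # assumed RI for the delayed posttest

def isi_ri_ratio_percent(isi_seconds, ri_seconds):
    """ISI expressed as a percentage of the retention interval."""
    return 100 * isi_seconds / ri_seconds

# Mean time between repetitions (seconds), as reported in the text.
mean_isi = {"short": 58.97, "medium": 117.00, "long": 355.91}

for group, isi in mean_isi.items():
    print(group, round(isi_ri_ratio_percent(isi, SECONDS_PER_WEEK), 2))
```

With these inputs the ratios round to 0.01%, 0.02%, and 0.06%, in line with the values reported for the delayed posttest.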
Learning phase performance There were three retrieval attempts for each target word during the treatment (see 4.5.6). Table 34 summarises the number of correct responses for the three retrieval attempts. In order to determine whether any significant difference existed among the four groups, the number of correct responses was submitted to a two-way mixed design 4 (absolute spacing: massed / short / medium / long) x 3 (retrieval attempt: 1 / 2 / 3) ANOVA. The ANOVA detected significant main effects of absolute spacing, F (3, 124) = 53.24, p < .001, partial η² = .56, and retrieval attempt, F (1.77, 219.96) = 184.78, p < .001, partial η² = .60. The interaction between the two variables was also significant, F (5.32, 219.96) = 11.13, p < .001, partial η² = .21.
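Partial η² can be recovered from an F ratio and its degrees of freedom through the identity partial η² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies this identity to the three effects reported above; the function name is illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

print(round(partial_eta_squared(53.24, 3, 124), 2))         # absolute spacing
print(round(partial_eta_squared(184.78, 1.77, 219.96), 2))  # retrieval attempt
print(round(partial_eta_squared(11.13, 5.32, 219.96), 2))   # interaction
```

The three values round to .56, .60, and .21, matching the effect sizes reported for the learning phase.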
Table 34 Number of Correct Responses During the Learning Phase

                          Retrieval attempts
Absolute spacing          1        2        3
Massed         M        17.72    18.53    18.78
               SD        2.02     1.88     1.18
Short          M         8.66    11.84    13.81
               SD        4.23     5.32     5.00
Medium         M         7.41    10.50    12.50
               SD        4.29     4.34     4.61
Long           M         4.16     6.78     8.72
               SD        3.49     4.63     5.06

Note. n = 32 for each group. The maximum score is 20 for each cell. Responses were scored with the strict scoring procedure (see 2.2.12).
Due to the significant interaction between absolute spacing and the retrieval attempt, the simple main effect of absolute spacing was tested to investigate where the significant differences lay. The simple main effect of absolute spacing was significant on all three retrieval attempts, F (3, 124) > 30.22, p < .001. To follow up the significant simple main effect, the Bonferroni method of multiple comparisons was used to examine where the significant differences lay at each retrieval attempt. The results of the multiple comparisons are summarised in Table 35. For instance, the table shows that the difference between the massed and short groups was statistically significant on the first retrieval attempt (p < .001), and a large effect size was observed (d = 2.73). Table 35 also indicates the following two things. First, regardless of the retrieval attempt, no statistically significant difference existed between the short and medium groups (p = 1.000), showing small effect sizes (0.27 < d < 0.29). Second, the differences were statistically significant for all other comparisons (p < .004), and medium to large effect sizes were found (0.78 < d < 4.75). The findings suggest the
following order in the degree of success on all three retrieval attempts: massed > short = medium > long.
Table 35 Results of Multiple Comparisons for Learning Phase Performance Among the Absolute Spacing Groups

                                          Absolute spacing
Retrieval   Absolute       Massed          Short           Medium          Long
attempts    spacing       p      d        p      d        p      d        p    d
1           Massed
            Short       .000   2.73
            Medium      .000   3.07    1.000   0.29
            Long        .000   4.75     .000   1.16     .003   0.83
2           Massed
            Short       .000   1.68
            Medium      .000   2.40    1.000   0.28
            Long        .000   3.32     .000   1.01     .004   0.83
3           Massed
            Short       .000   1.37
            Medium      .000   1.87    1.000   0.27
            Long        .000   2.74     .000   1.01     .003   0.78

Note. Effect sizes (d) of 0.20, 0.50, and 0.80 are indicative of small, medium, and large effects, respectively (Cohen, 1988).
Posttest performance Table 36 provides the immediate and delayed posttest results for the four absolute spacing groups. The productive and receptive recall test scores were analysed by a two-way mixed design 4 (absolute spacing) x 2 (RI: immediate / 1-week delayed) ANOVA. As some items were answered correctly on the receptive pretest (Table 33), the pretest scores were subtracted from the posttest scores, and gains were analysed when examining the receptive test results. Table 37 shows the results of the ANOVAs. The table indicates the following three things. First, the main effect of RI was
significant on both productive and receptive tests regardless of the scoring system. In other words, the delayed posttest scores were significantly lower than the immediate posttest scores. Second, the ANOVAs showed a significant main effect of absolute spacing on both productive and receptive tests with both scoring procedures. Third, the interaction between absolute spacing and the RI was significant on the productive posttest regardless of the scoring protocol, but not on the receptive posttest.
Table 36 Number of Correct Responses on the Posttests

Immediate posttest
                          Productive            Receptive
Absolute spacing       Strict  Sensitive     Strict  Sensitive
Massed         M        6.31     8.41         9.25     9.53
               SD       4.51     4.87         5.11     5.33
Short          M       11.31    14.06        12.09    12.59
               SD       4.86     4.05         4.57     4.71
Medium         M       12.81    14.81        13.28    13.88
               SD       4.79     4.53         4.90     4.93
Long           M       10.41    12.88        13.94    14.22
               SD       4.75     3.95         3.93     4.01

Delayed posttest
                          Productive            Receptive
Absolute spacing       Strict  Sensitive     Strict  Sensitive
Massed         M        2.72     4.25         7.66     7.84
               SD       2.77     3.34         4.85     4.97
Short          M        3.91     6.63        10.75    11.03
               SD       3.30     4.27         4.66     4.77
Medium         M        5.63     8.63        12.38    12.66
               SD       4.31     4.71         4.83     4.95
Long           M        5.78     8.06        12.97    13.31
               SD       4.30     4.17         4.73     4.80

Collapsed across the RIs
                          Productive            Receptive
Absolute spacing       Strict  Sensitive     Strict  Sensitive
Massed         M        4.52     6.33         8.45     8.69
               SD       3.05     3.50         4.72     4.88
Short          M        7.61    10.34        11.42    11.81
               SD       3.69     3.80         4.33     4.45
Medium         M        9.22    11.72        12.83    13.27
               SD       4.02     3.97         4.41     4.43
Long           M        8.09    10.47        13.45    13.77
               SD       3.95     3.54         3.80     3.91

Note. n = 32 for each group. The maximum score is 20 for each cell.
Table 37 Results of Two-Way ANOVAs for the Posttest Scores

                                     Strict scoring                  Sensitive scoring
Posttests    Effects             df     F       p    partial η²   df     F       p    partial η²
Productive   Absolute spacing     3    9.47   .000     .19         3   12.77   .000     .24
             RI                   1  233.19   .000     .65         1  232.09   .000     .65
             Absolute X RI        3    6.40   .000     .13         3    3.89   .011     .09
Receptive    Absolute spacing     3    7.93   .000     .16         3    8.20   .000     .17
             RI                   1   13.49   .000     .10         1   15.97   .000     .11
             Absolute X RI        3    0.25   .865     .01         3    0.27   .843     .01
Due to the significant main effect of absolute spacing, contrasts were performed to investigate where the significant differences lay when collapsed across the immediate and delayed posttests (Field, 2009). Table 38 presents the p values and effect sizes d for the pair-wise contrasts. For instance, the table shows that when collapsed across the RIs, with strict scoring on the productive posttest, the difference between the massed and short groups was statistically significant (p = .001), and a large effect size was observed (d = 0.91). Table 38 also indicates the following four things. First, when collapsed across the RIs, the three experimental groups significantly outperformed the massed group on both productive and receptive tests with both scoring procedures, and medium to large effect sizes were observed (0.68 < d < 1.44). Second, with strict scoring on the productive posttest, the difference between the short and medium groups showed a tendency towards statistical significance (p = .084), and a small effect size was observed (d = 0.42). Third, with strict scoring on the receptive posttest, the difference between the short and long groups also showed a tendency towards significance (p = .093), producing a small effect size (d = 0.45). Fourth, the differences were not statistically significant for all other comparisons, and no more than small effect sizes were found (0.03 < d < 0.43).
Table 38 Results of Pair-Wise Contrasts on the Posttest (Collapsed Across the RIs)

Productive posttest
            Absolute       Massed          Short           Medium          Long
Scoring     spacing       p      d        p      d        p      d        p    d
Strict      Massed
            Short       .001   0.91
            Medium      .000   1.32     .084   0.42
            Long        .000   1.01     .601   0.13     .226   0.28
Sensitive   Massed
            Short       .000   1.10
            Medium      .000   1.44     .140   0.35
            Long        .000   1.18     .893   0.03     .180   0.33

Receptive posttest
            Absolute       Massed          Short           Medium          Long
Scoring     spacing       p      d        p      d        p      d        p    d
Strict      Massed
            Short       .006   0.68
            Medium      .000   0.94     .254   0.28
            Long        .000   1.13     .093   0.45     .587   0.14
Sensitive   Massed
            Short       .005   0.69
            Medium      .000   0.97     .232   0.30
            Long        .000   1.13     .101   0.43     .652   0.12
Next, in order to break down the significant interaction between absolute spacing and the RI on the productive posttest (Table 37), the simple main effect of RI was tested to examine whether the RI had differential effects on the four absolute spacing groups. The results of the simple main effect tests are summarised in Table 39. The table shows that the delayed posttest scores were significantly lower than the immediate posttest scores in all four groups with both scoring procedures. Larger effect sizes were observed for the short group (strict: 2.25, sensitive: 1.74) compared with the other three (with strict scoring, massed: 1.30, medium: 1.67, long: 1.08; with sensitive scoring, massed: 1.24, medium: 1.31, long: 1.16). Together with the interaction graphs (Figure 18), the results suggest that the productive posttest scores of the short spacing group might have decayed significantly more than those for the other three groups between the immediate and delayed posttests.
Table 39 Results of Simple Main Effect of the RI on the Productive Posttest

Scoring     Absolute spacing       F        p       Δ
Strict      Massed               23.15    .000    1.30
            Short                98.32    .000    2.25
            Medium               92.59    .000    1.67
            Long                 38.34    .000    1.08
Sensitive   Massed               31.42    .000    1.24
            Short               100.60    .000    1.74
            Medium               69.63    .000    1.31
            Long                 42.12    .000    1.16

Note. df = (3, 124).
Figure 18. Number of correct responses on the productive posttest. Brackets enclose +1 SE.
Due to the significant main effect of absolute spacing and interaction between absolute spacing and the RI (Table 37), the Bonferroni method of multiple comparisons was also used to investigate where the significant differences lay at each RI. The results of the multiple comparisons are summarised in Table 40. First, let us consider the immediate posttest results. On the immediate productive posttest, the short, medium, and long groups significantly outperformed the massed group with both strict and sensitive scoring, and large effect sizes were observed (0.88 < d < 1.40). There was no statistically significant difference among the three experimental groups in their immediate productive posttest scores, and only medium or smaller effect sizes were observed (0.17 < d < 0.50). On the immediate receptive posttest, the medium and long groups fared significantly better than the massed group with both strict and sensitive scoring, producing relatively large effect sizes (0.79 < d < 0.99). The difference between the short and massed groups on the immediate receptive posttest only showed a tendency towards statistical significance, yielding medium-sized effects (p = .084 and d = 0.60 with strict scoring and p = .062 and d = 0.62 with sensitive scoring). As in the productive posttest, no statistically significant difference existed among the three experimental groups in their immediate receptive posttest scores, and no more than small effect sizes were observed (0.08 < d < 0.38).
Next, let us consider the scores on the delayed posttest. On the delayed productive and receptive posttests, the medium and long groups fared significantly better than the massed group with both strict and sensitive scoring, producing large effect sizes (0.80
< d < 1.11). The short spacing group failed to significantly outperform the massed group on the delayed productive and receptive posttests regardless of the scoring system, and small to medium effect sizes were found (0.39 < d < 0.67). Although the significant interaction between absolute spacing and the RI suggests that the productive posttest scores of the short group might have decayed significantly more than those of the other groups (see above), there was no statistically significant difference among the three experimental groups on the delayed productive posttest, yielding no more than small effect sizes (0.04 < d < 0.49). The differences among the experimental groups did not reach statistical significance on the delayed receptive posttest either.
Table 40 Results of Multiple Comparisons for Posttest Scores at Each RI

Immediate posttest
                         Absolute       Massed          Short           Medium          Long
Posttests    Scoring     spacing       p      d        p      d        p      d        p    d
Productive   Strict      Massed
                         Short       .000   1.07
                         Medium      .000   1.40    1.000   0.31
                         Long        .004   0.88    1.000   0.19     .264   0.50
             Sensitive   Massed
                         Short       .000   1.26
                         Medium      .000   1.36    1.000   0.17
                         Long        .000   1.01    1.000   0.30     .469   0.46
Receptive    Strict      Massed
                         Short       .084   0.60
                         Medium      .006   0.79    1.000   0.22
                         Long        .001   0.99     .949   0.38    1.000   0.14
             Sensitive   Massed
                         Short       .062   0.62
                         Medium      .003   0.84    1.000   0.24
                         Long        .001   0.97    1.000   0.34    1.000   0.08

Delayed posttest
                         Absolute       Massed          Short           Medium          Long
Posttests    Scoring     spacing       p      d        p      d        p      d        p    d
Productive   Strict      Massed
                         Short      1.000   0.39
                         Medium      .014   0.80     .406   0.45
                         Long        .008   0.85     .279   0.49    1.000   0.04
             Sensitive   Massed
                         Short       .143   0.62
                         Medium      .000   1.07     .338   0.44
                         Long        .002   1.01    1.000   0.34    1.000   0.13
Receptive    Strict      Massed
                         Short       .059   0.67
                         Medium      .001   0.95    1.000   0.30
                         Long        .000   1.08     .563   0.43    1.000   0.11
             Sensitive   Massed
                         Short       .057   0.67
                         Medium      .001   0.96    1.000   0.30
                         Long        .000   1.11     .479   0.45    1.000   0.13
In summary, the posttest scores indicate that (a) medium and long spacing significantly outperformed massed learning on both productive and receptive tests irrespective of the RI and scoring system, (b) short spacing significantly outperformed massed learning only on the immediate productive posttest, (c) the effects of short spacing might not have been as durable as those of the other three treatments, (d) no significant difference existed among short, medium, and long spacing regardless of the posttest, scoring system, or RI, (e) when collapsed across the RIs, the medium group scored higher than the short group with strict scoring on the productive posttest, and the difference showed a tendency towards significance, and (f) when collapsed across the RIs, the long group fared better than the short group with strict scoring on the receptive posttest, and the difference showed a tendency towards significance.
Efficiency
As there was a statistically significant difference in study time among the four absolute spacing groups (see above), the efficiency of the four groups was also compared. Efficiency was defined as the number of words acquired per minute and calculated by dividing the posttest score by the study time (e.g., Kornell, 2009; Mondria, 2003; Pyc & Rawson, 2007). As some items were answered correctly on the receptive pretest (Table 33), gains were used when calculating the efficiency scores for the receptive tests. Table 41 provides the efficiency scores in the four groups. In order to test whether any significant difference existed among the four groups, the efficiency scores were entered into a two-way mixed design 4 (absolute spacing) x 2 (RI) ANOVA. Table 42 shows the results of the ANOVAs. The table indicates the following three things. First, the main effect of RI was significant on both productive and receptive tests regardless of the scoring system. In other words, the efficiency scores on the delayed posttests were lower than those on the immediate posttests. Second, the ANOVAs detected a significant main effect of absolute spacing on the productive test with both scoring procedures, but not on the receptive test. Third, the interaction between absolute spacing and the RI was significant with strict scoring on the productive posttest, but not on the other three measures.
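Table 41
Efficiency Scores in the Four Absolute Spacing Groups

| Absolute spacing | | Immediate, productive, strict | Immediate, productive, sensitive | Immediate, receptive, strict | Immediate, receptive, sensitive | Delayed, productive, strict | Delayed, productive, sensitive | Delayed, receptive, strict | Delayed, receptive, sensitive |
|---|---|---|---|---|---|---|---|---|---|
| Massed | M | 0.51 | 0.68 | 0.73 | 0.75 | 0.22 | 0.34 | 0.60 | 0.61 |
| | SD | 0.36 | 0.41 | 0.43 | 0.45 | 0.25 | 0.28 | 0.40 | 0.40 |
| Short | M | 0.78 | 0.96 | 0.82 | 0.84 | 0.27 | 0.46 | 0.72 | 0.73 |
| | SD | 0.39 | 0.35 | 0.38 | 0.39 | 0.25 | 0.33 | 0.37 | 0.38 |
| Medium | M | 0.83 | 0.96 | 0.84 | 0.88 | 0.37 | 0.57 | 0.81 | 0.82 |
| | SD | 0.33 | 0.32 | 0.34 | 0.35 | 0.28 | 0.30 | 0.38 | 0.39 |
| Long | M | 0.68 | 0.83 | 0.87 | 0.89 | 0.37 | 0.52 | 0.81 | 0.83 |
| | SD | 0.35 | 0.31 | 0.31 | 0.31 | 0.29 | 0.28 | 0.35 | 0.35 |

Note. n = 32 for each group.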
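The efficiency measure defined above (score divided by study time, with gains used for the receptive test) can be illustrated with a minimal sketch. The function name `efficiency` and the learner's numbers are hypothetical and are not data from the study.

```python
def efficiency(posttest_score, study_minutes, pretest_score=0.0):
    """Return words gained per minute of study.

    pretest_score defaults to 0; pass it for receptive tests so that
    gains (posttest minus pretest) are used instead of raw scores.
    """
    if study_minutes <= 0:
        raise ValueError("study_minutes must be positive")
    return (posttest_score - pretest_score) / study_minutes


# Hypothetical learner (illustrative numbers, not data from the study).
productive_efficiency = efficiency(posttest_score=6, study_minutes=8.0)
receptive_efficiency = efficiency(posttest_score=7, study_minutes=8.0,
                                  pretest_score=1)

print(f"productive: {productive_efficiency:.2f} words/min")
print(f"receptive:  {receptive_efficiency:.2f} words/min")
```

Under this definition, a group that studies for less time is only more efficient if its posttest score does not drop proportionally.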
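Table 42
Results of Two-Way ANOVAs for the Efficiency Scores

| Posttests | Effects | Strict: df | Strict: F | Strict: p | Strict: partial η² | Sensitive: df | Sensitive: F | Sensitive: p | Sensitive: partial η² |
|---|---|---|---|---|---|---|---|---|---|
| Productive | Absolute spacing | 3 | 4.11 | .008 | .09 | 3 | 4.70 | .004 | .10 |
| | RI | 1 | 194.71 | .000 | .61 | 1 | 187.69 | .000 | .60 |
| | Absolute X RI | 3 | 4.13 | .008 | .09 | 3 | 2.26 | .085 | .05 |
| Receptive | Absolute spacing | 3 | 1.69 | .172 | .04 | 3 | 1.79 | .152 | .04 |
| | RI | 1 | 9.78 | .002 | .07 | 1 | 11.31 | .001 | .08 |
| | Absolute X RI | 3 | 0.62 | .603 | .01 | 3 | 0.56 | .642 | .01 |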
Due to the significant main effect of absolute spacing and the interaction between absolute spacing and the RI, the Bonferroni method of multiple comparisons was used to examine where the significant differences lay on the productive posttest. The results of the multiple comparisons are summarised in Table 43. The multiple comparisons were not performed for the receptive test because neither the main effect of absolute spacing nor the interaction between absolute spacing and the RI was significant (Table 42). Table 43 indicates the following three things. First, on the immediate productive posttest, the short and medium groups were significantly more efficient than the massed group with both strict and sensitive scoring, and medium to large effect sizes were observed (0.72 < d < 0.93). Second, with sensitive scoring on the delayed productive posttest, the medium group fared significantly better than the massed group, producing a medium sized effect (d = 0.75). Third, the differences were not statistically significant for all other comparisons, and medium or smaller effect sizes were found (0.01 < d < 0.61). Overall, the results indicate that there was little difference in the efficiency scores except the following: short = medium > massed on the immediate productive test and medium > massed on the delayed
productive test with sensitive scoring. The results suggest that although the control group required the least study time, it was no more efficient than any of the experimental groups.
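As an illustration of the Bonferroni method used for these comparisons, the sketch below adjusts the p values for the six pairwise comparisons among four groups. It assumes the common convention of multiplying each raw p value by the number of comparisons and capping the result at 1, which would be consistent with the many p = 1.000 entries in the tables. The raw p values are hypothetical, not values from the study.

```python
from itertools import combinations


def bonferroni(p_values):
    """Multiply each raw p value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


groups = ["massed", "short", "medium", "long"]
pairs = list(combinations(groups, 2))  # 6 pairwise comparisons

# Hypothetical raw p values, one per pair, for illustration only.
raw_p = [0.003, 0.0005, 0.064, 0.4, 0.09, 0.7]
adjusted = dict(zip(pairs, bonferroni(raw_p)))

for pair, p in adjusted.items():
    print(pair, f"{p:.3f}")
```

With six comparisons, any raw p value above roughly .167 is reported as 1.000 under this convention.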
Table 43
Results of Multiple Comparisons for Efficiency Scores (Productive Posttest)

| Scoring | Group | Immediate: vs. Massed (p, d) | Immediate: vs. Short (p, d) | Immediate: vs. Medium (p, d) | Delayed: vs. Massed (p, d) | Delayed: vs. Short (p, d) | Delayed: vs. Medium (p, d) |
|---|---|---|---|---|---|---|---|
| Strict | Short | .018, 0.72 | | | 1.000, 0.19 | | |
| | Medium | .003, 0.93 | 1.000, 0.14 | | .199, 0.55 | .923, 0.36 | |
| | Long | .385, 0.47 | 1.000, 0.28 | .514, 0.46 | .164, 0.56 | .795, 0.37 | 1.000, 0.02 |
| Sensitive | Short | .009, 0.74 | | | .804, 0.37 | | |
| | Medium | .008, 0.78 | 1.000, 0.01 | | .024, 0.75 | .940, 0.34 | |
| | Long | .482, 0.42 | .861, 0.39 | .788, 0.42 | .139, 0.61 | 1.000, 0.19 | 1.000, 0.16 |

Note. The multiple comparisons were not performed for the receptive test because neither the main effect of absolute spacing nor the interaction between absolute spacing and the RI was significant (Table 42).
Questionnaire
In the questionnaire given after the immediate posttest, the three experimental groups were asked to evaluate the usefulness of the equal and expanding schedules for learning on a 7-point scale, where 1 means Not helpful at all and 7 means Very helpful. Participants in the control group were asked to evaluate the usefulness of the massed schedule for learning on the same 7-point scale. In the experimental groups, the scores for equal and expanding spacing were averaged out and compared with those for the massed schedule. For instance, if a participant in the short spacing group gave 5 for equal and 6 for expanding spacing, 5.5 was used as this participant’s rating for the short spacing schedule. Because there was little difference between the ratings for equal (M = 4.87, SD = 1.52) and expanding spacing (M = 4.83, SD = 1.55; see 4.6.2 for details), it was judged appropriate to use the mean score of the two relative spacing schedules to represent a given absolute spacing schedule.
The average rating (SDs in parentheses) was 3.78 (1.41), 5.13 (1.09), 4.83 (1.73), and 4.60 (1.44) for the massed, short, medium, and long spacing schedules, respectively.⁹ A one-way ANOVA found a statistically significant difference among the four groups, F (3, 123) = 5.18, p = .002, η² = .01. According to the Bonferroni method of multiple comparisons, the scores for the short and medium spacing schedules were significantly higher than those for the massed schedule, producing medium to large effect sizes (massed vs. short: p = .002, d = 1.07; massed vs. medium: p = .025, d = 0.66). The difference between the massed and long schedules did not reach statistical significance (p = .154), but a medium effect size was observed (d = 0.59). The differences were not statistically significant for all other comparisons (p < 1.000), and no more than small effect sizes were observed (0.15 < d < 0.42). These results suggest that (a) the participants perceived the short and medium spacing schedules to be more effective than the massed schedule, and (b) the three experimental groups did not differ significantly from each other in their responses.

⁹ One participant in the long spacing group did not provide responses.
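The effect sizes reported for the questionnaire can be approximated from the published means and SDs. The sketch below assumes Cohen's d with an unweighted pooled SD, which is appropriate for groups of equal size. Whether this is the exact formula used in the analysis is an assumption. It reproduces the reported d = 1.07 for massed vs. short, and the other two comparisons with the massed schedule come out within about 0.02 of the reported values, presumably because the published means and SDs are rounded and one group had n = 31.

```python
import math


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d using the unweighted pooled SD (assumes equal group sizes)."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_b - mean_a) / pooled_sd


# Questionnaire ratings reported in the text: (mean, SD).
massed = (3.78, 1.41)
short = (5.13, 1.09)
medium = (4.83, 1.73)
long_ = (4.60, 1.44)

d_short = cohens_d(*massed, *short)
d_medium = cohens_d(*massed, *medium)
d_long = cohens_d(*massed, *long_)

print(f"massed vs. short:  d = {d_short:.2f}")
print(f"massed vs. medium: d = {d_medium:.2f}")
print(f"massed vs. long:   d = {d_long:.2f}")
```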
4.6.2 Effects of relative spacing
Study time
Next, we will consider the effects of equal and expanding spacing. Because the massed schedule involved neither equal nor expanding spacing, the massed group was excluded, and only the short, medium, and long spacing groups were compared in this section. First, let us examine whether the study time was comparable between the equal and expanding schedules. The average study time (SDs in parentheses) was 7.68 (1.47), 7.69 (1.14), 7.60 (1.12), 7.83 (1.26), 7.84 (0.97), and 8.06 (1.15) minutes in the short equal, short expanding, medium equal, medium expanding, long equal, and long expanding schedules, respectively. When collapsed across the absolute spacing groups, the average study time was 7.71 (1.20) minutes for equal and 7.86 (1.18) minutes for expanding spacing. The study time was analysed by a two-way mixed design 2 (relative spacing: equal / expanding) x 3 (absolute spacing: short / medium / long) ANOVA. None of the main effects or interaction was significant, F (1, 93) = 3.01, p = .086, partial η² = .03 for the main effect of relative spacing, F (2, 93) = 0.53, p = .589, partial η² = .01 for the main effect of absolute spacing, and F (2, 93) = 0.66, p = .517, partial η² = .01 for the interaction between the two variables. The results are also supported by the small effect sizes (.01 < partial η² < .03). The 95% confidence intervals of difference were also narrow: [-0.31, 0.29] in the short, [-0.53, 0.08] in the medium, and [-0.53, 0.08] in the long spacing groups. As the difference in study time was rather small, the efficiency scores (posttest score divided by the study time; see 4.5.2) were not calculated for the comparison of equal and expanding spacing.
Next, let us investigate how much time intervened between the repetitions of target words in the equal and expanding spacing conditions. Table 44 gives the average number of seconds between encounters of a given word pair in the three groups. The table shows that repetitions of a given item were separated by 59.24 (9.64), 58.70 (9.61), 116.54 (24.51), 117.46 (24.54), 353.74 (45.32), and 358.08 (44.75) seconds on average in the short equal, short expanding, medium equal, medium expanding, long equal, and long expanding schedules, respectively. The results suggest that the mean ISIs of the six spacing conditions were as long as or longer than in all previous studies on paired-associate learning that have found an advantage of expanding spacing (5-60 seconds). This enables us to test the view that expanding spacing may be particularly effective when the treatment involves long absolute spacing (see 4.2.2). Table 44 also suggests that whether trial or time is used as an index of spacing, the intervals between encounters were gradually increased in expanding spacing, whereas the intervals were held roughly constant in equal spacing.
Table 44
Average Spacing (Seconds) by Absolute Spacing and Relative Spacing

| Absolute spacing | Relative spacing (trials) | | E1-E2 | E2-E3 | E3-E4 | Average |
|---|---|---|---|---|---|---|
| Short | Equal (5-5-5) | M | 63.60 | 57.95 | 56.18 | 59.24 |
| | | SD | 10.53 | 9.47 | 10.33 | 9.64 |
| | Expanding (1-5-9) | M | 15.87 | 59.43 | 100.82 | 58.70 |
| | | SD | 1.90 | 11.21 | 16.72 | 9.61 |
| Medium | Equal (10-10-10) | M | 114.78 | 125.06 | 109.78 | 116.54 |
| | | SD | 21.57 | 31.80 | 23.44 | 24.51 |
| | Expanding (5-10-15) | M | 57.90 | 126.17 | 168.30 | 117.46 |
| | | SD | 10.40 | 31.79 | 34.40 | 24.54 |
| Long | Equal (30-30-30) | M | 345.31 | 365.41 | 350.51 | 353.74 |
| | | SD | 52.20 | 62.68 | 47.20 | 45.32 |
| | Expanding (15-30-45) | M | 177.23 | 371.63 | 525.39 | 358.08 |
| | | SD | 26.85 | 70.32 | 66.59 | 44.75 |

Note. n = 32 for each group. E1 = encounter 1; E2 = encounter 2; E3 = encounter 3; E4 = encounter 4.
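The two properties read off Table 44 (intervals grow in expanding spacing but not in equal spacing, while the average interval is roughly matched within each absolute spacing group) can be checked directly from the tabled means. The sketch below is only an illustration. The names `intervals`, `average_isi`, and `is_expanding` are hypothetical, and averaging the three group means only approximates the reported Average column, which was presumably computed per participant.

```python
from statistics import mean

# Mean seconds between encounters (E1-E2, E2-E3, E3-E4) from Table 44.
intervals = {
    ("short", "equal"): [63.60, 57.95, 56.18],
    ("short", "expanding"): [15.87, 59.43, 100.82],
    ("medium", "equal"): [114.78, 125.06, 109.78],
    ("medium", "expanding"): [57.90, 126.17, 168.30],
    ("long", "equal"): [345.31, 365.41, 350.51],
    ("long", "expanding"): [177.23, 371.63, 525.39],
}


def is_expanding(seq):
    """True if every interval is longer than the one before it."""
    return all(a < b for a, b in zip(seq, seq[1:]))


average_isi = {key: mean(values) for key, values in intervals.items()}

for (absolute, relative), values in intervals.items():
    print(f"{absolute:6s} {relative:9s} "
          f"mean = {average_isi[(absolute, relative)]:7.2f} s, "
          f"strictly increasing = {is_expanding(values)}")
```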
In order to investigate whether absolute spacing in the equal and expanding conditions was equivalent when time, not trial, was used as an index of spacing (see 4.5.4), the average amount of time between repetitions of a given item in the six spacing conditions (Average in Table 44) was analysed by a two-way mixed design 2 (relative spacing) x 3 (absolute spacing) ANOVA. The ANOVA showed a significant main effect of relative spacing, F (1, 93) = 16.14, p < .001, partial η² = .15. This means that when collapsed across the three experimental groups, expanding spacing had longer absolute spacing than equal spacing. However, the difference in the means (178.08 seconds for expanding and 176.51 seconds for equal spacing) was rather small. The 95% confidence intervals of difference were also narrow [0.79, 2.35]. These findings suggest that although statistically significant, the difference between the two relative spacing schedules in the average amount of time between repetitions might not have been substantively (Kline, 2004) or practically significant (Kirk, 1996).
The interaction between relative and absolute spacing was also significant, F (2, 93) = 13.66, p < .001, partial η² = .23. Due to the significant interaction, the simple main effect of relative spacing was tested to examine where the significant differences lay. The simple main effect tests showed that expanding spacing had a significantly longer time between repetitions than equal spacing in the long spacing group, F (1, 93) = 41.01, p < .001. Nonetheless, the difference in the means (expanding: 358.08 seconds, equal: 353.74 seconds) was rather small, yielding a very small effect size (Δ = 0.10). The 95% confidence intervals of difference were also narrow [2.99, 5.68]. The results suggest that despite statistical significance, the difference between equal and expanding spacing might have been negligible for the long spacing group. No statistically significant difference existed between the two relative spacing schedules in either the short, F (1, 93) = 0.63, p = .428, or medium spacing group, F (1, 93) = 1.82, p = .180, producing very small effect sizes (0.04 < Δ < 0.06). The 95% confidence intervals of difference were also narrow: [-0.81, 1.88] in the short and [-2.26, 0.43] in the medium spacing groups. In summary, although the ANOVA showed a significant main effect and interaction, given the small effect sizes and narrow confidence intervals of difference, it might be reasonable to assume that equal and expanding spacing had roughly equivalent absolute spacing in all experimental groups whether trial or time is used as an index of spacing. The statistical significance may be partially due to the relatively large cell size (32 x 3 = 96). When the cell size is large, even a small effect can be statistically significant (e.g., Kline, 2004; Norris & Ortega, 2000).
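The partial η² values reported alongside these F tests can be recovered from the F statistic and its degrees of freedom, because partial η² = SS_effect / (SS_effect + SS_error) = (F × df_effect) / (F × df_effect + df_error). The sketch below applies this identity to the main effect and the interaction reported above. The function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from an F ratio and its degrees of freedom."""
    numerator = f_value * df_effect
    return numerator / (numerator + df_error)


# Main effect of relative spacing on time between repetitions: F(1, 93) = 16.14
main_effect = partial_eta_squared(16.14, 1, 93)

# Relative x absolute spacing interaction: F(2, 93) = 13.66
interaction = partial_eta_squared(13.66, 2, 93)

print(f"main effect: partial eta squared = {main_effect:.2f}")
print(f"interaction: partial eta squared = {interaction:.2f}")
```

The same identity shows why a large error df makes even a small partial η² statistically detectable, which is the point made about cell size above.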
Learning phase performance
Table 45 summarises the number of correct responses during the learning phase as a function of absolute and relative spacing. In order to determine whether any significant difference existed between the two relative spacing schedules, the number of correct responses was submitted to a three-way mixed design 2 (relative spacing) x 3 (absolute spacing) x 3 (retrieval attempt: 1 / 2 / 3) ANOVA. According to the ANOVA, the main effect of relative spacing was not statistically significant, F (1, 186) = 0.19, p = .660, partial η² < .01. The interaction between relative spacing and the retrieval attempt was significant, F (2, 186) = 9.76, p < .001, partial η² = .09. None of the other interactions involving relative spacing were statistically significant, F (2, 186) = 1.18, p = .313, partial η² = .02 for the interaction between relative spacing and absolute spacing, and F (4, 186) = 0.59, p = .673, partial η² = .01 for the interaction between relative spacing, absolute spacing, and the retrieval attempt.
Table 45
Number of Correct Responses During the Learning Phase

| Absolute spacing | | Equal: Retrieval 1 | Equal: Retrieval 2 | Equal: Retrieval 3 | Expanding: Retrieval 1 | Expanding: Retrieval 2 | Expanding: Retrieval 3 |
|---|---|---|---|---|---|---|---|
| Short (n = 32) | M | 3.84 | 5.91 | 7.00 | 4.81 | 5.94 | 6.81 |
| | SD | 2.64 | 2.72 | 2.51 | 2.07 | 2.91 | 2.64 |
| Medium (n = 32) | M | 3.44 | 5.19 | 6.44 | 3.97 | 5.31 | 6.06 |
| | SD | 2.41 | 2.44 | 2.59 | 2.12 | 2.15 | 2.27 |
| Long (n = 32) | M | 2.00 | 3.56 | 4.56 | 2.16 | 3.22 | 4.16 |
| | SD | 1.93 | 2.69 | 2.83 | 1.87 | 2.28 | 2.55 |
| Total (n = 96) | M | 3.09 | 4.89 | 6.00 | 3.65 | 4.82 | 5.68 |
| | SD | 2.45 | 2.77 | 2.82 | 2.29 | 2.71 | 2.71 |

Note. The maximum score is 10 for each cell. Responses were scored with the strict scoring procedure (see 2.2.11).
As the interaction between relative spacing and the retrieval attempt proved significant, the simple main effect of relative spacing was tested to examine where the significant differences lay. When collapsed across the three experimental groups, expanding spacing significantly outperformed equal spacing on the first retrieval attempt, F (1, 95) = 9.59, p = .003, partial η² = .09, Δ = 0.22. On the third retrieval attempt, equal spacing fared significantly better than expanding spacing, F (1, 95) = 4.01, p = .048, partial η² = .04, Δ = 0.12. Yet, no more than small effect sizes were found (0.12 < Δ < 0.22). The difference was not significant on the second retrieval attempt, F (1, 95) = 0.13, p = .724, producing a very small effect size (partial η² < .01, Δ = 0.02). Taken together, the results suggest that although expanding spacing performed slightly better than equal spacing on the first retrieval attempt, its advantage was not observed on either the second or third retrieval.
Posttest performance
Table 46 summarises the immediate and delayed posttest results as a function of absolute and relative spacing. The productive and receptive test scores were analysed by a three-way mixed design 2 (relative spacing) x 3 (absolute spacing) x 2 (RI: immediate / 1-week delayed) ANOVA. As some items were answered correctly on the receptive pretest (Table 33), the pretest scores were subtracted from the posttest scores, and gains were analysed when examining the receptive test results. Table 47 shows the results of the ANOVAs. The table indicates the following four things. First, the main effect of RI was significant on both productive and receptive tests with both scoring procedures. In other words, the delayed posttest scores were significantly lower than the immediate posttest scores. Second, the ANOVA showed a significant main effect of relative spacing with sensitive scoring on the receptive posttest. This means that when collapsed across the three experimental groups and the RIs, participants made significantly more gains on this measure in the expanding schedule than in the equal schedule. However, only a small effect size was observed (partial η² = .05, Δ = 0.12), and the difference in the mean gains was also small (6.43 for expanding and 6.15 for equal spacing). The 95% confidence intervals of difference were also narrow [0.03, 0.53]. Taken together, the results seem to suggest that although statistically significant, the difference might not have been of substantive (Kline, 2004) or practical significance (Kirk, 1996). Third, neither the interaction between relative and absolute spacing nor the interaction between relative spacing, absolute spacing, and the RI was significant on any of the dependent variables. These non-significant interactions suggest that contrary to the view that the expanding retrieval effect may be observed when absolute spacing is relatively long (see 4.2.2), expanding spacing failed to outperform equal spacing regardless of absolute spacing. Fourth, the interaction between relative spacing and the RI was significant with strict scoring on the productive posttest and with strict and sensitive scoring on the receptive posttest.
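Table 46
Number of Correct Responses on the Posttests

| Posttests | Absolute spacing | | Immediate, strict, equal | Immediate, strict, expanding | Immediate, sensitive, equal | Immediate, sensitive, expanding | Delayed, strict, equal | Delayed, strict, expanding | Delayed, sensitive, equal | Delayed, sensitive, expanding | Collapsed, strict, equal | Collapsed, strict, expanding | Collapsed, sensitive, equal | Collapsed, sensitive, expanding |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Productive | Short (n = 32) | M | 5.59 | 5.72 | 6.97 | 7.09 | 2.00 | 1.91 | 3.25 | 3.38 | 3.80 | 3.81 | 5.11 | 5.23 |
| | | SD | 2.71 | 2.41 | 2.28 | 2.12 | 1.78 | 1.87 | 2.29 | 2.28 | 2.04 | 1.83 | 2.04 | 1.96 |
| | Medium (n = 32) | M | 6.25 | 6.56 | 7.38 | 7.44 | 3.06 | 2.56 | 4.19 | 4.44 | 4.66 | 4.56 | 5.78 | 5.94 |
| | | SD | 2.92 | 2.17 | 2.67 | 2.14 | 2.71 | 1.97 | 2.73 | 2.33 | 2.56 | 1.73 | 2.41 | 1.76 |
| | Long (n = 32) | M | 5.31 | 5.09 | 6.41 | 6.47 | 3.13 | 2.66 | 4.09 | 3.97 | 4.22 | 3.88 | 5.25 | 5.22 |
| | | SD | 2.78 | 2.40 | 2.41 | 1.88 | 2.34 | 2.39 | 2.37 | 2.12 | 2.31 | 2.01 | 2.18 | 1.66 |
| | Total (n = 96) | M | 5.72 | 5.79 | 6.92 | 7.00 | 2.73 | 2.38 | 3.84 | 3.93 | 4.22 | 4.08 | 5.38 | 5.46 |
| | | SD | 2.80 | 2.38 | 2.47 | 2.07 | 2.34 | 2.09 | 2.48 | 2.26 | 2.32 | 1.87 | 2.21 | 1.81 |
| Receptive | Short (n = 32) | M | 6.09 | 6.00 | 6.31 | 6.28 | 5.25 | 5.50 | 5.31 | 5.72 | 5.67 | 5.75 | 5.81 | 6.00 |
| | | SD | 2.39 | 2.42 | 2.44 | 2.54 | 2.58 | 2.36 | 2.63 | 2.45 | 2.30 | 2.21 | 2.36 | 2.31 |
| | Medium (n = 32) | M | 6.53 | 6.75 | 6.84 | 7.03 | 5.66 | 6.72 | 5.88 | 6.78 | 6.09 | 6.73 | 6.36 | 6.91 |
| | | SD | 2.48 | 2.70 | 2.48 | 2.71 | 2.51 | 2.47 | 2.62 | 2.49 | 2.21 | 2.31 | 2.24 | 2.29 |
| | Long (n = 32) | M | 6.94 | 7.00 | 7.06 | 7.16 | 6.47 | 6.50 | 6.56 | 6.75 | 6.70 | 6.75 | 6.81 | 6.95 |
| | | SD | 2.31 | 1.85 | 2.34 | 1.90 | 2.50 | 2.48 | 2.59 | 2.42 | 2.15 | 1.84 | 2.24 | 1.82 |
| | Total (n = 96) | M | 6.52 | 6.58 | 6.74 | 6.82 | 5.79 | 6.24 | 5.92 | 6.42 | 6.16 | 6.41 | 6.33 | 6.62 |
| | | SD | 2.39 | 2.36 | 2.42 | 2.41 | 2.55 | 2.47 | 2.64 | 2.48 | 2.24 | 2.16 | 2.29 | 2.18 |

Note. The maximum score is 10 for each cell. Collapsed = collapsed across the RIs.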
Table 47
Results of Three-Way ANOVAs for the Posttest Scores

| Posttests | Effects | Strict: df | Strict: F | Strict: p | Strict: partial η² | Sensitive: df | Sensitive: F | Sensitive: p | Sensitive: partial η² |
|---|---|---|---|---|---|---|---|---|---|
| Productive | Relative spacing | 1 | 0.76 | .386 | .01 | 1 | 0.33 | .569 | .00 |
| | Relative X Absolute | 2 | 0.43 | .649 | .01 | 2 | 0.16 | .854 | .00 |
| | RI | 1 | 225.33 | .000 | .71 | 1 | 219.03 | .000 | .70 |
| | RI X Relative | 1 | 4.54 | .036 | .05 | 1 | 0.00 | 1.000 | .00 |
| | RI X Relative X Absolute | 2 | 0.93 | .399 | .02 | 2 | 0.29 | .750 | .01 |
| Receptive | Relative spacing | 1 | 3.88 | .052 | .04 | 1 | 5.10 | .026 | .05 |
| | Relative X Absolute | 2 | 2.29 | .107 | .05 | 2 | 0.95 | .391 | .02 |
| | RI | 1 | 7.39 | .008 | .07 | 1 | 9.26 | .003 | .09 |
| | RI X Relative | 1 | 4.17 | .044 | .04 | 1 | 4.44 | .038 | .05 |
| | RI X Relative X Absolute | 2 | 1.80 | .171 | .04 | 2 | 0.84 | .437 | .02 |

Note. The main effect of absolute spacing and the interaction between absolute spacing and the RI are not reported here because they are not relevant to the effects of relative spacing.
As the interaction between relative spacing and the RI proved significant, the simple main effect of relative spacing was tested to investigate where the significant differences lay. The results of the simple main effect tests are summarised in Table 48. For instance, the table shows that when collapsed across the three experimental groups, with strict scoring on the immediate productive posttest, the difference between equal and expanding spacing was not statistically significant (p = .703), and the effect size was very small (Δ = 0.03) with the 95% confidence intervals of difference being [-0.45, 0.31]. Table 48 also indicates the following three things. First, when collapsed across all experimental groups, with strict scoring on the delayed productive posttest, participants scored higher in the equal schedule (M = 2.73) than in the expanding schedule (M = 2.38; Table 46). Yet, the difference fell short of significance (p = .062), and only a very small effect size was observed (Δ = 0.17). Second, on the delayed receptive posttest, expanding spacing significantly outperformed equal spacing with both scoring methods. However, despite statistical significance, only very small effect sizes were found (0.17 < Δ < 0.18), and the difference in the means between expanding (6.08 with strict and 6.23 with sensitive scoring) and equal spacing (5.65 with strict and 5.74 with sensitive scoring) was small. These findings are also supported by the relatively narrow confidence intervals of difference: [-0.75, -0.13] with strict and [-0.80, -0.18] with sensitive scoring. Third, the differences were not statistically significant for all other comparisons, producing very small effect sizes (0.02 < Δ < 0.04). Overall, although statistically significant effects were found, given the small effect sizes and narrow confidence intervals of difference, it might be reasonable to assume that relative spacing had little effect on posttest results irrespective of absolute spacing or the RI. The statistical significance may be partially due to the relatively large cell size (96; Kline, 2004; Norris & Ortega, 2000).
Table 48
Results of Simple Main Effect of Relative Spacing

| Posttests | Scoring | RI | F | p | partial η² | Δ | 95% CI of difference |
|---|---|---|---|---|---|---|---|
| Productive | Strict | Immediate | 0.15 | .703 | .00 | 0.03 | [-0.45, 0.31] |
| | | 1 week | 3.58 | .062 | .04 | 0.17 | [-0.02, 0.73] |
| | Sensitive | Immediate | 0.23 | .635 | .00 | 0.03 | [-0.43, 0.26] |
| | | 1 week | 0.22 | .637 | .00 | 0.04 | [-0.43, 0.27] |
| Receptive | Strict | Immediate | 0.11 | .743 | .00 | 0.02 | [-0.37, 0.26] |
| | | 1 week | 7.74 | .007 | .08 | 0.17 | [-0.75, -0.13] |
| | Sensitive | Immediate | 0.21 | .652 | .00 | 0.03 | [-0.39, 0.25] |
| | | 1 week | 9.76 | .002 | .09 | 0.18 | [-0.80, -0.18] |

Note. df = (1, 95). Effect sizes (Δ) of 0.20, 0.50, and 0.80 are indicative of small, medium, and large effects, respectively (Cohen, 1988). 95% CI of difference = 95% confidence intervals of difference.
Questionnaire
In the questionnaire given after the immediate posttest, the three experimental groups were asked to evaluate the usefulness of the equal and expanding schedules for learning on a 7-point scale, where 1 means Not helpful at all and 7 means Very helpful. The average rating (SDs in parentheses) was 5.16 (1.08), 5.09 (1.30), 4.91 (1.78), 4.75 (1.88), 4.55 (1.59), and 4.65 (1.40) for the short equal, short expanding, medium equal, medium expanding, long equal, and long expanding schedules, respectively.¹⁰ When collapsed across the absolute spacing groups, the average rating was 4.87 (1.52) for equal spacing and 4.83 (1.55) for expanding spacing. The responses were entered into a two-way mixed design 2 (relative spacing) x 3 (absolute spacing) ANOVA. None of the main effects or interaction was significant, F (1, 92) = 0.14, p = .706, partial η² < .01 for the main effect of relative spacing, F (2, 92) = 1.07, p = .349, partial η² = .02 for the main effect of absolute spacing, and F (2, 92) = 0.47, p = .627, partial η² = .01 for the interaction between the two variables. The results are also supported by the small effect sizes (partial η² < .02). The findings indicate that (a) no statistically significant difference existed between equal and expanding spacing in the ratings, and (b) the three experimental groups did not differ significantly from each other in their responses regarding the effects of equal and expanding spacing.

¹⁰ One participant in the long spacing group did not provide responses.
4.7 Discussion
4.7.1 Effects of absolute spacing
The first purpose of this study was to test the view that the optimal ISI may be determined relative to the RI. The results of this study did not support this view. On the immediate posttest, the mean ISI/RI ratio (SDs in parentheses) was 8.9% (0.7%), 21.8% (11.9%), and 91.3% (15.9%) in the short, medium, and long spacing groups, respectively (see 4.6.1). If we assume that there exists an inverted U-shaped relationship between the ISI and retention (Figure 14) and that the optimal ISI falls within 10 - 30% of the RI (see 4.1.2), we will be able to predict the following order on the immediate posttest: medium > short > long. The present study, however, did not show any statistically significant difference among the three experimental treatments on the immediate posttest. On the delayed posttest, the average ISI/RI ratio was 0.01% (0.00%), 0.02% (0.00%), and 0.06% (0.01%) in the short, medium, and long spacing groups, respectively (see 4.6.1). If we assume that increasing absolute spacing increases retention sharply when the ISIs are shorter than the optimal ISI (Finding 2 in Figure 14), we will be able to predict the following order on the delayed posttest: long > medium > short. Yet, no statistically significant difference was found among the three experimental groups on the delayed posttest either.
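The ISI/RI ratios behind these predictions can be illustrated with a short sketch. It makes two assumptions that are not stated in this section: that the RI for the delayed posttest is exactly 7 days, and that group-level mean ISIs (the Average column of Table 44, averaged over equal and expanding spacing) may stand in for the per-participant values that were presumably used. The immediate-posttest ratios are taken as reported, and the 10-30% window is the assumed optimal range discussed above.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # assumed RI for the delayed posttest

# Average ISI in seconds per schedule (Average column of Table 44).
table44_average = {
    "short": {"equal": 59.24, "expanding": 58.70},
    "medium": {"equal": 116.54, "expanding": 117.46},
    "long": {"equal": 353.74, "expanding": 358.08},
}


def isi_ri_percent(isi_seconds, ri_seconds):
    """Express the ISI as a percentage of the RI."""
    return 100.0 * isi_seconds / ri_seconds


delayed_ratio = {}
for group, schedules in table44_average.items():
    mean_isi = sum(schedules.values()) / len(schedules)
    delayed_ratio[group] = isi_ri_percent(mean_isi, SECONDS_PER_WEEK)

# Immediate-posttest ratios as reported in the text (percent).
immediate_ratio = {"short": 8.9, "medium": 21.8, "long": 91.3}
OPTIMAL_WINDOW = (10.0, 30.0)  # assumed optimal ISI/RI range, in percent

in_window = {
    group: OPTIMAL_WINDOW[0] <= ratio <= OPTIMAL_WINDOW[1]
    for group, ratio in immediate_ratio.items()
}

for group in table44_average:
    print(f"{group:6s} delayed ISI/RI = {delayed_ratio[group]:.2f}%, "
          f"immediate ratio within 10-30% = {in_window[group]}")
```

On these assumptions only the medium group falls inside the window on the immediate posttest, while all three delayed ratios lie far below it, which is the basis for the two predicted orders.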
Although no significant difference existed among the three experimental groups in their posttest scores, medium and long spacing may be slightly more desirable than short spacing for four reasons. First, the significant interaction between absolute spacing and the RI on the productive posttest suggests that the effects of short spacing may not be as durable as those of medium or long spacing. Second, whereas medium and long spacing significantly outperformed massed learning on all posttests, the difference between short spacing and massed learning reached significance only on the immediate productive posttest. Third, when collapsed across the RIs, the medium group scored higher than the short group with strict scoring on the productive posttest, and the difference showed a tendency towards significance. Fourth, when collapsed across the RIs, the long group fared better than the short group with strict scoring on the receptive posttest, and the difference showed a tendency towards significance. Although inconclusive, based on the above four reasons, medium and long spacing may be recommended over short spacing. While there was no significant difference between medium and long spacing in their effectiveness, medium spacing may be slightly preferable because it produced significantly more correct responses during retrieval practice than long spacing. As incorrect responses during learning may potentially demotivate learners (Fritz, 2010; Logan & Balota, 2008; Mondria & Mondria-de Vries, 1994), medium spacing may have a positive effect on learners’ motivation.
If we assume that medium spacing was the most desirable among the three experimental treatments, a prescriptive conclusion may be that learners should use a mean ISI of around 2 minutes (117.00 seconds in medium spacing) in flashcard learning. Although this may serve as a rough guideline, there may be at least three limitations to this recommendation. First, the present study did not examine the effects of mean ISIs that are longer than 355.91 seconds. If a longer ISI such as 10 minutes had been used, it might have fared significantly better than a 2-minute ISI. Second, the ISI-RI interaction predicts that the longer the RI becomes, the longer the optimal ISI is (4.1.1). Therefore, although an ISI of 2 minutes may be desirable 1 week after the treatment, it may not necessarily be effective at longer RIs. Third, although this study suggested that an ISI of around 2 minutes may be desirable, the optimal ISI may be affected by a number of factors such as the difficulty of target words or learners’ memory capacity (Finley et al., 2011; Fritz, 2010; Karpicke & Roediger, 2007a; Maddox et al., 2011; Pavlik & Anderson, 2008), which awaits future research.
Whereas no statistically significant difference existed among short, medium, and long spacing, massed learning turned out to be the least effective regardless of the posttest, scoring system, or RI (see Karpicke & Bauernschmidt, 2011; Siegel & Misselt, 1984, for similar findings). When correct spelling was required, medium and long spacing were more than twice as effective as massed learning on the delayed productive posttest. Although massed learning required the least study time, it was no more efficient than spaced learning either. The findings seem to underscore the importance of spacing in vocabulary learning. Considering that learners are often unaware of the benefits of spacing (Hartwig & Dunlosky, 2011; Son & Simon, 2012; Wissman et al., 2012), it may be useful to raise awareness that spacing increases learning. One reason why learners prefer massed learning may be that it produces a high probability of retrieval success during learning, which may create what Kornell (2009) refers to as ‘an illusion of effective learning’ (p. 1302) and cause learners to overestimate retention. The results of the questionnaire, however, indicated that learners may be aware of the benefits of spacing. The questionnaire showed that the participants perceived short and medium spacing to be significantly more effective than massed learning (see 4.6.1). One caveat to be considered is that absolute spacing was a between-participant variable in the current study. Because each participant experienced only one of the four absolute spacing conditions, it may not necessarily be valid to compare the ratings for the four schedules.
Another pedagogical implication of this study is that learning phase performance may not necessarily be a good index of long-term retention (e.g., Bjork, 1994, 1999; Pashler et al., 2003; Schmidt & Bjork, 1992; Schneider et al., 2002). Although massed learning led to the best learning phase performance, it turned out to be the least effective on both the immediate and delayed posttests. The results seem to be consistent with the desirable difficulty framework (Bjork, 1994, 1999; Schmidt & Bjork, 1992), according to which a treatment that increases initial rate of acquisition does not always enhance long-term retention (see 3.1.2). The finding implies that it may be useful to raise awareness that making mistakes during learning is not necessarily a sign of ineffective learning. Massed learning might give learners confidence because it often produces a large number of correct responses during practice. However, learners should be aware that the advantage of massed learning tends to disappear over time. Spaced learning, in contrast, may appear ineffective in the short term because it may lead to unsuccessful learning phase performance. Yet, learners should not be discouraged or demotivated because despite a large number of errors during learning, spaced learning often yields superior long-term retention to massed learning (e.g., Cepeda et al., 2006; Janiszewski et al., 2003). For research methodology, the lack of a positive correlation between learning phase and posttest performance suggests that it might not be valid to use learning phase performance such as trials to criterion (e.g., Higa, 1963; Tinkham, 1993, 1997; Waring, 1997b) as an index of long-term retention (Bjork, 1994, 1999; Ellis, 1995; see 2.3.7).
Theoretical implications
Let us now consider the theoretical implications of this study. In the current study, no statistically significant difference existed among the following three mean ISI/RI ratios on the immediate posttest: 8.9% (short), 21.8% (medium), and 91.3% (long; see 4.6.1). The lack of significant differences between medium and long spacing might have arisen because memory performance tends to decrease only gradually beyond the optimal range (Finding 3 in Figure 14). The difference between short and medium spacing was not statistically significant either. This may be partly because the ISI/RI ratio of short spacing (8.9%) was close to the optimal range of 10-30%. As noted in 4.1.2, the optimal ISI/RI ratio may depend on various factors such as the RI (Cepeda et al., 2008), task complexity (Bird, 2010; Donovan & Radosevich, 1999), or type of posttest (Cepeda et al., 2008). As a result, it may be possible that the optimal ISI/RI ratio in this study was lower than 10%, and the mean ISIs of both short and medium groups might have fallen within or around the optimal range. Alternatively, the lack of significant differences among the experimental groups may be partly ascribed to the relatively large variations between individuals as indicated by the SDs on the posttest scores.
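As a rough illustration of how the ISI/RI ratios discussed above are obtained, the following Python sketch divides the interstudy interval (ISI) by the retention interval (RI) and expresses the result as a percentage. The function names, the example durations, and the use of seconds as the unit are assumptions made for illustration only; the 10-30% bounds are simply the optimal range mentioned above, not a new estimate.

```python
def isi_ri_ratio(isi_seconds: float, ri_seconds: float) -> float:
    """Return the ISI expressed as a percentage of the RI."""
    if ri_seconds <= 0:
        raise ValueError("RI must be positive")
    return 100.0 * isi_seconds / ri_seconds


def in_optimal_range(ratio_percent: float, low: float = 10.0, high: float = 30.0) -> bool:
    """Check whether a ratio lies inside the assumed optimal range (default 10-30%)."""
    return low <= ratio_percent <= high


# Hypothetical durations, not taken from the study: a 2-minute ISI and a 10-minute RI.
example_ratio = isi_ri_ratio(120, 600)
print(f"{example_ratio:.1f}% in optimal range: {in_optimal_range(example_ratio)}")
```

For a fixed ISI, lengthening the RI shrinks the ratio, which is why the same schedules yield much smaller ratios on a delayed posttest than on an immediate one.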
On the delayed posttest, no statistically significant difference was found among the following three mean ISI/RI ratios either: 0.01%, 0.02%, and 0.06%. The finding is inconsistent with the lag effect, according to which larger absolute spacing generally leads to better long-term retention than shorter spacing (see 4.1.1). The lack of significant differences on the delayed posttest may be partly ascribed to the rather narrow range of the ISI/RI ratios (0.01%, 0.02%, and 0.06%). In other words, depending on the range of ISIs, larger absolute spacing may not necessarily lead to better long-term retention than shorter spacing. For instance, although an ISI of 1 day may be significantly more effective than an ISI of 5 minutes (Cepeda et al., 2009, Experiment 1), there may be only a small difference between the effects of 5-minute and 10-minute ISIs (see Crothers & Suppes, 1967, Experiment 11; Pyc & Rawson, 2007, Experiment 2, for similar findings; 4.1.2).
Whereas the present study did not support the lag effect, a robust spacing effect was observed. While massed learning led to the best learning phase performance, it turned out to be the least effective on both the immediate and delayed posttests. The findings are congruent with the spacing effect (Cepeda et al., 2006; Dempster, 1989, 1996; Janiszewski et al., 2003; see 4.1.1). Short spacing produced a smaller spacing effect than medium and long spacing: Although short spacing was significantly more effective than massed learning on the immediate productive posttest, the difference between the two did not reach statistical significance on the other posttests. The results may be in part explained by the ISI-RI interaction (see 4.1.1). Due to the ISI-RI interaction, the advantage of short spacing over massed learning might have disappeared on the delayed posttest. Whereas short spacing significantly outperformed massed learning on the immediate productive posttest, it failed to do so on either the immediate or delayed receptive test. The results may be partially due to
the test order. At each test administration, the productive posttest was given prior to the receptive test (see 4.5.8). As correct responses in the latter were used as cues in the former, the productive test might have affected performance on the receptive test, possibly diminishing a potential difference between the two treatments.
4.7.2 Effects of relative spacing
The second research question in this study asked whether expanding spacing is more effective than equal spacing when (a) the treatment involves long absolute spacing, (b) the RI is shorter than 24 hours, (c) feedback is provided after retrievals, (d) the treatment involves productive recall, and (e) no time limit is imposed for retrieval practice. The ANOVA showed a significant main effect of relative spacing with sensitive scoring on the receptive posttest. The interaction between relative spacing and the RI was significant with strict scoring on the productive posttest and with strict and sensitive scoring on the receptive posttest. Yet, only small effect sizes were found, and the confidence intervals of difference were also narrow. None of the other interactions involving relative spacing were significant. Based on the results, it might be reasonable to assume that relative spacing had little effect on vocabulary learning irrespective of the RI or absolute spacing. The statistical significance may be partially due to the relatively large cell size (96; Kline, 2004; Norris & Ortega, 2000). The results imply that when learning from flashcards, learners may study with either equal or expanding spacing. The finding may translate well to paper-based flashcard learning, where expanding spacing may be relatively hard to implement (however, see
Mondria & Mondria-de Vries, 1994, for recommendations on how to implement expanding spacing using paper-based flashcards). Equal spacing may be used in paper-based flashcard learning because it may be as effective as expanding spacing and easier to implement manually. Together with the findings related to the first research question, the results suggest that absolute spacing may have a larger effect on L2 vocabulary learning than relative spacing (Karpicke & Bauernschmidt, 2011). Pedagogically, the findings indicate that introducing a large amount of spacing may be more important than gradually increasing spacing.
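To make the distinction between absolute and relative spacing concrete, the sketch below builds an equal and an expanding schedule that share the same mean interval, so that the two differ only in relative spacing. The function names, the unit (number of intervening trials), and the example values are hypothetical and are not the schedules used in this study.

```python
from typing import List


def equal_schedule(mean_interval: int, n_intervals: int) -> List[int]:
    """Equal spacing: every interval between retrievals is the same."""
    return [mean_interval] * n_intervals


def expanding_schedule(mean_interval: int, n_intervals: int, step: int) -> List[int]:
    """Expanding spacing: intervals grow by `step` while the mean stays fixed."""
    spread = step * (n_intervals - 1)
    if spread % 2 != 0:
        raise ValueError("step * (n_intervals - 1) must be even to keep whole-number intervals")
    first = mean_interval - spread // 2
    if first <= 0:
        raise ValueError("step is too large for this mean interval")
    return [first + step * i for i in range(n_intervals)]


# Hypothetical example: three intervals with a mean of 3 intervening trials.
equal = equal_schedule(3, 3)
expanding = expanding_schedule(3, 3, step=2)
print(equal, expanding)  # same total spacing, different distribution
```

Because both lists sum to the same total, any difference in outcome between them could be attributed to relative rather than absolute spacing, which is the kind of comparison the second research question addresses.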
Theoretical implications
Let us now consider the theoretical implications of this study. Earlier studies demonstrate that expanding spacing may be more effective than equal spacing when (a) the treatment involves long absolute spacing, (b) the task difficulty is high, (c) the RI is shorter than 24 hours, and / or (d) feedback is not provided (see 4.2.2). The present study, however, seems to indicate that there may be little difference between equal and expanding spacing even though the first three of the above four conditions are met. First, neither the interaction between relative and absolute spacing nor the interaction between relative spacing, absolute spacing, and the RI was significant on any of the dependent variables. The results suggest that expanding spacing failed to significantly outperform equal spacing irrespective of absolute spacing. The finding is inconsistent with the observation that expanding spacing may be particularly effective when the treatment involves long absolute spacing (Dobson, 2012; Maddox et al.,
2011). One may argue that the lack of a significant interaction between relative and absolute spacing may be partially attributed to the rather narrow range of the mean ISIs used in the current study (58.97 - 355.91 seconds; see 4.6.1). Yet, as the average ISI used in this experiment was as long as or longer than in all previous studies on paired-associate learning that have found an advantage of expanding spacing (5 - 60 seconds; see 4.5.4), one might have expected similar results to be obtained here. Kang et al. (2013) also failed to find any benefit of expanding over equal spacing in their posttest scores although they used a rather long mean ISI of 9 days (see 4.2.3).
Second, prior studies suggest that a significant expanding retrieval effect tends to be observed when the task difficulty is high (see 4.2.2). The present study differed from the earlier L2 vocabulary studies in two factors that may affect task difficulty: the format and pacing of retrieval practice. First, unlike the existing L2 vocabulary studies, which used a receptive recall format during learning (see 4.2.3), a productive recall format was used in the current study. Because productive retrieval is more demanding than receptive retrieval (see 3.1.1), the use of productive recall might have increased task difficulty. Second, unlike the earlier studies, retrieval practice was self-paced, and no time limit was imposed in the present study. This manipulation might have potentially decreased task difficulty (Nation, 1982, 2001, p. 305). However, the response latency in this study was not considerably longer compared with the previous L2 vocabulary studies (see 4.6.1). Self-pacing of retrieval practice, therefore, might not have significantly affected task difficulty. Taken together, the task
difficulty in this study was perhaps slightly higher than in the existing L2 vocabulary studies because (a) target words were practised in productive recall instead of receptive recall, and (b) the response latency was roughly the same as in the previous studies although retrieval practice was self-paced. Despite the potentially higher task difficulty, the present study found little difference between equal and expanding spacing in their effectiveness. The results could be taken as further evidence indicating that expanding spacing might have only a small facilitative effect.
Third, the results of this study were also at odds with the finding that expanding spacing may be particularly effective when the RI is shorter than 24 hours (see 4.2.2). The interaction between relative spacing and the RI was significant with strict scoring on the productive posttest and with strict and sensitive scoring on the receptive posttest. However, given the small effect sizes and narrow confidence intervals of difference, it might be reasonable to assume that the RI had little influence on the effects of relative spacing. Although the findings of this study were incongruent with previous studies regarding the effects of absolute spacing, task difficulty, and RI, they supported the feedback hypothesis (see 4.2.2). Taken together, the present and previous studies may indicate that as long as the treatment involves feedback, there may be little difference between equal and expanding spacing irrespective of absolute spacing, task difficulty, and RI.
The findings of the current study add to the existing literature suggesting that expanding spacing might not be the optimal relative spacing schedule (e.g., Cepeda et al., 2006; Kang et al., 2013; Karpicke & Bauernschmidt, 2011; Pyc & Rawson, 2007; Roediger & Karpicke, 2010). A unique contribution of this study is that the results indicated that when feedback is provided, expanding spacing may not necessarily enhance L2 vocabulary learning even though the treatment involves long absolute spacing, task difficulty is high, and the RI is shorter than 24 hours, the conditions in which the expanding retrieval effect has traditionally been observed (see 4.2.2). There remains a possibility that expanding spacing may fare significantly better than equal spacing when feedback is not provided. Although this possibility may be of theoretical importance, it may have limited pedagogical value because not only does feedback enhance learning but also authentic flashcard learning typically involves feedback (see 4.2.3).
4.8 Limitations
One limitation of the present study may be the rather narrow range of ISI/RI ratios used. This may be in part responsible for the lack of significant differences among short, medium, and long spacing. Future research may use a wider range of ISI/RI ratios. Second, although the optimal spacing schedule may be affected by factors such as the difficulty of target words or learners’ memory capacity (Finley et al., 2011; Fritz, 2010; Karpicke & Roediger, 2007a; Maddox et al., 2011; Pavlik & Anderson, 2008), individual or item differences were not dealt with in the current study. It may be useful to examine the effects of absolute and relative spacing while also taking these variables into consideration (see Pavlik & Anderson, 2008, for an example).
Chapter 5. STUDY 4: RETRIEVAL FREQUENCY AND FEEDBACK TIMING
The purpose of the present study was to examine the effects of retrieval frequency and timing of feedback on L2 vocabulary learning. Retrieval frequency refers to the number of retrieval attempts in flashcard learning. For instance, if learners practise retrieval five times, the retrieval frequency is five. The timing of feedback is concerned with when to provide feedback for retrieval. Feedback has been categorised into two types: immediate and delayed (e.g., Butler et al., 2007; Kulik & Kulik, 1988; Metcalfe et al., 2009). The former is typically given immediately after each response, whereas the latter is provided after a number of items or a period of time (see 5.2).
Given that retrieval increases learning (see 1.1), how many times should learners practise retrieval in order to learn L2 vocabulary from flashcards? Although previous studies have investigated the relationship between the retrieval frequency and learning, they are limited in that they (a) examined only a limited range of retrieval frequency levels (Karpicke & Roediger, 2007a, 2007b; Logan & Balota, 2008; Rohrer et al., 2005; Tulving, 1967; Zaromb & Roediger, 2010), (b) did not provide feedback after retrievals (Karpicke & Roediger, 2007a, Experiments 1 & 3; Logan & Balota, 2008), (c) did not use a retention interval (RI) greater than 1 week (Karpicke & Roediger, 2007a, 2007b, 2008; Logan & Balota, 2008; Pyc & Rawson, 2007, 2011; Tulving, 1967; Zaromb & Roediger, 2010), (d) associated higher retrieval frequency with lower exposure frequency (Karpicke & Roediger, 2007b; Tulving, 1967; Zaromb & Roediger, 2010), and / or (e) possibly confounded the retrieval frequency with the
item difficulty (Karpicke & Roediger, 2008; Pyc & Rawson, 2007, 2011). This study investigated the effects of retrieval frequency on L2 vocabulary learning while addressing the above limitations. Findings of this study may be useful because they may allow us to identify the optimal retrieval frequency for L2 vocabulary learning from flashcards.
As for the timing of feedback, empirical studies have yielded mixed results regarding the effects of immediate and delayed feedback (e.g., Butler et al., 2007; Kulik & Kulik, 1988; Metcalfe et al., 2009). Furthermore, previous studies on feedback timing may suffer from at least three limitations. First, feedback timing was confounded with lag to test in some earlier studies (e.g., Berlyne, 1966; Butler et al., 2007; Haynes, 1974; Kulhavy & Anderson, 1972; O’Neill, Rasor, & Bartz, 1976; Surber & Anderson, 1975; Swindell & Walls, 1993). These results, therefore, may be at least partly attributed to lag to test rather than feedback timing per se (see 5.2 for details). Second, previous research suggests that delayed feedback may be particularly effective when the treatment induces only a few errors of commission, or incorrect responses (Metcalfe et al., 2009; note that hereafter, errors refer only to incorrect responses [i.e., errors of commission] and do not include blank responses [i.e., errors of omission]). This suggests that in order to obtain a comprehensive picture regarding the effects of immediate and delayed feedback, it may be useful to manipulate the frequency of errors during learning. Yet, a possible relationship between the feedback timing effect and learning phase performance has not been explored thoroughly in the existing
literature. Third, none of the existing feedback timing studies looked into the learning of L2 vocabulary. Due to the inconsistent findings and limitations of the previous research, it is not clear whether immediate or delayed feedback is more effective for L2 vocabulary learning. With the above discussion in mind, the present study also compared the effects of immediate and delayed feedback on L2 vocabulary learning while controlling the lag to test and manipulating the frequency of errors during learning. The results of this study may allow us to identify the optimal feedback timing for flashcard learning.
In the present study, participants were assigned to one of the four retrieval frequency levels: one, three, five, and seven. The timing of feedback was manipulated within participants, and immediate and delayed feedback were compared. In the former, feedback was provided immediately after each retrieval attempt. In the latter, feedback was withheld until all target items were practised. Learning was measured by productive and receptive recall posttests administered at three RIs: immediately, 1 week, and 4 weeks after the treatment. Findings of this study may allow us to identify the optimal retrieval frequency and feedback timing for L2 vocabulary learning from flashcards.
5.1 Effects of Retrieval Frequency
Previous studies indicate that retrieval frequency may affect learning. Karpicke and Roediger (2007a), for instance, investigated the effects of retrieval frequency in three experiments. In their experiments, American undergraduate students studied low-frequency English words paired with higher frequency synonyms (e.g., sobriquet - nickname). Target items were assigned to one of the following two conditions: repeated test and single test. In the repeated test condition, the target words were practised in a receptive recall format three (Experiments 1 and 2) or four times (Experiment 3) during the treatment, whereas in the single test condition, they were practised only once. (Note that a test here refers to retrieval practice given as part of the treatment, not the posttest.) Although feedback was provided after each response in Experiment 2, no feedback was given in Experiments 1 and 3. Learning was measured by a receptive recall posttest administered at two RIs: 10 minutes and 2 days after the treatment. In Experiment 1, the repeated test condition (39%) outperformed the single test condition (26%) on the 2-day delayed posttest although there was little difference on the immediate posttest (repeated test: 66%, single test: 61%). In Experiment 2, the repeated test condition fared significantly better than the single test condition on both the immediate (repeated test: 89%, single test: 66%) and delayed posttests (repeated test: 55%, single test: 30%). In Experiment 3, Karpicke and Roediger (2007a) also found a significant advantage of the repeated test condition on the immediate (repeated test: 63%, single test: 44%) and the delayed posttests (repeated test: 48%, single test: 30%). Logan and Balota (2008) also found that three retrievals were significantly more effective than one when learning weakly associated L1 word pairs (e.g., sheep - cloth).
Two experiments conducted by Rohrer et al. (2005), however, indicate that although repeated retrieval may be effective at a relatively short RI (e.g., 1 week), its advantage may disappear over time. In their first experiment, 130 American undergraduate students studied 10 city-country pairs. At the beginning of the treatment, participants were presented with the name of a city and the country where the city is located (e.g., Chiba - Japan). After the initial presentation, participants were presented with a city name (e.g., Chiba) and asked to produce the corresponding country name (e.g., Japan). The participants were randomly assigned to the Retrieval 5 and 20 groups. The retrieval frequency was five for the former and 20 for the latter. Feedback was provided after retrievals. The RI was manipulated between participants, and the participants took the posttest 1, 3, or 9 weeks after the treatment. Rohrer et al. found that although the Retrieval 20 group significantly outperformed the Retrieval 5 group 1 week after the treatment, there was no statistically significant difference between the two groups on the 3- and 9-week delayed posttests.
In Rohrer et al.’s (2005) Experiment 2, 88 American undergraduate students studied low-frequency English words paired with higher frequency synonyms (e.g., acrogen – fern). The participants were divided into two groups: Retrieval 5 and 10. The RI was manipulated between participants, and the participants took the posttest 1 or 4 weeks after the treatment. While the Retrieval 10 group (M = 64%) fared significantly better than the Retrieval 5 group (M = 38%) on the 1-week delayed posttest, no statistically significant difference existed between the two groups 4 weeks after the treatment
(Retrieval 5: 18%; Retrieval 10: 22%). Based on their findings, Rohrer et al. argue that the benefits of repeated retrieval may be short-lived. Their results suggest that when examining the effects of retrieval frequency, it may be useful to give a posttest at a relatively long RI (e.g., 3 or 4 weeks).
Although the studies reviewed above used only two levels of retrieval frequency (e.g., one and three in Logan & Balota, 2008; five and 10 in Rohrer et al., 2005, Experiment 2), Karpicke and Roediger (2007b, Experiment 1), Tulving (1967), and Zaromb and Roediger (2010) investigated the effects of three retrieval frequency levels. In Karpicke and Roediger (2007b, Experiment 1), 60 American undergraduate students were asked to remember 40 L1 known words. The participants were randomly assigned to the standard, repeated study, and repeated test groups. The treatment consisted of a series of study and test (retrieval practice) phases. In the study phase, each target word was presented for 3 seconds. In the test phase, participants were given 2 minutes and asked to write down as many target items as they remembered. Both the numbers of test and study phases were manipulated. The standard group had 10 test phases and 10 study phases, the repeated test group had 15 test phases and five study phases, and the repeated study group had five test phases and 15 study phases.
On the posttest conducted 1 week after the treatment, the standard (68%) and repeated test groups (64%) significantly outperformed the repeated study group (57%). The result may suggest that 10 and 15 retrievals may be more effective than five. However,
Karpicke and Roediger’s (2007b) finding may need to be interpreted with caution because although the repeated test group had more test phases (15) than the standard (10) and repeated study groups (5), it had fewer study phases (5) than the standard (10) and repeated study groups (15). Because of this confound, the result of their study may be at least partly attributed to the exposure frequency (number of study phases) rather than the retrieval frequency (number of test phases) per se. Studies conducted by Tulving (1967) and Zaromb and Roediger (2010) also have a similar confound.
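The confound described above can be checked with simple arithmetic. The following sketch uses only the numbers of test and study phases reported above for Karpicke and Roediger (2007b, Experiment 1); the variable and function names are illustrative assumptions. It shows that the total number of phases is constant across groups, so every additional test phase comes at the cost of one study phase.

```python
# (test phases, study phases) for each group, as described in the text above.
design = {
    "standard": (10, 10),
    "repeated test": (15, 5),
    "repeated study": (5, 15),
}


def total_phases(groups: dict) -> dict:
    """Sum test and study phases for each group."""
    return {name: tests + studies for name, (tests, studies) in groups.items()}


def retrieval_confounded_with_exposure(groups: dict) -> bool:
    """True when groups differ in test phases but share one total,
    so that more retrievals always mean fewer exposures."""
    totals = set(total_phases(groups).values())
    test_counts = {tests for tests, _ in groups.values()}
    return len(totals) == 1 and len(test_counts) > 1


print(total_phases(design))
print(retrieval_confounded_with_exposure(design))
```

A design that avoids this confound would hold the number of study phases constant while varying only the number of test phases.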
L2 vocabulary studies on dropout schedules
L2 vocabulary studies investigating the effects of dropout schedules (Karpicke & Roediger, 2008; Pyc & Rawson, 2007, 2011) may also have implications for the optimal retrieval frequency. A dropout schedule refers to a schedule where target items are excluded from the treatment after they reach a certain criterion (e.g., one or two correct recalls). In Karpicke and Roediger (2008), for instance, 40 American undergraduate students studied 40 Swahili-English word pairs. The treatment consisted of four sets of alternating study and test phases (i.e., STSTSTST; where S denotes a study phase and T denotes a test phase). In the study phase, the Swahili and English words were presented simultaneously for 5 seconds per word pair. In the test phase, target items were practised in a receptive recall format. Each item appeared only once in each study or test phase. In the fixed condition, target items were studied and tested four times during the treatment. In the dropout condition, target items were
dropped out of the treatment after one correct recall. As a result, in the dropout condition, the retrieval frequency was different for each item for each participant. For instance, the retrieval frequency might have been one for some items while it might have been four for others. When averaged over participants and items, target items were studied and tested 1.94 times in the dropout condition as opposed to four times in the fixed condition. Although feedback was not provided during the test phase, participants were presented with the target word pairs in the next study phase. As a result, the study phase effectively functioned as delayed feedback. On the posttest administered 1 week after the treatment, the fixed condition (M = approximately 80%) significantly outperformed the dropout condition (M = 33%). (The exact mean score in the fixed condition is not available because it is described merely as ‘about 80%’; Karpicke & Roediger, 2008, p. 967.) Their result seems to suggest that when learning L2 vocabulary, four retrievals may be more effective than approximately two (1.94) retrievals. The relatively high posttest score in the fixed condition (approximately 80%) also indicates that when the RI is 1 week and feedback is delayed, four retrievals might be sufficient for acquiring receptive vocabulary knowledge.
Even though the findings of Karpicke and Roediger (2008) are very valuable, their results may need to be interpreted with caution because the retrieval frequency may have been confounded with the item difficulty. Specifically, because target items were dropped out of the treatment after one correct recall in the dropout condition, difficult items might have been practised more often than easy items. For instance, items with a low learning burden might have been practised only once because they reached the criterion after one retrieval attempt. In contrast, the retrieval frequency might have been four for difficult items because they never reached the criterion. Because of this possible confound of the retrieval frequency with the item difficulty, Karpicke and Roediger (2008) may not necessarily allow us to identify the optimal retrieval frequency for L2 vocabulary learning from flashcards. Studies conducted by Pyc and Rawson (2007, 2011) also have a similar confound.
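The way a dropout schedule ties retrieval frequency to item difficulty can be sketched as follows. This is a simplified illustration, not a reconstruction of any study's procedure: the function name, the three example items, and the trial on which each is first recalled are all hypothetical. Only the criterion of one correct recall and the maximum of four test phases follow the description above.

```python
from typing import Optional


def retrieval_count(first_correct_trial: Optional[int],
                    max_trials: int = 4,
                    dropout: bool = True) -> int:
    """Number of retrieval attempts an item receives.

    first_correct_trial: test phase on which the item is first recalled
    correctly (None if it is never recalled within max_trials).
    Under a dropout schedule the item leaves the treatment after that trial;
    under a fixed schedule it is always tested max_trials times.
    """
    if not dropout or first_correct_trial is None or first_correct_trial > max_trials:
        return max_trials
    return first_correct_trial


# Hypothetical items of increasing difficulty.
items = {"easy": 1, "medium": 2, "hard": None}

dropout_counts = {name: retrieval_count(trial) for name, trial in items.items()}
fixed_counts = {name: retrieval_count(trial, dropout=False) for name, trial in items.items()}
print(dropout_counts)
print(fixed_counts)
```

Under the dropout schedule the hardest item receives the most retrievals, so any comparison across retrieval frequencies is also a comparison across item difficulty; under the fixed schedule every item receives the same number of retrievals.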
Limitations of previous studies
Even though the findings of the previous studies are very valuable, they may suffer from at least five limitations. First, because existing studies used only two (Karpicke & Roediger, 2007a; Logan & Balota, 2008; Rohrer et al., 2005) or three levels of retrieval frequency (Karpicke & Roediger, 2007b; Tulving, 1967; Zaromb & Roediger, 2010), they may not necessarily allow us to obtain a comprehensive picture regarding the effects of retrieval frequency. Second, feedback was not provided after retrievals in some studies (Karpicke & Roediger, 2007a, Experiments 1 & 3; Logan & Balota, 2008). Thus, it is unclear to what extent these findings may be applicable to authentic flashcard learning, which typically involves feedback (e.g., Karpicke & Bauernschmidt, 2011; Karpicke & Roediger, 2007a; Karpicke, Smith, et al., 2009). Third, Rohrer et al. (2005) showed that although repeated retrieval may be effective at a 1-week RI, its advantage may not be observed if a posttest is given at an RI of 3 weeks or greater. Their findings suggest that it may be useful to give a posttest at a
relatively long RI when examining the effects of retrieval frequency. Yet, none of the existing studies except Rohrer et al. (2005) used an RI greater than 1 week. Fourth, higher retrieval frequency was associated with lower exposure frequency in Karpicke and Roediger (2007b), Tulving (1967), and Zaromb and Roediger (2010). As a result, it is not possible to isolate the effects of retrieval and exposure frequency in these studies. Fifth, in the studies on dropout schedules (Karpicke & Roediger, 2008; Pyc & Rawson, 2007, 2011), the retrieval frequency was different for each item for each participant, and the retrieval frequency might have been confounded with the item difficulty. Because of this possible confound, these studies may not necessarily allow us to identify the optimal retrieval frequency.
The above discussion suggests that when investigating the effects of retrieval frequency on L2 vocabulary learning, it may be useful to meet the following five conditions: (a) examine the effects of a wide range of retrieval frequency levels, (b) provide feedback after retrievals, (c) give a posttest at an RI greater than 1 week, (d) manipulate the retrieval frequency without associating higher retrieval frequency with lower exposure frequency, and (e) manipulate the retrieval frequency without confounding it with the item difficulty. None of the previous studies, however, meets all of these five conditions. With the limitations of the existing studies in mind, the present study investigated the effects of four levels of retrieval frequency (one, three, five, and seven) on L2 vocabulary learning while meeting all of the above five conditions. Findings of this study may be useful because they may allow us to identify
the optimal retrieval frequency for L2 vocabulary learning.
5.2 Effects of Feedback Timing
The timing of feedback is regarded as a factor that may affect second language acquisition (DeKeyser, 2007; Long & Richards, 2007) as well as learning in general (e.g., Butler et al., 2007; Carpenter & Vul, 2011; Metcalfe et al., 2009; Mory, 2004; Pashler et al., 2007). Previous studies have examined the effects of two types of feedback: immediate and delayed. The former is typically given immediately after each response, whereas the latter is provided after a number of items or a period of time. (Previous studies, however, differ in the operationalisation of the two types of feedback. See below for details.)
Immediate feedback seems to be used more frequently than delayed feedback in both paper- and computer-based flashcard learning. Karpicke, Smith, et al. (2009), for instance, surveyed 103 American college students and found that when learning from paper-based flashcards, 91% of them confirm the correct answer immediately after retrieval attempts, which is equivalent to receiving immediate feedback. Immediate feedback also appears to be a common feature among existing flashcard software. In all nine flashcard programs surveyed by Nakata (2011), feedback is given immediately after each retrieval attempt. Although immediate feedback may be more common than delayed feedback in a real-life study situation, some researchers argue that delaying feedback may increase learning (e.g., Butler et al., 2007; Carpenter &
Vul, 2011; Metcalfe et al., 2009; Mory, 2004; Pashler et al., 2007; Roediger et al., 2010). The superiority of delayed over immediate feedback is referred to as the delay-retention effect (hereafter, DRE; Kulhavy & Anderson, 1972; Metcalfe et al., 2009; Mory, 2004; Sturges, Sarafino, & Donaldson, 1968).
The DRE may be accounted for by the distributed practice effect and interference theory. The distributed practice effect refers to the phenomenon where larger spacing generally leads to better long-term retention than shorter spacing or no spacing at all (Cepeda et al., 2006; see 4.1.1). Because delayed feedback is given after a greater delay than immediate feedback, the distributed practice effect predicts that delaying feedback may facilitate learning (Butler et al., 2007; Metcalfe et al., 2009; Pashler et al., 2007). Interference theory also predicts an advantage of delayed over immediate feedback (Butler et al., 2007; Carpenter & Vul, 2011; Kulhavy & Anderson, 1972; Metcalfe et al., 2009; Mory, 2004). Suppose learners produced an erroneous response in retrieval practice. When the correct response is given immediately after the retrieval attempt, learners may confuse their error with the correct response and might learn false information. In contrast, when feedback is given after a delay, learners’ errors may be forgotten by the time they receive feedback. As a result, their error is less likely to interfere with the correct response, which may increase learning.
The theory of errorless learning (e.g., Skinner, 1954), on the other hand, predicts an advantage of immediate feedback (Butler et al., 2007; Metcalfe et al., 2009; Pashler et
al., 2007). According to this theory, immediate feedback may be more effective because, unless feedback is given immediately after retrieval, learners’ errors might be consolidated, and learners might learn false information. (Note that predictions based on interference theory and the theory of errorless learning both rest on the assumption that learners make errors during learning. If the treatment induces only a few errors, both theories may be irrelevant. See below for details.)
Empirical studies have yielded mixed results regarding the effects of immediate and delayed feedback. Kulik and Kulik (1988), for instance, conducted a meta-analysis of 53 experiments on feedback timing and found that although 26 of them observed the DRE, the other 27 failed to do so. Several explanations have been offered for the inconsistent results. First, Kulik and Kulik (1988) point out that studies conducted in laboratory settings are more likely to observe a significant DRE than those conducted in actual classroom settings. Butler et al. (2007) and Roediger et al. (2010) speculate that the setting of the experiment (laboratory or classroom) may interact with the DRE probably because participants may process feedback differently in laboratory and classroom studies. Specifically, Butler et al. and Roediger et al. point out that delayed feedback may not be studied as carefully as immediate feedback in classroom studies. Laboratory studies, in contrast, usually require learners to study feedback for a fixed amount of time in both immediate and delayed feedback conditions, ensuring that both types of feedback are processed equally carefully. Because learners may pay more attention to delayed feedback in laboratory studies than in classroom studies,
laboratory studies may be more likely to produce a significant DRE (Butler et al., 2007; Roediger et al., 2010). A possible interaction between the DRE and experimental settings may partially account for the mixed results of previous studies.
Second, Metcalfe et al. (2009) point out that the effects of immediate and delayed feedback may be conditional upon learning phase performance. As noted above, interference theory predicts an advantage of delayed over immediate feedback, whereas the theory of errorless learning predicts a superiority of immediate feedback. It should be noted that both predictions are based on the assumption that learners make errors (incorrect responses) during learning and may be irrelevant when the treatment induces only a few errors (see above). The distributed practice effect, in contrast, may be observed for both correct and incorrect responses. As a result, when learners produce few errors during learning, a significant DRE may be observed due to the distributed practice effect. If the treatment induces many errors, in contrast, the beneficial effects of delaying feedback (larger spacing and less interference) might be offset by the risk of not correcting an error immediately, and a DRE may not be observed (Metcalfe et al., 2009). The inconsistent findings of previous studies may also be attributed in part to a possible interaction between the DRE and learning phase performance.
Two experiments conducted by Metcalfe et al. (2009) suggest that the DRE may interact with the frequency of errors during the treatment. In their Experiment 1, 27
American grade school children studied 72 low-frequency English words (e.g., inefficiency, inscription). At the beginning of the treatment, participants were presented with a target word followed by its definition. After all target words were introduced, participants were presented with a definition and asked to type the corresponding target word. Delayed feedback was provided 1 or 4 days after the response, whereas immediate feedback was given on the same day. In Experiment 2, 20 Columbia University students studied 75 low-frequency English words (e.g., complaisant, penumbra). Immediate feedback was given on the same day as the initial treatment session, whereas delayed feedback was given 3.85 days later on average. Metcalfe et al. found a larger DRE in their Experiment 1 than in their Experiment 2. Metcalfe et al. (2009) argue that the results might have been caused by a difference in the frequency of errors during learning. Participants in Experiment 2 produced more errors (61% of all responses) during the treatment than those in Experiment 1 (40%). Because delayed feedback may be particularly effective when the treatment induces few errors, Metcalfe et al.'s Experiment 1 might have produced a larger DRE than their Experiment 2. One limitation of their study, though, is that although their first experiment was conducted with grade school children, the participants in Experiment 2 were university students. As a result, the results of their experiments may be at least partly attributed to the difference in the age of participants rather than differential learning phase performance.
Third, the inconsistent results may be partially due to methodological differences.
Previous studies differ in the operationalisation of immediate and delayed feedback (Roediger et al., 2010). For instance, in Carpenter and Vul (2011), immediate feedback was given immediately after each response, and delayed feedback was given 3 seconds after. In Phye and Baller (1970), in contrast, immediate feedback was provided after 30 minutes, and delayed feedback was provided 2 days after the response. This implies that what is classified as immediate feedback in some studies may qualify as delayed feedback in others. The difference in the operationalisation of the two types of feedback may partially account for the mixed results of previous studies. Earlier studies also differ in the materials. The materials used by previous studies include L1 vocabulary, word-number pairs, trigram pairs, L2-trigram pairs, face-name pairs, reading materials, motor skills, programming languages, mathematics, chemistry, and psychology (see Kulik & Kulik, 1988, for a review). These methodological differences could partially be responsible for the inconsistent results of previous studies as well.
Limitations of previous studies

Previous feedback timing studies not only report inconsistent results but also suffer from at least three limitations. First, some earlier studies did not control for lag to test. Lag to test refers to the interval between the last encounter with a given item and the test and has been shown to affect memory performance (see 2.1.2). Let me illustrate this point by using Butler et al. (2007, Experiment 2) as an example. Butler et al. examined the effects of immediate and delayed feedback on the retention of reading materials. Figure 19 summarises the design of their experiment.
Figure 19. Design of Butler et al.’s (2007) Experiment 2.
As illustrated in the figure, Butler et al.’s (2007) experiment was conducted over a period of 8 days. On Day 1, 40 American undergraduate students read 12 passages and answered multiple-choice comprehension questions. Feedback was given immediately after each response in the immediate feedback condition, while it was provided on Day 2 in the delayed feedback condition. The posttest was administered on Day 8. Butler et al. found that delayed feedback led to a higher posttest score (70%) than immediate feedback (60%). Based on their finding, Butler et al. argue that delaying feedback may increase long-term retention.
Metcalfe et al. (2009), however, point out that Butler et al.’s (2007) finding may be at least partly attributed to lag to test rather than feedback timing per se. Specifically, in the immediate feedback condition in Butler et al., participants received feedback on Day 1, and the posttest was conducted on Day 8. Hence, there was a lag of 7 days between feedback and the posttest. In the delayed feedback condition, however, there was a lag of only 6 days between feedback and the posttest (Figure 19). The
confounding of feedback timing and lag to test is problematic because a shorter lag to test generally leads to better memory performance than a longer lag (see 2.1.2). Feedback timing was confounded with lag to test in other existing studies as well (e.g., Berlyne, 1966; Haynes, 1974; Kulhavy & Anderson, 1972; O’Neill et al., 1976; Surber & Anderson, 1975; Swindell & Walls, 1993; J. M. Webb, Stock, & McCarthy, 1994).
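The arithmetic behind this confound can be made explicit. The following is a minimal illustrative sketch: the function name is hypothetical, and the day numbers are those reported above for Butler et al.'s (2007) Experiment 2.

```python
def lag_to_test(feedback_day: int, posttest_day: int) -> int:
    """Days between the last encounter with an item (here, feedback) and the posttest."""
    return posttest_day - feedback_day


POSTTEST_DAY = 8  # posttest administered on Day 8

# Immediate feedback was given on Day 1, delayed feedback on Day 2.
immediate_lag = lag_to_test(feedback_day=1, posttest_day=POSTTEST_DAY)
delayed_lag = lag_to_test(feedback_day=2, posttest_day=POSTTEST_DAY)

print(immediate_lag, delayed_lag)  # 7 6
```

Because the delayed feedback condition has the shorter lag, any advantage it shows cannot be attributed to feedback timing alone.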
Second, as described above, Metcalfe et al. (2009) observe that delayed feedback may be particularly effective when the treatment induces few errors. This suggests that in order to obtain a comprehensive picture regarding the effects of immediate and delayed feedback, it may be useful to manipulate the frequency of errors during learning. Yet, a possible relationship between the DRE and learning phase performance has not been explored thoroughly in the existing literature. Metcalfe et al.'s (2009) study, which is described earlier, constitutes the only exception. However, the results of their study may be at least partly attributed to the difference in the age of participants (i.e., grade school vs. college students) rather than the difference in the proportions of errors per se. With this limitation in mind, the present study manipulated the frequency of errors during the treatment without confounding it with the age of participants. The frequency of errors during learning was manipulated in this study by using four levels of retrieval frequency: one, three, five, and seven. Because repeated retrieval may lead to more successful learning phase performance than fewer retrievals (e.g., practising retrieval seven times may lead to better learning
phase performance than practising retrieval three times), manipulating retrieval frequency may allow us to test the view that delayed feedback may be particularly effective when the treatment induces few errors.
Another limitation of the previous studies may be that none of them examined L2 vocabulary learning. Thus, it is unclear to what extent their findings may be applicable to L2 vocabulary learning. With the inconsistent findings and limitations of the existing studies in mind, the present study compared the effects of immediate and delayed feedback on L2 vocabulary learning while controlling the lag to test and manipulating the frequency of errors during learning. The results of this study may allow us to determine which type of feedback should be used in order to optimise flashcard learning.
5.3 The Present Study

The purpose of the present study was to identify the optimal retrieval frequency and feedback timing for L2 vocabulary learning. Previous studies on retrieval frequency may be limited in that they (a) examined only a limited range of retrieval frequency levels, (b) did not provide feedback after retrievals, (c) did not use an RI greater than 1 week, (d) associated higher retrieval frequency with lower exposure frequency, and/or (e) possibly confounded the retrieval frequency with the item difficulty. This study investigated the effects of four levels of retrieval frequency (one, three, five, and seven) on L2 vocabulary learning while addressing the above limitations of previous research. Findings of this study may be useful because they may allow us to identify the optimal retrieval frequency for L2 vocabulary learning from flashcards.
As for the timing of feedback, empirical studies have yielded mixed results regarding the effects of immediate and delayed feedback (e.g., Butler et al., 2007; Kulik & Kulik, 1988; Metcalfe et al., 2009). Furthermore, previous studies on feedback timing may suffer from at least three limitations. First, feedback timing was confounded with lag to test in some earlier studies. Second, although previous research suggests that delayed feedback may be particularly effective when the treatment induces few errors, a possible relationship between the feedback timing effect and learning phase performance has not been explored thoroughly in the existing literature. Third, none of the existing feedback timing studies looked into the learning of L2 vocabulary. Due to the inconsistent findings and limitations of the previous research, it is not clear whether a significant DRE may be observed for L2 vocabulary learning. With the above discussion in mind, the present study also compared the effects of immediate and delayed feedback on L2 vocabulary learning while controlling the lag to test and manipulating the frequency of errors during learning. The results of this study may also allow us to determine which type of feedback should be used in order to optimise flashcard learning.
Research questions

The following two research questions were addressed in this study:

1. What is the optimal retrieval frequency for L2 vocabulary learning from flashcards?
2. Is delayed feedback more effective than immediate feedback for L2 vocabulary learning from flashcards when lag to test is controlled and the treatment induces few errors?
5.4 Pilot Studies

Pilot studies were conducted with eight Japanese learners of English in New Zealand and 45 first-year engineering students at a technical college in Gifu, Japan (GTC; see 4.5.1) to identify any potential problems with the methodology of this experiment. Two potential problems were identified in the pilot studies. First, pilot studies suggested a possible floor effect on the delayed posttests. Second, pilot studies also indicated that some participants in the Retrieval 5 and 7 groups might not have been able to complete the experiment within regular class time (90 minutes). In order to lower task difficulty and ensure that the participants would have enough time to complete the experiment, two changes were made to the target items in the main data collection. First, the number of target items was reduced from 20 to 16. Second, items with low posttest scores in the pilot studies were replaced. See 5.5.3 for further details about the target items.
5.5 Method

5.5.1 Participants

The original pool of participants consisted of 116 first-year engineering students at GTC. Out of the 116 participants, 17 were dropped because they were absent on the day of the experiment, chose not to participate, or their data were lost due to technical problems, leaving 99 students. The participants were randomly assigned to the four retrieval frequency groups: Retrieval 1, 3, 5, and 7. The Retrieval 1, 3, 5, and 7 groups consisted of 24, 24, 24, and 27 participants, respectively. The participants were also randomly divided into Subgroups X and Y (see 5.5.2). In order to counterbalance the effects of target items across participants, each retrieval frequency group needed to consist of an equal number of participants from Subgroups X and Y. Each of the Retrieval 1, 3, and 5 groups consisted of 12 Subgroup X and 12 Subgroup Y participants. However, there was an imbalance in the Retrieval 7 group: It consisted of 13 Subgroup X and 14 Subgroup Y participants. In order to ensure that the Retrieval 7 group would have an equal number of participants (13) from both subgroups, one Subgroup Y participant was excluded from analysis. The participant to be dropped was determined in the same way as in Study 3. Specifically, the participant was chosen so that the exclusion would have the smallest possible effect on the mean productive posttest scores of the subgroup from which the participant was dropped (see 4.6). After one student was excluded, the Retrieval 1, 3, 5, and 7 groups consisted of 24, 24, 24, and 26 participants, respectively, leaving 98 students in total. The data from these 98 students were analysed in this study.
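The exclusion rule described above can be sketched as follows. This is only one possible reading of the rule, not the procedure documented in 4.6: it assumes a single productive posttest score per participant, and the function name, participant labels, and scores are all hypothetical.

```python
from statistics import mean


def participant_to_drop(scores: dict) -> str:
    """Return the participant whose removal shifts the subgroup mean the least.

    `scores` maps a participant label to one productive posttest score
    (an assumption made for illustration).
    """
    overall = mean(list(scores.values()))

    def shift(pid: str) -> float:
        # Absolute change in the subgroup mean if `pid` is removed.
        rest = [s for p, s in scores.items() if p != pid]
        return abs(mean(rest) - overall)

    return min(scores, key=shift)


# Hypothetical scores, not data from the study.
example = {"p1": 2, "p2": 5, "p3": 8, "p4": 6}
print(participant_to_drop(example))  # p2
```

In this reading, the participant whose score lies closest to the subgroup mean is the one removed, so the subgroup mean changes as little as possible.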
None of the participants exhibited prior knowledge of any of the target words on a productive pretest regardless of the scoring procedure (see 4.5.7 for details about the pretest). It is assumed, therefore, that the four retrieval frequency groups did not differ from one another in terms of their productive knowledge of the target words studied under the immediate and delayed feedback conditions at the outset of the experiment. The results of the receptive pretest will be discussed in 5.6. Although no data were available regarding their English proficiency, an analysis of learning phase performance suggested that the four retrieval frequency groups might not have differed significantly from each other in their ability to learn from flashcards (see 5.6.1).
5.5.2 Experimental design

There were three independent variables in the current study. The first independent variable was the retrieval frequency: one, three, five, and seven. (See below for the justification for examining the effects of up to seven retrievals.) The second independent variable was the timing of feedback: immediate and delayed. The third independent variable was the retention interval (RI): immediate, 1-week delayed, and 4-week delayed posttests. The present experiment employed a mixed design. Retrieval frequency was a between-participant variable, and the timing of feedback and RI were within-participant variables. The dependent variables were effectiveness and efficiency. Effectiveness was measured by the number of correct responses on the posttest. Efficiency was defined as the number of words acquired per minute and calculated by dividing the posttest score by the study time (e.g., Kornell, 2009; Mondria, 2003; Pyc & Rawson, 2007).
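The efficiency measure defined above amounts to a simple ratio. The sketch below is illustrative only; the function name and the example values are hypothetical and are not data from the study.

```python
def efficiency(posttest_score: float, study_minutes: float) -> float:
    """Words acquired per minute: posttest score divided by study time in minutes."""
    if study_minutes <= 0:
        raise ValueError("study time must be positive")
    return posttest_score / study_minutes


# Hypothetical example: 6 correct posttest responses after 12 minutes of study.
print(efficiency(6, 12.0))  # 0.5
```

Under this definition, a condition with a higher posttest score can still be less efficient if it requires proportionally more study time.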
Figure 20 summarises the design of the current study. As described in 5.5.1, the participants were randomly assigned to one of the four groups: Retrieval 1, 3, 5, and 7. The Retrieval 7 group consisted of 26 participants, and each of the remaining three groups consisted of 24 participants. The participants in each group were randomly divided into two subgroups: Subgroups X and Y. Sixteen target word pairs were also divided into two sets of eight items: Sets A and B. The two subgroups of participants in each group studied both sets of word pairs under different feedback conditions (immediate or delayed), thus counterbalancing the effects of target items (see Figure 20). In other words, Subgroup X in the Retrieval 1 group studied Set A under the immediate feedback condition and Set B under the delayed feedback condition. Subgroup Y in the Retrieval 1 group studied Set B under the immediate feedback condition and Set A under the delayed feedback condition. The same was true for the Retrieval 3, 5, and 7 groups.
Note. Each item set consisted of eight items.
Figure 20. Design of Study 4.
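The counterbalancing scheme in this design can be written out as a small lookup. This is a minimal sketch of the assignment described in the text; the variable and function names are hypothetical.

```python
from itertools import product

GROUPS = ("Retrieval 1", "Retrieval 3", "Retrieval 5", "Retrieval 7")


def feedback_assignment(subgroup: str) -> dict:
    """Map each feedback condition to the item set studied under it."""
    if subgroup == "X":
        return {"immediate": "A", "delayed": "B"}
    if subgroup == "Y":
        return {"immediate": "B", "delayed": "A"}
    raise ValueError(f"unknown subgroup: {subgroup}")


# The same assignment applies within every retrieval frequency group.
design = {(group, sub): feedback_assignment(sub) for group, sub in product(GROUPS, "XY")}

for cell, sets in design.items():
    print(cell, sets)
```

Because each item set appears under both feedback conditions within every group, differences in item difficulty between Sets A and B are balanced across the two feedback conditions.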
As noted in 5.1, previous studies on retrieval frequency used only two (Karpicke & Roediger, 2007a; Logan & Balota, 2008; Rohrer et al., 2005) or three levels of retrieval frequency (Karpicke & Roediger, 2007b; Tulving, 1967; Zaromb & Roediger, 2010). In order to obtain a more comprehensive picture regarding the effects of retrieval frequency, four retrieval frequency levels (one, three, five, and seven) were used in this study. The maximum number of retrievals was set to seven based on Karpicke and Roediger (2008). Karpicke and Roediger found that when the retrieval frequency was four and feedback was delayed, the meanings of about 80% of the target words were recalled correctly on a 1-week delayed, receptive recall posttest (see 5.1). Considering that the present study administered productive as well as receptive recall posttests and used a longer RI (4 weeks) than Karpicke and Roediger (2008), the effects of more than four retrievals might need to be examined. The maximum number of retrievals, therefore, was set to seven.
5.5.3 Target and filler words

In the pilot studies, the same 20 low frequency English words from Studies 1 and 3 were used as target items. Based on the results of the pilot studies, two changes were made to the target items in the main data collection (see 5.4). First, the number of target items was reduced from 20 to 16 to ensure that the participants would have enough time to complete the experiment. Second, items with low posttest scores in the pilot studies were replaced in order to lower task difficulty and prevent a possible floor effect on the delayed posttests. Table 49 presents English-Japanese word pairs used in the present experiment. Among the 16 target words, all items except husk, jibe, and urn were used as target items in Studies 1 and 3. Husk and urn were used as filler items in Studies 1 and 3, and jibe was used as a filler item in Study 3. Because Studies 1 and/or 3 suggested that husk, jibe, and urn were likely to be unfamiliar to participants and they might be relatively easy to learn, they were used as target items in this study in order to prevent a possible floor effect on the delayed posttests. All 16 target words met the three criteria for selection of target words described in 2.2.6.
Table 49
English-Japanese Word Pairs Used in Study 4

Target items
English     Japanese    POS   BNC
apparition  幽霊        N     14
billow      うねる      V     12
citadel     砦          N     12
dally       ふざける    V     15
gouge       彫る        V     11
grig        コオロギ    N     off-list
husk        皮          N     12
jibe        調和する    V     11
levee       堤防        N     18
loach       ドジョウ    N     off-list
mane        たてがみ    N     11
mirth       陽気        N     11
rue         後悔する    V     10
toupee      かつら      N     16
urn         つぼ        N     12
warble      さえずる    V     15

Filler items
English     Japanese    POS   BNC
lava        溶岩        N     10
tyro        初心者      N     15
valor       勇気        N     12

Note. POS = part of speech; BNC = BNC frequency level based on Nation (2004, 2006).
One disadvantage of decreasing the number of target words is that it may reduce the chance of detecting a difference between immediate and delayed feedback. As feedback timing was manipulated within participants (see Figure 20), there were only eight items per feedback type (16 / 2 = 8; Item set A or B) when the number of target items was reduced to 16. Despite this potential disadvantage, it was decided to decrease the target items from 20 to 16 for two reasons. First, reducing the target items may not considerably affect the probability of detecting a significant feedback timing effect because previous studies on paired-associate learning found significant differences between conditions using a relatively small number of items per condition: two (Cull et al., 1996, Experiment 2; Landauer & Bjork, 1978), four (Balota et al., 2006; Cull et al., 1996, Experiments 1 & 4; Logan & Balota, 2008), five (Cull et al., 1996, Experiment 3), six (Karpicke & Roediger, 2007a), and eight (Cull, 2000; Maddox et al., 2011). These studies suggest that even if the number of target items is reduced to eight per condition, we may still be able to observe a significant feedback timing effect provided that such an effect exists. Second, even if the number of target items is reduced, it may not considerably decrease the probability of detecting a significant effect of feedback timing because the cell size is relatively large (98; Figure 20). For the above two reasons, it was decided to reduce the number of target items from 20 to 16.
The 16 word pairs were divided into Sets A and B so that the learning difficulty would be distributed as evenly as possible (Figure 21). As in Study 3, learning difficulty was operationally defined as learners’ performance in Experiment 1A (see Table 50). Each item set consisted of five nouns and three verbs (Figure 21). It was not possible to use the data from Experiment 1B or Study 3 because Study 4 was conducted before they were complete. The data for jibe and urn are not included in Table 50 because they were not used in Experiment 1A. Although the pretest scores for husk are included in Table 50, its posttest scores are not included. This is because husk was used as primacy and recency buffers in Experiment 1A, and its posttest scores may not necessarily be an accurate index of learning difficulty due to serial position effects (see 2.2.6).
Set   Target words
A     billow*, gouge*, grig, jibe*, levee, loach, toupee, urn
B     apparition, citadel, dally*, husk, mane, mirth, rue*, warble*

Figure 21. Target items by item set. Words marked with an asterisk are verbs while others are nouns.
Table 50
Pretest and Posttest Scores in Experiment 1A

                                          Set A               Set B
Test                     Scoring       M        SD         M        SD
Receptive pretest        Strict       0.70%    0.78%      0.60%    0.77%
                         Sensitive    0.70%    0.78%      0.60%    0.77%
Receptive immediate      Strict      80.59%   10.24%     75.20%   14.34%
                         Sensitive   81.68%    8.46%     77.86%   12.56%
Receptive delayed        Strict      62.82%   11.83%     56.67%   18.60%
                         Sensitive   63.19%   11.36%     58.40%   17.45%
Productive immediate     Strict      70.88%    9.74%     67.03%   16.81%
                         Sensitive   77.84%    6.82%     76.61%   13.66%
Productive delayed       Strict      19.96%   10.22%     18.37%   13.96%
                         Sensitive   31.32%    8.46%     31.24%   13.74%

Note. n = 95 for the pretest. n = 91 for the immediate and delayed posttests because four students who demonstrated prior knowledge of one or more target words on the receptive pretest were excluded (see 2.2.3). The productive pretest was not administered in Experiment 1A.
The Mann-Whitney nonparametric tests showed that no statistically significant difference existed between the two sets in any of the following variables: learners’ scores on the receptive pretest, immediate receptive posttest, delayed receptive posttest, immediate productive posttest, and delayed productive posttest in Experiment 1A, U < 16.00, p > .473, r < .20. Because the posttest results for husk and pretest and posttest results for jibe and urn are not included, the above analysis does not necessarily guarantee that the two sets are completely equivalent in their difficulty. Yet, it was judged that a possible difference, if any, might not have a major effect on the results of the present study because effects of target items would be counterbalanced across participants (see Figure 20).
Three additional items were used as filler items: tyro, valor, and lava. They were chosen based on the same criteria as the target items (see 2.2.6). Filler items were used as primacy and recency buffers (see 5.5.5 for further details). Filler items were treated in exactly the same way as target items during the treatment, and participants
were not informed that filler items would be used. The number of filler items was set to three because the treatment required three primacy and recency buffers (see 5.5.5).
5.5.4 Procedure

The experiment was conducted with a computer program developed by the author. Although data were collected during regular class hours, participants studied and were tested individually, and the experimental settings were closer to those in laboratory studies than traditional classroom studies (Butler et al., 2007; Roediger et al., 2010). The experiment consisted of four sessions. In Session 1, participants received explanations about the study. Session 2 was conducted 1 week after Session 1 and consisted of the practice period, pretest, treatment, filler task, immediate posttest, and questionnaire. In Session 3, the 1-week delayed posttest was administered. In Session 4, the 4-week delayed posttest was administered. The procedure in this study differed from those of Experiment 1B and Study 3 in two respects. First, the treatment in this study differed from those in the previous experiments (see 5.5.5). Second, while the delayed posttest was administered 1 week after the treatment in the previous experiments, two delayed posttests (1-week and 4-week delayed) were administered in this study (see 5.5.7).
5.5.5 Treatment

Figure 22 presents the overview of the treatment in the present study. The treatment in this study consisted of the initial presentation, retrieval phases (e.g., R1, R2, and R3 in Figure 22), delayed feedback phases (e.g., D1, D2, and D3 in Figure 22), and final review. In the initial presentation, as in Studies 1 and 3, the English and Japanese words were presented simultaneously for 8 seconds per word pair. Each word pair was presented only once in the initial presentation. The initial presentation was followed by a series of alternating retrieval and delayed feedback phases. There were one, three, five, and seven sets of retrieval and delayed feedback phases for the Retrieval 1, 3, 5, and 7 groups, respectively (Figure 22). For items assigned to the immediate feedback condition, feedback was provided to the participants immediately after each response during the retrieval phase (e.g., R1, R2, and R3 in Figure 22). Delayed feedback was given after each retrieval phase (e.g., D1, D2, and D3 in Figure 22; see below for the justification).
Note. R = retrieval phase; D = delayed feedback phase.
Figure 22. Treatment in Study 4.
In the retrieval phase, as in Experiment 1B and Study 3, items were practised in a productive recall format (see 2.2.7). Each word pair appeared only once in each retrieval phase. Learners’ responses were categorised as correct, partially correct, or incorrect by the computer program, and different kinds of feedback were given for each type of response (see 4.5.6). As in Studies 1 and 3, the duration of feedback was
set to 5 seconds per response. For items assigned to the immediate feedback condition, feedback was provided to the participants immediately after each response during the retrieval phase. While immediate feedback was given immediately after each response in some studies (e.g., Butler et al., 2007; Carpenter & Vul, 2011; More, 1969; Sassenrath & Yonge, 1969; Sturges, 1969, 1972), in other studies (e.g., Kulhavy & Anderson, 1972; Paige, 1966; Phye & Andre, 1989; Phye & Baller, 1970; Phye, Gugliemella, & Sola, 1976), it was given after a shorter delay (e.g., 5 trials) than in the delayed feedback condition (e.g., 20 trials). In the present study, immediate feedback was provided immediately after each response because it may reflect authentic flashcard learning and increase ecological validity (see 5.2).
Delayed feedback was not given until the end of each retrieval phase (e.g., D1, D2, and D3 in Figure 22), where feedback for all eight delayed feedback items was presented one at a time. Delayed feedback was withheld until after each retrieval phase for two reasons. First, previous studies suggest that withholding feedback until the end of each retrieval phase may be effective (e.g., Karpicke & Bauernschmidt, 2011; Karpicke & Roediger, 2007b, 2008; Karpicke, 2009; Rohrer et al., 2005). Second, delaying feedback until after each retrieval phase may enhance learning because it may help introduce larger spacing for delayed feedback items. For instance, suppose that feedback for the first item (hereafter, Item 1) in the first retrieval phase (R1 in Figure 22) is given after the fifth item in the same retrieval phase. In this case, the retrieval and feedback opportunities for Item 1 are separated by four trials for
other items (the second to fifth items in the first retrieval phase). In contrast, suppose that feedback for Item 1 is withheld until the first delayed feedback phase (D1 in Figure 22). This will ensure that the retrieval and feedback opportunities for this item are separated by at least 15 trials for other items (the second to 16th items in the first retrieval phase). As larger spacing generally leads to better long-term retention than shorter spacing (lag effect; see 4.1.1), withholding feedback until the end of each retrieval phase may increase learning. Based on these reasons, delayed feedback was not given during the retrieval phase.
The retrieval and delayed feedback phases were followed by the final review (Brown, 1924; Kornell, 2009, Experiment 3; McGeoch, 1931, Experiment 3), where the English and Japanese words were presented simultaneously for 5 seconds per word pair. The duration of the final review was set to 5 seconds per word pair because previous studies suggest that 5 seconds is sufficient for learning (Karpicke & Roediger, 2008; Karpicke, Smith, et al., 2009). Each word pair was presented only once in the final review. The final review was included to control for lag to test (see 5.2) in the immediate and delayed feedback items. As each retrieval phase was followed by a delayed feedback phase (Figure 22), without the final review, the delayed feedback items would be clustered at the end of the treatment and have a shorter interval to the posttest than the immediate feedback items. The inclusion of the final review ensured that the immediate and delayed feedback items were controlled for lag to test.
In the Retrieval 1 group, participants were presented with the target word pairs three times (initial presentation + 1 feedback opportunity + final review). Similarly, participants in the Retrieval 3, 5, and 7 groups were exposed to the target word pairs five (initial presentation + 3 feedback opportunities + final review), seven (initial presentation + 5 feedback opportunities + final review), and nine times (initial presentation + 7 feedback opportunities + final review), respectively. Note that unlike some previous studies (Karpicke & Roediger, 2007a; Tulving, 1967; Zaromb & Roediger, 2010), higher retrieval frequency was not associated with lower exposure frequency in the current study (see 5.1). Because the present study did not employ a dropout schedule, all target items received the same retrieval frequency in a given group (e.g., in the Retrieval 3 group, all target items were practised three times regardless of the learner's performance). The retrieval frequency, therefore, was not confounded with the item difficulty either (see 5.1).
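The exposure counts above follow a simple rule: one initial presentation, one feedback opportunity per retrieval attempt, and one final review. The sketch below only restates that arithmetic in Python; the function name `exposure_count` is hypothetical and is not part of the flashcard program used in the study.

```python
def exposure_count(retrieval_frequency: int) -> int:
    """Number of times a word pair is shown in full.

    Assumed rule, taken from the description in the text:
    initial presentation + one feedback per retrieval + final review.
    """
    initial_presentation = 1
    final_review = 1
    return initial_presentation + retrieval_frequency + final_review


# Exposure counts for the four retrieval frequency groups.
for group in (1, 3, 5, 7):
    print(f"Retrieval {group}: {exposure_count(group)} exposures")
```

Under this rule, exposure frequency rises with retrieval frequency, which is the property the paragraph contrasts with some earlier studies.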
Order of items The order of items during the treatment was determined based on five principles. First, in order to lessen the influence of serial position effects, the three filler items (husk, jibe, and urn) were used as primacy and recency buffers at the beginning and end of the treatment in all four groups (Karpicke & Roediger, 2007a; see 2.2.6). Second, in order to ensure that the order of items would not offer inappropriate help in remembering (Mondria & Mondria-de Vries, 1994; Nation, 2001, pp. 306-307), the
item order was randomised at the beginning of each retrieval phase, delayed feedback phase, and final review. Third, the item order in the retrieval phase was determined randomly by the flashcard program with the constraint that retrieval attempts of a given item were separated by at least eight retrieval attempts for other items. For instance, if a given item appeared as the last item in the first retrieval phase, it did not appear until the ninth position in the second retrieval phase so that there would be at least eight intervening retrieval attempts (first - eighth items in the second retrieval phase). Similarly, the item order in the delayed feedback phase and final review was determined randomly with the constraint that trials for a given item were separated by at least four trials for other items. A smaller interval was not used because larger spacing generally leads to better long-term retention than shorter spacing (lag effect; see 4.1.1). Fourth, the immediate and delayed feedback items alternated throughout the treatment to ensure that the trials for both types of items would be distributed roughly equally across the treatment. This was done because the positions of the trials might affect learning (e.g., Delaney et al., 2010; Karpicke & Roediger, 2007a). The only exception to this principle was the delayed feedback phase, where the eight delayed feedback items were presented without any intervening immediate feedback items. Fifth, the item order was randomised anew for each participant to minimise potential order effects.
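The spacing constraint in the third principle can be illustrated with a short Python sketch. It is not the flashcard program used in the study: the function names, the rejection-sampling strategy, and the integer item labels are assumptions made for illustration, and the sketch ignores the filler buffers and the alternation of immediate and delayed feedback items.

```python
import random


def intervening_trials(prev_order, next_order, item):
    """Trials for other items between an item's attempts in two consecutive phases."""
    after_in_prev = len(prev_order) - 1 - prev_order.index(item)
    before_in_next = next_order.index(item)
    return after_in_prev + before_in_next


def next_phase_order(prev_order, min_gap=8, rng=None, max_tries=100_000):
    """Shuffle until every item is separated from its previous attempt by at least min_gap trials."""
    rng = rng or random.Random()
    candidate = list(prev_order)
    for _ in range(max_tries):
        rng.shuffle(candidate)
        if all(intervening_trials(prev_order, candidate, item) >= min_gap
               for item in candidate):
            return list(candidate)
    raise RuntimeError("no order satisfying the spacing constraint was found")


rng = random.Random(2013)      # fixed seed only so the sketch is reproducible
first_phase = list(range(16))  # 16 hypothetical item labels
rng.shuffle(first_phase)
second_phase = next_phase_order(first_phase, min_gap=8, rng=rng)
print(second_phase)
```

Rejection sampling keeps every admissible order equally likely, at the cost of a few dozen discarded shuffles on average for 16 items and a minimum gap of eight. The item that came last in the first phase can appear no earlier than the ninth position of the second phase, as in the example given in the text.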
5.5.6 Pretest
Immediately before the treatment, productive and receptive recall tests were given as the pretest in that order. The formats of the pretests were exactly the same as in Experiment 1B and Study 3 (see 2.3.4 and 4.5.7). In addition to the 16 target words, three filler items and three practice words (apple, orange, and banana) were tested. There was one question for each word, and each pretest consisted of 22 questions. At the beginning of the pretest, the three practice words were tested in order to familiarise participants with the test format. Responses for filler items and practice words were not included in the results.
As in Experiment 1B and Study 3, in order to prevent learners from providing synonyms in the productive pretest, one letter in the target word and the number of letters in the word (hereafter, retrieval cue) were given together with the Japanese translation (e.g., _ _ n _ for mane). The retrieval cues in the present study were determined based on the same four rules as in Experiment 1B and Study 3 (2.3.4 and 4.5.7). Appendix G summarises the retrieval cues used in this study. See Appendix A for examples of the productive and receptive pretests. Responses on the pretest were scored using the same procedures as in Studies 1 and 3: strict and sensitive (see 2.2.11).
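The format of the retrieval cue can be sketched as follows. This is an illustration only: the function name `make_retrieval_cue` and the `shown_index` parameter are hypothetical, and the four rules used in the study to decide which letter is revealed are not reproduced here, so the caller supplies the position directly.

```python
def make_retrieval_cue(word: str, shown_index: int) -> str:
    """Show one letter of the target word and replace every other letter with an underscore."""
    if not 0 <= shown_index < len(word):
        raise ValueError("shown_index must point at a letter of the word")
    return " ".join(letter if position == shown_index else "_"
                    for position, letter in enumerate(word))


# Revealing the third letter of "mane" gives the cue quoted in the text.
print(make_retrieval_cue("mane", 2))
```

The cue conveys both the word length and one letter, which is what narrows the response down to the target word rather than a synonym.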
5.5.7 Dependent measures
Productive and receptive recall tests were given as the posttest in that order. The formats of the posttests were exactly the same as in Studies 1 and 3 (see 2.2.10). In addition to the 16 target words, three filler items were tested. There was one question for each word, and each posttest consisted of 19 questions. The three filler items were tested at the beginning of each posttest in order to familiarise participants with the test format. See Appendix B for examples of the productive and receptive posttests. Responses on the posttest were scored using the same procedures as in Studies 1 and 3: strict and sensitive (see 2.2.11). Responses for filler items were not included in the results.
Immediate and delayed posttests were administered in the present study. The immediate posttest was given on the same day as the treatment. The delayed posttests were administered 1 and 4 weeks after the treatment. The RIs of 1 and 4 weeks were chosen based on Rohrer et al. (2005). They found that although repeated retrieval may be effective at a 1-week RI, its advantage may not be observed if a posttest is given at an RI greater than 1 week (see 5.1). Their findings suggest that when examining the effects of retrieval frequency, it may be useful to use an RI greater than 1 week as well as one of 1 week or shorter. For this reason, the RIs of 1 and 4 weeks were chosen. The delayed posttests were administered without prior notice so that participants would not review the target words during the period between the treatment and delayed posttests.
5.5.8 Questionnaire
After the immediate posttest, a questionnaire was administered in Japanese to examine participants’ perceptions about the effects of retrieval frequency and feedback timing on learning. First, participants were asked to indicate whether they thought the retrieval frequency during the treatment (e.g., one in the Retrieval 1 group and seven in the Retrieval 7 group) was sufficient for learning on a 7-point scale, where 1 means Insufficient and 7 means Sufficient. The participants were also asked to evaluate the effectiveness of immediate and delayed feedback on a 7-point scale, where 1 means I learned more with immediate feedback and 7 means I learned more with delayed feedback.
5.6 Results
Pretest
None of the participants exhibited prior knowledge of any of the target words on the productive pretest. Nor did participants provide synonyms for a target word on the productive pretest, possibly due in part to the provision of retrieval cues (e.g., _ _ n _ for mane). Table 51 summarises the results of the receptive pretest. In order to correct for differences in the pretest scores, gains from the pretest to the posttest were analysed when examining the receptive test results (see 5.6.1 and 5.6.2).
Table 51
Number of Correct Responses on the Receptive Pretest

                               Strict scoring            Sensitive scoring
Groups                         IMFB    DLFB    Total     IMFB    DLFB    Total
Retrieval 1 (n = 24)    M      0.00    0.13    0.13      0.00    0.13    0.13
                        SD     0.00    0.34    0.34      0.00    0.34    0.34
Retrieval 3 (n = 24)    M      0.04    0.08    0.13      0.04    0.13    0.17
                        SD     0.20    0.28    0.34      0.20    0.34    0.38
Retrieval 5 (n = 24)    M      0.00    0.08    0.08      0.00    0.08    0.08
                        SD     0.00    0.28    0.28      0.00    0.28    0.28
Retrieval 7 (n = 26)    M      0.12    0.00    0.12      0.15    0.00    0.15
                        SD     0.33    0.00    0.33      0.37    0.00    0.37

Note. The maximum score is 8 for IMFB and DLFB and 16 for Total. IMFB = immediate feedback condition; DLFB = delayed feedback condition.
5.6.1 Effects of retrieval frequency Study time First, we will consider the effects of retrieval frequency. The data obtained in the immediate and delayed feedback conditions were added together when examining the effects of retrieval frequency. For instance, if a participant scored 7 under the immediate feedback condition and 8 under the delayed feedback condition on a given posttest, 15 was used as his or her posttest score because both of the scores had the same retrieval frequency. Let us examine at the outset whether significant differences existed among the four groups in study time. The average study time (SDs in parentheses) was 6.48 (1.12), 12.03 (1.73), 17.98 (1.59), and 22.82 (3.31) minutes in the Retrieval 1, 3, 5, and 7 groups, respectively. A one-way ANOVA found a statistically significant difference among the four groups, F (3, 97) = 274.17, p < .001, η2 = .81. The Bonferroni method of multiple comparisons showed that all four groups were significantly different from each other (p < .001), producing large effect sizes
(1.88 < d < 8.39).
Learning phase performance There were one, three, five, and seven retrieval attempts for each target word during the treatment in the Retrieval 1, 3, 5, and 7 groups, respectively (see 5.5.5). Table 52 summarises the number of correct responses in the four groups. For instance, the table shows that in the Retrieval 7 group, the average number of correct responses was 5.54, 8.69, 11.23, 13.50, 13.88, 14.96, and 14.58 out of 16 for each retrieval attempt. In order to test whether significant differences existed among the four groups in their ultimate attainment during the learning phase, the number of correct responses on the last retrieval attempt in each group (e.g., the first retrieval in the Retrieval 1 group and the seventh retrieval in the Retrieval 7 group) was submitted to a one-way ANOVA with the retrieval frequency group as an independent variable.
Table 52
Number of Correct Responses During the Learning Phase

                               Retrieval attempts
Groups                         1       2       3       4       5       6       7
Retrieval 1 (n = 24)    M      5.13
                        SD     2.82
Retrieval 3 (n = 24)    M      5.00    8.08    10.96
                        SD     3.38    3.98    3.95
Retrieval 5 (n = 24)    M      4.79    8.83    12.04   13.92   15.17
                        SD     2.93    3.41    3.52    2.64    1.69
Retrieval 7 (n = 26)    M      5.54    8.69    11.23   13.50   13.88   14.96   14.58
                        SD     4.37    4.58    4.32    3.34    2.88    2.18    2.25

Note. The maximum score is 16 for each cell. Responses were scored with the strict scoring procedure (see 2.2.11).
The ANOVA found a statistically significant difference among the four groups, F (3, 97) = 66.29, p < .001, η2 = .46. The Bonferroni method of multiple comparisons showed that no statistically significant difference existed between the Retrieval 5 and 7 groups (p = 1.000), producing a small effect size (d = 0.30). The differences were statistically significant for all other comparisons (p < .001), and large effect sizes were observed (1.16 < d < 4.33). The difference between the Retrieval 5 and 7 groups was not significant presumably because they neared the ceiling on the last retrieval attempt (Retrieval 5 group: 94.8%; Retrieval 7 group: 91.1%).
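As a rough check on the size of the Retrieval 5 versus Retrieval 7 difference, the effect size can be approximated from the means and SDs of the last retrieval attempt in Table 52. The formula for d is not restated in this section, so the sketch below assumes a common definition in which the mean difference is divided by the root mean square of the two SDs; the function name `cohens_d` is hypothetical.

```python
import math


def cohens_d(mean_a: float, sd_a: float, mean_b: float, sd_b: float) -> float:
    """Standardised mean difference, using the root mean square of the two SDs (assumed definition)."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


# Last retrieval attempt in Table 52:
# Retrieval 5 (M = 15.17, SD = 1.69) versus Retrieval 7 (M = 14.58, SD = 2.25).
d = cohens_d(15.17, 1.69, 14.58, 2.25)
print(f"d = {d:.2f}")
```

Under this assumption the result is close to the reported d = 0.30. A pooled SD weighted by group size would give a marginally different figure.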
Because all four groups underwent identical experimental procedures during the period up to the first retrieval attempt (i.e., initial presentation + first retrieval; Figure 22), we may be able to expect all groups to have produced a similar level of retrieval success on the first attempt as long as they did not differ in their ability to learn from flashcards. Similarly, the Retrieval 3, 5, and 7 groups should have yielded similar levels of retrieval success on the second and third retrieval attempts, and the Retrieval 5 and 7 groups should have performed equally well on the fourth and fifth retrieval attempts provided that they were equivalent in their ability to learn from flashcards. In order to test the above assumptions, three separate analyses were conducted. First, the number of correct responses obtained by the four retrieval frequency groups on the first retrieval attempt was submitted to a one-way ANOVA with the retrieval frequency group (Retrieval 1 / 3 / 5 / 7) as an independent variable. No statistically significant difference was found among the four groups, F (3, 97) = 0.21, p = .890, and the effect size was negligible (η2 < .01). The result suggests that the four groups did not differ significantly from each other in their degree of success on the first retrieval attempt.
Second, for the Retrieval 3, 5, and 7 groups, the number of correct responses on the second and third retrieval attempts was submitted to a two-way mixed design 3 (group: Retrieval 3 / 5 / 7) x 2 (retrieval attempt: 2 / 3) ANOVA. Neither the main effect of group, F (2, 71) = 0.34, p = .712, partial η2 = .01, nor the interaction between the group and retrieval attempt was significant, F (2, 71) = 0.61, p = .549, partial η2 = .02. Third, for the Retrieval 5 and 7 groups, the number of correct responses on the fourth and fifth retrieval attempts was submitted to a two-way mixed design 2 (group: Retrieval 5 / 7) x 2 (retrieval attempt: 4 / 5) ANOVA. The ANOVA detected a significant interaction between the group and retrieval attempt, F (1, 48) = 6.35, p = .015, partial η2 = .12. The main effect of group was not significant, F (1, 48) = 1.28, p = .263, partial η2 = .03. Due to the significant interaction, the simple main effect of group was tested. The simple main effect of group was not significant on the fourth, F (1, 48) = 0.24, p = .628, or fifth retrieval attempt, F (1, 48) = 3.62, p = .063, indicating that the difference between the Retrieval 5 and 7 groups was not significant on either retrieval. Taken together, the results suggest that all four groups produced roughly similar levels of retrieval success during the period when they underwent identical experimental procedures. Thus, it might be reasonable to assume that the four retrieval frequency groups did not differ significantly from each other in their ability to learn from flashcards.
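For a single effect, partial η2 can be recovered from the reported F ratio and its degrees of freedom through the standard identity partial η2 = (F x df1) / (F x df1 + df2). The Python sketch below applies that identity to values reported above as a consistency check; the function name `partial_eta_squared` is hypothetical, and the check says nothing about how the original analyses were run.

```python
def partial_eta_squared(f_ratio: float, df_effect: float, df_error: float) -> float:
    """Partial eta squared implied by an F ratio and its degrees of freedom."""
    if f_ratio < 0 or df_effect <= 0 or df_error <= 0:
        raise ValueError("F must be non-negative and both dfs must be positive")
    return (f_ratio * df_effect) / (f_ratio * df_effect + df_error)


# Values reported for the analyses of the learning phase.
reported = [
    ("group x attempt, Retrieval 5 / 7", 6.35, 1, 48),
    ("group, Retrieval 5 / 7", 1.28, 1, 48),
    ("group x attempt, Retrieval 3 / 5 / 7", 0.61, 2, 71),
]
for label, f_ratio, df1, df2 in reported:
    print(f"{label}: partial eta squared = {partial_eta_squared(f_ratio, df1, df2):.2f}")
```

The three values round to .12, .03, and .02, matching the figures reported in the text.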
Posttest performance Table 53 provides the immediate and delayed posttest results for the four retrieval frequency groups. The productive and receptive test scores were analysed by a three-way mixed design 4 (group: Retrieval 1 / 3 / 5 / 7) x 2 (feedback timing: immediate / delayed) x 3 (RI: immediate / 1-week delayed / 4-week delayed) ANOVA. As some items were answered correctly on the receptive pretest (Table 51), the pretest scores were subtracted from the posttest scores, and gains were analysed when examining the receptive test results. Table 54 shows the results of the ANOVAs. As shown by the table, the ANOVAs showed significant main effects of group and RI on both productive and receptive tests with both scoring procedures. The interaction between the group and RI was not significant on either the productive or receptive posttest regardless of the scoring system.
Table 53
Number of Correct Responses on the Posttests

                                     RI
                                     Immediate            1 week               4 weeks              Collapsed across the RIs
Posttests    Groups                  Strict   Sensitive   Strict   Sensitive   Strict   Sensitive   Strict   Sensitive
Productive   Retrieval 1      M      9.46     11.33       4.42     6.13        3.88     5.71        5.92     7.72
             (n = 24)         SD     4.08     4.09        3.84     4.26        3.70     4.48        3.32     3.59
             Retrieval 3      M      10.83    12.38       4.67     6.08        4.04     5.92        6.51     8.13
             (n = 24)         SD     4.25     3.84        3.77     4.04        3.30     3.34        3.37     3.29
             Retrieval 5      M      15.50    15.75       7.83     10.67       7.67     10.54       10.33    12.32
             (n = 24)         SD     0.93     0.74        3.42     3.82        4.42     4.16        2.55     2.60
             Retrieval 7      M      15.23    15.62       9.46     11.08       7.81     10.62       10.83    12.44
             (n = 26)         SD     1.68     1.10        3.92     3.88        4.11     4.46        2.72     2.64
Receptive    Retrieval 1      M      11.42    11.83       10.08    10.33       9.58     9.88        10.36    10.68
             (n = 24)         SD     3.53     3.50        4.05     4.11        3.54     3.67        3.47     3.50
             Retrieval 3      M      11.38    11.79       10.13    10.21       9.58     9.71        10.36    10.57
             (n = 24)         SD     3.87     4.02        4.43     4.42        3.98     4.05        3.65     3.71
             Retrieval 5      M      14.33    15.00       13.50    14.13       13.63    13.92       13.82    14.35
             (n = 24)         SD     1.58     1.32        1.89     1.78        2.24     2.10        1.65     1.46
             Retrieval 7      M      14.73    15.27       14.23    14.54       13.35    13.69       14.10    14.50
             (n = 26)         SD     1.73     1.04        1.99     1.88        3.43     3.52        1.96     1.86

Note. The maximum score is 16 for each cell. Strict = strict scoring; Sensitive = sensitive scoring (see 2.2.11).
Due to the significant main effect of group (Table 54), the Bonferroni method of multiple comparisons was used to investigate where the significant differences lay at each RI. The results of the multiple comparisons are summarised in Table 55 and Table 56. For instance, Table 55 shows that with strict scoring on the productive posttest, the difference between the Retrieval 1 and 3 groups was not statistically significant (p = .747), and a small effect size was observed (d = 0.33). Table 55 and Table 56 indicate that regardless of the posttest, scoring system, and RI, the Retrieval 5 and 7 groups significantly outperformed the Retrieval 1 and 3 groups, producing large effect sizes (0.88 < d < 2.04). However, no statistically significant difference existed between the Retrieval 1 and 3 groups on one hand and between the Retrieval 5 and 7 groups on the other, yielding no or small effect sizes (d < 0.45). The findings suggest the following order on all posttests: Retrieval 5 = 7 > 1 = 3. One caveat to be considered is that the immediate posttest scores of the Retrieval 5 and 7 groups neared the ceiling (Table 53). Hence, the lack of statistical significance between these two groups on the immediate posttest may be partly ascribed to a possible ceiling effect. On the delayed posttest, however, neither a ceiling nor floor effect was observed. The lack of significant differences between the Retrieval 5 and 7 groups on the delayed posttest, therefore, seems to indicate that seven retrievals may not significantly increase learning compared with five.
Table 54
Results of Three-Way ANOVAs for the Posttest Scores

Productive posttest
                             Strict scoring                          Sensitive scoring
Effects                      df      F        p       partial η2     df      F        p       partial η2
Group                        3       17.57    .000    .36            3       17.47    .000    .36
RI                           1.87    241.42   .000    .72            1.84    144.17   .000    .61
Group X RI                   6       1.80     .100    .05            6       0.60     .727    .02
Feedback timing              1       0.06     .814    .00            1       0.17     .678    .00
Group X Feedback timing      3       0.78     .511    .02            3       2.00     .119    .06
Feedback timing X RI         2       0.60     .550    .01            2       2.07     .129    .02
Group X RI X Feedback        6       0.64     .702    .02            6       0.93     .472    .03

Receptive posttest
                             Strict scoring                          Sensitive scoring
Effects                      df      F        p       partial η2     df      F        p       partial η2
Group                        3       13.24    .000    .30            3       14.97    .000    .32
RI                           1.77    16.32    .000    .15            1.76    22.08    .000    .19
Group X RI                   6       0.76     .601    .02            6       0.57     .754    .02
Feedback timing              1       0.11     .741    .00            1       0.22     .641    .00
Group X Feedback timing      3       1.51     .217    .05            3       0.70     .553    .02
Feedback timing X RI         2       4.87     .009    .05            2       3.69     .027    .04
Group X RI X Feedback        6       0.33     .921    .01            6       0.87     .518    .03

Note. Some dfs contain decimal values due to the Greenhouse-Geisser correction (Field, 2009). The main effect of feedback timing and interactions involving feedback timing will be discussed in 5.6.2.
Table 55
Results of Multiple Comparisons on the Productive Posttest

Strict scoring
             Retrieval    1               3               5               7
RI           frequency    p       d       p       d       p       d       p       d
Immediate    1
             3            .747    0.33
             5            .000    2.04    .000    1.52
             7            .000    1.92    .000    1.41    1.000   0.20
1 week       1
             3            1.000   0.07
             5            .013    0.94    .026    0.88
             7            .000    1.33    .000    1.27    .770    0.45
4 weeks      1
             3            1.000   0.05
             5            .007    0.93    .011    0.93
             7            .004    1.24    .006    1.29    1.000   0.11

Sensitive scoring
             Retrieval    1               3               5               7
RI           frequency    p       d       p       d       p       d       p       d
Immediate    1
             3            1.000   0.26
             5            .000    1.50    .001    1.22
             7            .000    1.95    .001    1.53    1.000   0.15
1 week       1
             3            1.000   0.01
             5            .001    1.12    .001    1.17
             7            .000    1.24    .000    1.29    1.000   0.11
4 weeks      1
             3            1.000   0.05
             5            .001    1.12    .001    1.23
             7            .000    1.12    .001    1.21    1.000   0.02

Note. Effect sizes (d) of 0.20, 0.50, and 0.80 are indicative of small, medium, and large effects, respectively (Cohen, 1988).
Table 56
Results of Multiple Comparisons on the Receptive Posttest

Strict scoring
             Retrieval    1               3               5               7
RI           frequency    p       d       p       d       p       d       p       d
Immediate    1
             3            1.000   0.01
             5            .004    1.07    .003    0.99
             7            .001    1.22    .001    1.13    1.000   0.22
1 week       1
             3            1.000   0.01
             5            .003    1.10    .003    0.98
             7            .000    1.36    .000    1.22    1.000   0.35
4 weeks      1
             3            1.000   0.00
             5            .000    1.37    .000    1.24
             7            .001    1.11    .001    1.03    1.000   0.11

Sensitive scoring
             Retrieval    1               3               5               7
RI           frequency    p       d       p       d       p       d       p       d
Immediate    1
             3            1.000   0.02
             5            .001    1.20    .001    1.08
             7            .000    1.35    .000    1.21    1.000   0.17
1 week       1
             3            1.000   0.04
             5            .001    1.22    .000    1.16
             7            .000    1.37    .000    1.30    1.000   0.19
4 weeks      1
             3            1.000   0.05
             5            .001    1.36    .000    1.31
             7            .001    1.08    .001    1.07    1.000   0.10
Because the ANOVAs detected significant main effects of RI on both productive and receptive tests (Table 54), the Bonferroni method of multiple comparisons was used to examine whether the posttest scores decreased significantly as a function of the RI. The multiple comparisons showed that when collapsed across the four retrieval frequency groups, with strict scoring on the productive posttest, the immediate posttest scores were significantly higher than both the 1-week (p < .001) and 4-week delayed posttest scores (p < .001), producing large effect sizes (1.44 < Δ < 1.61). The difference between the 1- and 4-week RIs was also statistically significant (p = .042), but only a small effect size was observed (Δ = 0.18). With sensitive scoring on the productive posttest, the immediate posttest scores were significantly higher than the 1-week (p < .001) and 4-week delayed posttest scores (p < .001), and large effect sizes were found (1.14 < Δ < 1.18). The difference between the 1- and 4-week RIs, however, was not statistically significant (p = 1.000), yielding a very small effect size (Δ = 0.06).
On the receptive posttest, when collapsed across the four groups, the immediate test scores were significantly higher than the 1-week (p < .001) and 4-week delayed test scores (p < .001) with both strict and sensitive scoring. Yet, only small effect sizes were found (0.26 < Δ < 0.42). The differences between the two delayed posttest scores were not statistically significant regardless of the scoring procedure (p > .052), producing very small effect sizes (0.12 < Δ < 0.13). The results can be summarised as follows: on the productive test, immediate > 1-week delayed ≥ 4-week delayed (the difference between the 1- and 4-week RIs was significant only with strict scoring); on the receptive test, immediate > 1-week delayed = 4-week delayed.
Efficiency As there was a statistically significant difference in study time among the four retrieval frequency groups (see above), the efficiency of the four groups was also compared. Efficiency was defined as the number of words acquired per minute and calculated by dividing the posttest score by the study time (e.g., Kornell, 2009; Mondria, 2003; Pyc & Rawson, 2007). Because some items were answered correctly on the receptive pretest (Table 51), gains were used when calculating the efficiency scores for the receptive tests. Table 57 provides the efficiency scores in the four groups.
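The efficiency measure defined above can be written out as a short Python sketch. The function name `efficiency` and its parameters are hypothetical. The example feeds in group means reported in this chapter, whereas the study presumably computed the ratio for each participant before averaging, so the figures below only approximate the tabulated efficiency scores.

```python
def efficiency(posttest_score: float, study_minutes: float, pretest_score: float = 0.0) -> float:
    """Words gained per minute of study: (posttest - pretest) / study time."""
    if study_minutes <= 0:
        raise ValueError("study time must be positive")
    return (posttest_score - pretest_score) / study_minutes


# Retrieval 1 group, immediate posttest, strict scoring, using group means:
# study time 6.48 minutes; productive M = 9.46; receptive M = 11.42 with pretest M = 0.13.
productive = efficiency(9.46, 6.48)
receptive = efficiency(11.42, 6.48, pretest_score=0.13)
print(f"productive: {productive:.2f} words per minute")
print(f"receptive:  {receptive:.2f} words per minute")
```

Because study time sits in the denominator, a group with lower posttest scores can still be the most efficient if it studied for much less time, which is why efficiency is analysed separately from the raw posttest scores.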
Table 57
Efficiency Scores in the Four Retrieval Frequency Groups

                                     RI
                                     Immediate            1 week               4 weeks
Posttests    Groups                  Strict   Sensitive   Strict   Sensitive   Strict   Sensitive
Productive   Retrieval 1      M      1.45     1.74        0.67     0.93        0.57     0.85
             (n = 24)         SD     0.58     0.60        0.59     0.65        0.50     0.60
             Retrieval 3      M      0.90     1.02        0.39     0.51        0.34     0.49
             (n = 24)         SD     0.33     0.30        0.33     0.35        0.28     0.28
             Retrieval 5      M      0.87     0.88        0.43     0.59        0.43     0.58
             (n = 24)         SD     0.10     0.09        0.18     0.20        0.25     0.22
             Retrieval 7      M      0.68     0.70        0.43     0.50        0.35     0.48
             (n = 26)         SD     0.13     0.12        0.21     0.21        0.21     0.23
Receptive    Retrieval 1      M      1.74     1.81        1.54     1.58        1.46     1.50
             (n = 24)         SD     0.50     0.50        0.60     0.61        0.50     0.52
             Retrieval 3      M      0.93     0.96        0.84     0.85        0.80     0.80
             (n = 24)         SD     0.32     0.33        0.38     0.38        0.34     0.34
             Retrieval 5      M      0.80     0.84        0.75     0.78        0.75     0.77
             (n = 24)         SD     0.10     0.10        0.10     0.10        0.12     0.11
             Retrieval 7      M      0.66     0.68        0.63     0.65        0.59     0.61
             (n = 26)         SD     0.14     0.12        0.14     0.14        0.19     0.20
In order to test whether any significant difference existed among the four groups, the efficiency scores were entered into a two-way mixed design 4 (retrieval frequency group) x 3 (RI) ANOVA. Table 58 shows the results of the ANOVAs. The ANOVAs showed significant main effects of group and RI on both productive and receptive tests with both scoring procedures. The interaction between the two variables was also significant on both productive and receptive tests regardless of the scoring system. Due to the significant interaction, the simple main effect of retrieval frequency was tested to investigate where the significant differences lay. The simple main effect of retrieval frequency fell short of significance with strict scoring on the 4-week delayed productive posttest, F (3, 94) = 2.52, p = .062, but was significant on all other posttests, F (3, 94) > 2.92, p < .038. To follow up the significant simple main effect,
the Bonferroni method of multiple comparisons was used to investigate where the significant differences lay at each RI. The results of the multiple comparisons are summarised in Table 59 and Table 60.

Table 58
Results of Two-Way ANOVAs for the Efficiency Scores

                              Strict scoring                          Sensitive scoring
Posttests     Effects         df      F        p       partial η2     df      F        p       partial η2
Productive    Group           3       9.90     .000    .24            3       20.53    .000    .40
              RI              2       164.64   .000    .64            2       104.98   .000    .53
              Group X RI      6       7.90     .000    .20            6       9.87     .000    .24
Receptive     Group           3       49.14    .000    .61            3       51.85    .000    .62
              RI              1.83    16.01    .000    .15            1.77    19.61    .000    .17
              Group X RI      6       2.84     .011    .08            6       2.89     .010    .08
Table 59
Results of Multiple Comparisons for Efficiency Scores on the Productive Posttest

Strict scoring
             Retrieval    1               3               5               7
RI           frequency    p       d       p       d       p       d       p       d
Immediate    1
             3            .000    1.16
             5            .000    1.38    1.000   0.11
             7            .000    1.88    .182    0.87    .357    1.60
1 week       1
             3            .059    0.58
             5            .162    0.54    1.000   0.15
             7            .132    0.56    1.000   0.14    1.000   0.02
4 weeks      1
             3            .097    0.57
             5            .816    0.36    1.000   0.34
             7            .134    0.58    1.000   0.07    1.000   0.33

Sensitive scoring
             Retrieval    1               3               5               7
RI           frequency    p       d       p       d       p       d       p       d
Immediate    1
             3            .000    1.50
             5            .000    2.00    .914    0.64
             7            .000    2.50    .007    1.47    .366    1.83
1 week       1
             3            .002    0.81
             5            .020    0.71    1.000   0.28
             7            .001    0.92    1.000   0.03    1.000   0.44
4 weeks      1
             3            .007    0.75
             5            .085    0.58    1.000   0.36
             7            .003    0.84    1.000   0.06    1.000   0.48
Table 60
Results of Multiple Comparisons for Efficiency Scores on the Receptive Posttest

Strict scoring
             Retrieval    1               3               5               7
RI           frequency    p       d       p       d       p       d       p       d
Immediate    1
             3            .000    1.93
             5            .000    2.63    .717    0.58
             7            .000    3.10    .010    1.18    .631    1.19
1 week       1
             3            .000    1.40
             5            .000    1.85    1.000   0.34
             7            .000    2.17    .272    0.75    1.000   0.92
4 weeks      1
             3            .000    1.56
             5            .000    1.95    1.000   0.16
             7            .000    2.38    .163    0.76    .466    1.00

Sensitive scoring
             Retrieval    1               3               5               7
RI           frequency    p       d       p       d       p       d       p       d
Immediate    1
             3            .000    1.99
             5            .000    2.68    .902    0.53
             7            .000    3.22    .008    1.22    .427    1.47
1 week       1
             3            .000    1.45
             5            .000    1.83    1.000   0.23
             7            .000    2.21    .334    0.73    1.000   1.17
4 weeks      1
             3            .000    1.60
             5            .000    1.95    1.000   0.12
             7            .000    2.38    .229    0.72    .467    1.03
Table 59 and Table 60 indicate the following four things. First, the Retrieval 1 group was significantly more efficient than the other three groups with strict and sensitive scoring on the immediate productive posttest (p < .001) and with sensitive scoring on the 1-week delayed productive posttest (p < .020), yielding medium to large effect sizes (0.71 < d < 2.50). Second, on the receptive posttest, the Retrieval 1 group was significantly more efficient than the other three groups regardless of the RI and scoring system (p < .001), and large effect sizes were observed (1.40 < d < 3.22). Third, with sensitive scoring on the 4-week delayed productive posttest, the Retrieval 1 group was significantly more efficient than the Retrieval 3 and 7 groups (p < .007), and medium to large effect sizes were found (0.75 < d < 0.84). Fourth, the Retrieval 3 group was more efficient than the Retrieval 7 group with sensitive scoring on the immediate productive posttest and with strict and sensitive scoring on the immediate receptive posttest (p < .010), producing large effect sizes (1.18 < d < 1.47). The differences were not statistically significant for any other comparison. Overall, the results indicate that (a) the Retrieval 1 group was the most efficient among the four groups, and (b) the Retrieval 3 group was more efficient than the Retrieval 7 group on the immediate posttests except with strict scoring on the productive test.
Questionnaire In the questionnaire given after the immediate posttest, participants were asked to indicate whether they thought the retrieval frequency during the treatment (e.g., one in the Retrieval 1 group and seven in the Retrieval 7 group) was sufficient for learning
on a 7-point scale, where 1 means Insufficient and 7 means Sufficient (see 5.5.8). The average rating (SDs in parentheses) was 3.79 (1.91), 4.54 (1.53), 6.21 (0.98), and 6.23 (0.95) out of 7 for the Retrieval 1, 3, 5, and 7 groups, respectively. A one-way ANOVA found a statistically significant difference among the four groups, F (3, 96) = 18.89, p < .001, η2 = .14. The Bonferroni method of multiple comparisons showed that no statistically significant difference existed between the Retrieval 1 and 3 groups on one hand (p = .392) and between the Retrieval 5 and 7 groups on the other (p = 1.000), yielding no more than small effect sizes (0.02 < d < 0.43). The differences were statistically significant for all other comparisons (p < .001), and large effect sizes were observed (1.30 < d < 1.67). The findings suggest the following order in the ratings by learners: Retrieval 5 = 7 > 1 = 3, which is consistent with the posttest results. The results seem to indicate that learners may be aware of the benefits of repeated retrieval.
5.6.2 Effects of feedback timing Study time Next, we will consider the effects of immediate and delayed feedback. First, let us examine whether the study time was comparable between the two feedback conditions. Table 61 summarises the study time as a function of retrieval frequency and feedback timing. For instance, the table shows that the average study time was 3.20 minutes for the immediate and 3.28 minutes for the delayed feedback conditions in the Retrieval 1 group. The table also shows that the 95% confidence intervals of difference were
[-0.34, 0.18] in the Retrieval 1 group.
Table 61
Study Time (Minutes) as a Function of Retrieval Frequency and Feedback Timing

                         Immediate feedback    Delayed feedback
Groups                   M        SD           M        SD           CI of difference
Retrieval 1 (n = 24)     3.20     0.63         3.28     0.51         [-0.34, 0.18]
Retrieval 3 (n = 24)     5.90     0.91         6.13     0.89         [-0.49, 0.03]
Retrieval 5 (n = 24)     9.00     1.01         8.99     0.81         [-0.25, 0.27]
Retrieval 7 (n = 26)     11.24    1.69         11.57    1.69         [-0.58, -0.08]
Total (n = 98)           7.41     3.28         7.58     3.32         [-0.29, -0.03]

Note. CI of difference = 95% confidence intervals of difference.
The study time was analysed by a two-way mixed design 2 (feedback timing: immediate / delayed) x 4 (group: Retrieval 1 / 3 / 5 / 7) ANOVA. The ANOVA showed a significant main effect of feedback timing, F (1, 94) = 5.86, p = .017, partial η2 = .06, Δ = 0.05. The interaction between the group and feedback timing was not significant, F (3, 94) = 1.35, p = .264, partial η2 = .04. The significant main effect of feedback timing means that when collapsed across the four retrieval frequency groups, delayed feedback had a longer study time than immediate feedback. However, only a very small effect size was observed (partial η2 = .06, Δ = 0.05), and the difference in the means was also small (7.41 minutes for immediate and 7.58 minutes for delayed feedback). The 95% confidence intervals of difference were also narrow [-0.29, -0.03]. These findings suggest that although statistically significant, the difference between the two feedback conditions in study time might not have been substantively (Kline, 2004) or practically significant (Kirk, 1996). The statistical significance may be partially due to the relatively large cell size (98; Kline, 2004; Norris & Ortega, 2000).
As the difference in study time was rather small, the efficiency scores (posttest score divided by the study time; see 5.6.1) were not calculated for the comparison of immediate and delayed feedback.
Next, let us investigate how much spacing intervened between the retrieval attempt and feedback in the delayed feedback condition. The retrieval attempt and delayed feedback were separated by 95.61 (45.76), 91.05 (21.32), 93.88 (15.89), and 90.72 (21.79) seconds on average in the Retrieval 1, 3, 5, and 7 groups, respectively (SDs in parentheses). When collapsed across the four groups, the mean interval duration was 92.77 (28.12) seconds. No statistically significant difference was found among the four groups in their mean ISIs, F (3, 97) = 0.17, p = .919, producing no effect size (η2 < .01). As the difference is relatively small, it may be possible to assume that the four retrieval frequency groups had roughly equivalent spacing between the retrieval attempt and delayed feedback.
Learning phase performance
Table 62 summarises the number of correct responses during the learning phase as a function of retrieval frequency and feedback timing. In order to determine whether any significant difference existed between the two types of feedback, four separate analyses were conducted. First, the number of correct responses under the immediate and delayed feedback conditions in the Retrieval 1 group was analysed by a paired t-test. The difference was not statistically significant, yielding a very small effect size, t (23) = 0.16, p = .873, r = .03. Second, for the Retrieval 3 group, the number of correct responses was submitted to a two-way mixed design 2 (feedback timing) x 3 (retrieval attempt: 1 / 2 / 3) ANOVA. Neither the main effect of feedback timing, F (1, 25) = 0.28, p = .602, partial η2 = .01, nor the interaction between feedback timing and the retrieval attempt was significant, F (2, 46) = 0.69, p = .508, partial η2 = .03. The results indicate that feedback timing might have had little effect on learning phase recall in the Retrieval 3 group.
Table 62
Number of Correct Responses During the Learning Phase
                                      Retrieval attempts
Groups                Feedback        1     2     3     4     5     6     7
Retrieval 1 (n = 24)  Immediate  M    2.58
                                 SD   1.59
                      Delayed    M    2.54
                                 SD   1.50
Retrieval 3 (n = 24)  Immediate  M    2.38  3.96  5.50
                                 SD   1.76  2.05  2.23
                      Delayed    M    2.63  4.13  5.46
                                 SD   1.81  2.05  2.11
Retrieval 5 (n = 24)  Immediate  M    2.71  4.13  5.71  6.71  7.46
                                 SD   1.57  1.92  1.99  1.57  1.22
                      Delayed    M    2.08  4.71  6.33  7.21  7.71
                                 SD   1.69  2.12  1.81  1.35  0.62
Retrieval 7 (n = 26)  Immediate  M    2.77  3.96  5.46  6.77  6.77  7.35  7.15
                                 SD   2.25  2.41  2.30  1.86  1.95  1.35  1.52
                      Delayed    M    2.77  4.73  5.77  6.73  7.12  7.62  7.42
                                 SD   2.41  2.59  2.23  1.78  1.24  1.10  0.99
Note. The maximum score is 8 for each cell. Responses were scored with the strict scoring procedure (see 2.2.11).
Third, the number of correct responses in the Retrieval 5 group was submitted to a two-way mixed design 2 (feedback timing) x 5 (retrieval attempt: 1 / 2 / 3 / 4 / 5)
ANOVA. The ANOVA showed a significant interaction between feedback timing and the retrieval attempt, F (4, 92) = 3.80, p = .007, partial η2 = .14. The main effect of feedback timing was not significant, F (1, 23) = 1.87, p = .185, partial η2 = .08. Due to the significant interaction, the simple main effect of feedback timing was tested to examine where the significant differences lay. The simple main effect of feedback timing was significant on the first, F (1, 23) = 4.53, p = .044, partial η2 = .16, Δ = 0.37, and third retrieval attempts, F (1, 23) = 4.53, p = .044, partial η2 = .16, Δ = 0.31. However, only small effect sizes were observed (0.31 < Δ < 0.37). The simple main effect of feedback timing was not significant on the rest of the retrievals: the second, F (1, 23) = 1.74, p = .200, partial η2 = .07, Δ = 0.30, fourth, F (1, 23) = 3.63, p = .069, partial η2 = .14, Δ = 0.32, and fifth retrievals, F (1, 23) = 1.68, p = .207, partial η2 = .07, Δ = 0.21, producing small effect sizes (0.21 < Δ < 0.32).
Fourth, for the Retrieval 7 group, the number of correct responses was analysed by a two-way mixed design 2 (feedback timing) x 7 (retrieval attempt: 1 / 2 / 3 / 4 / 5 / 6 / 7) ANOVA. Neither the main effect of feedback timing, F (1, 25) = 2.60, p = .119, partial η2 = .09, nor the interaction between feedback timing and the retrieval attempt was significant, F (3.21, 80.25) = 1.04, p = .383, partial η2 = .04. Overall, although statistically significant effects were found, given the small effect sizes, it might be reasonable to assume that feedback timing had little effect on learning phase performance irrespective of the retrieval frequency.
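The partial η2 values reported alongside the F ratios above can be recovered from each F value and its degrees of freedom, because partial η2 = SS_effect / (SS_effect + SS_error) = (F x df_effect) / (F x df_effect + df_error). The sketch below illustrates this standard identity with four of the results reported for the Retrieval 5 and 7 groups. It is offered as a reader's check, not as the procedure used in the original analysis, and the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its dfs."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (label, F, df_effect, df_error, reported partial eta squared),
# all taken from the learning phase analyses reported above.
reported = [
    ("Retrieval 5, timing x attempt", 3.80, 4, 92, 0.14),
    ("Retrieval 5, timing", 1.87, 1, 23, 0.08),
    ("Retrieval 7, timing", 2.60, 1, 25, 0.09),
    ("Retrieval 7, timing x attempt", 1.04, 3.21, 80.25, 0.04),
]

for label, f_value, df_effect, df_error, stated in reported:
    computed = partial_eta_squared(f_value, df_effect, df_error)
    print(f"{label}: computed {computed:.3f}, reported {stated:.2f}")
```

In each of these four cases, the computed value rounds to the reported one.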
In order to examine a possible relationship between the DRE and learning phase performance, the frequency of errors was analysed. (Note that errors here refer to incorrect responses and do not include blank responses.) The analysis showed that 36.2% (22.3%), 24.4% (17.6%), 18.1% (8.8%), and 16.6% (14.0%) of responses during the treatment were errors (SDs in parentheses) in the Retrieval 1, 3, 5, and 7 groups, respectively. The difference among the four groups was statistically significant, F (3, 97) = 7.28, p < .001, η2 = .04. The results suggest that repeated retrieval led to fewer errors as intended. This enables us to test the view that delayed feedback may be particularly effective when the treatment induces few errors (e.g., Metcalfe et al., 2009). The results also indicate that the proportion of errors in this study (16.6% - 36.2%) was lower than in Metcalfe et al.'s (2009) experiments (Experiment 1: 40%, Experiment 2: 61%). Hence, this study might be expected to produce a larger DRE than Metcalfe et al. (2009).
Posttest performance
Table 63 and Table 64 summarise the immediate and delayed posttest results as a function of feedback timing and retrieval frequency. The productive and receptive test scores were analysed by a three-way mixed design 2 (feedback timing) x 4 (retrieval frequency group) x 3 (RI) ANOVA (see 5.6.1). As some items were answered correctly on the receptive pretest (Table 51), the pretest scores were subtracted from the posttest scores, and gains were analysed when examining the receptive test results. Table 54 in 5.6.1 shows the results of the ANOVAs. The table indicates the following three things regarding the effects of feedback timing. First, the main effect of feedback timing was not significant on either the productive or receptive posttest regardless of the scoring procedure. This suggests that the timing of feedback had little effect on vocabulary learning when collapsed across the four retrieval frequency groups and the RIs. Second, the interaction between feedback timing and the RI was significant on the receptive posttest with both strict and sensitive scoring, but not on the productive posttest irrespective of the scoring system. Third, none of the other interactions involving feedback timing were significant on any of the dependent variables.
Table 63
Number of Correct Responses on the Productive Posttest
                                                          RI
                                       Immediate     1 week        4 weeks       Collapsed across the RIs
Scoring    Groups                      IMFB  DLFB    IMFB  DLFB    IMFB  DLFB    IMFB  DLFB
Strict     Retrieval 1 (n = 24)  M     4.83  4.63    2.33  2.08    2.00  1.88    3.06  2.86
                                 SD    2.28  2.12    2.04  2.00    1.91  1.96    1.79  1.67
           Retrieval 3 (n = 24)  M     5.25  5.58    2.21  2.46    2.04  2.00    3.17  3.35
                                 SD    2.44  2.10    1.84  2.34    1.83  1.72    1.76  1.82
           Retrieval 5 (n = 24)  M     7.75  7.75    4.08  3.75    3.75  3.92    5.19  5.14
                                 SD    0.44  0.61    1.93  1.82    2.36  2.38    1.38  1.33
           Retrieval 7 (n = 26)  M     7.58  7.65    4.73  4.73    3.69  4.12    5.33  5.50
                                 SD    1.03  0.80    2.09  2.20    2.20  2.16    1.37  1.51
           Total (n = 98)        M     6.38  6.43    3.37  3.29    2.89  3.00    4.21  4.24
                                 SD    2.17  2.05    2.24  2.33    2.23  2.29    1.90  1.94
Sensitive  Retrieval 1 (n = 24)  M     5.71  5.63    3.25  2.88    2.79  2.92    3.92  3.81
                                 SD    2.22  2.14    2.27  2.31    2.26  2.43    1.92  1.85
           Retrieval 3 (n = 24)  M     5.92  6.46    2.75  3.33    2.79  3.13    3.82  4.31
                                 SD    2.28  1.82    2.15  2.35    1.98  1.80    1.80  1.76
           Retrieval 5 (n = 24)  M     7.92  7.83    5.63  5.04    5.13  5.42    6.22  6.10
                                 SD    0.28  0.48    1.95  2.16    2.07  2.32    1.25  1.46
           Retrieval 7 (n = 26)  M     7.81  7.81    5.65  5.42    5.31  5.31    6.26  6.18
                                 SD    0.63  0.57    2.10  1.98    2.43  2.20    1.37  1.37
           Total (n = 98)        M     6.86  6.95    4.35  4.19    4.03  4.21    5.08  5.12
                                 SD    1.89  1.70    2.48  2.43    2.48  2.47    1.98  1.91
Note. The maximum score is 8 for each cell. IMFB = immediate feedback condition; DLFB = delayed feedback condition.
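The "Collapsed across the RIs" columns in Table 63 can be related to the cell means in a simple way. The sketch below assumes that each collapsed mean is the unweighted average of the immediate, 1-week, and 4-week means, which should hold because every participant sat all three posttests. It uses the strict scoring means for the four groups, and the variable and function names are hypothetical. Because the cell means are rounded to two decimal places, agreement is only expected to within about 0.01.

```python
# Strict scoring cell means from Table 63, ordered as
# (immediate, 1-week, 4-week) posttest for each feedback condition.
cell_means = {
    ("Retrieval 1", "IMFB"): (4.83, 2.33, 2.00),
    ("Retrieval 1", "DLFB"): (4.63, 2.08, 1.88),
    ("Retrieval 3", "IMFB"): (5.25, 2.21, 2.04),
    ("Retrieval 3", "DLFB"): (5.58, 2.46, 2.00),
    ("Retrieval 5", "IMFB"): (7.75, 4.08, 3.75),
    ("Retrieval 5", "DLFB"): (7.75, 3.75, 3.92),
    ("Retrieval 7", "IMFB"): (7.58, 4.73, 3.69),
    ("Retrieval 7", "DLFB"): (7.65, 4.73, 4.12),
}

# Collapsed means as printed in Table 63.
reported_collapsed = {
    ("Retrieval 1", "IMFB"): 3.06, ("Retrieval 1", "DLFB"): 2.86,
    ("Retrieval 3", "IMFB"): 3.17, ("Retrieval 3", "DLFB"): 3.35,
    ("Retrieval 5", "IMFB"): 5.19, ("Retrieval 5", "DLFB"): 5.14,
    ("Retrieval 7", "IMFB"): 5.33, ("Retrieval 7", "DLFB"): 5.50,
}


def collapse_across_ris(means):
    """Unweighted average of the means at the three RIs."""
    return sum(means) / len(means)


max_gap = 0.0
for key, means in cell_means.items():
    computed = collapse_across_ris(means)
    gap = abs(computed - reported_collapsed[key])
    max_gap = max(max_gap, gap)
    print(f"{key[0]} {key[1]}: computed {computed:.3f}, "
          f"reported {reported_collapsed[key]:.2f}")

print(f"Largest discrepancy: {max_gap:.3f}")
```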
Table 64
Number of Correct Responses on the Receptive Posttest
                                                          RI
                                       Immediate     1 week        4 weeks       Collapsed across the RIs
Scoring    Groups                      IMFB  DLFB    IMFB  DLFB    IMFB  DLFB    IMFB  DLFB
Strict     Retrieval 1 (n = 24)  M     5.54  5.88    5.21  4.88    4.75  4.83    5.17  5.19
                                 SD    1.93  1.94    2.06  2.19    1.92  1.79    1.89  1.75
           Retrieval 3 (n = 24)  M     5.63  5.75    5.13  5.00    4.71  4.88    5.15  5.21
                                 SD    2.18  1.87    2.15  2.41    1.92  2.25    1.78  1.99
           Retrieval 5 (n = 24)  M     7.17  7.17    6.96  6.54    6.96  6.67    7.03  6.79
                                 SD    0.82  1.01    1.08  1.18    1.12  1.49    0.80  1.06
           Retrieval 7 (n = 26)  M     7.23  7.50    7.15  7.08    6.54  6.81    6.97  7.13
                                 SD    1.07  0.81    0.97  1.23    1.88  1.83    0.97  1.14
           Total (n = 98)        M     6.41  6.59    6.13  5.90    5.76  5.82    6.10  6.10
                                 SD    1.77  1.65    1.88  2.04    2.00  2.06    1.69  1.75
Sensitive  Retrieval 1 (n = 24)  M     5.67  6.17    5.29  5.04    4.88  5.00    5.28  5.40
                                 SD    1.97  1.88    2.07  2.26    1.98  1.84    1.93  1.73
           Retrieval 3 (n = 24)  M     5.71  6.08    5.13  5.08    4.75  4.96    5.19  5.38
                                 SD    2.24  1.98    2.15  2.41    1.96  2.29    1.80  2.03
           Retrieval 5 (n = 24)  M     7.50  7.50    7.13  7.00    7.00  6.92    7.21  7.14
                                 SD    0.72  0.78    0.95  1.18    1.06  1.35    0.68  0.94
           Retrieval 7 (n = 26)  M     7.62  7.65    7.31  7.23    6.69  7.00    7.21  7.29
                                 SD    0.64  0.69    0.84  1.24    1.87  1.90    0.88  1.16
           Total (n = 98)        M     6.64  6.87    6.23  6.11    5.85  5.99    6.24  6.32
                                 SD    1.79  1.60    1.88  2.10    2.02  2.10    1.71  1.76
As the interaction between feedback timing and the RI proved significant on the receptive posttest, the simple main effect of feedback timing was tested to investigate where the significant differences lay. The results of the simple main effect tests are summarised in Table 65. The table shows that when collapsed across the four retrieval frequency groups, immediate feedback significantly outperformed delayed feedback with strict scoring on the 1-week delayed receptive posttest (p = .031). However, despite statistical significance, only a very small effect size was observed (Δ = 0.13), and the difference in the mean gains between immediate (6.09) and delayed feedback (5.83) was small. The finding is also supported by the relatively narrow confidence intervals of difference: [0.03, 0.51]. The simple main effect of feedback timing was not significant in all other cases (p > .123), yielding very small effect sizes (0.02 < Δ < 0.11). Overall, although a statistically significant effect was found, given the small effect size and narrow confidence intervals of difference, it might be reasonable to assume that feedback timing had little effect on posttest results irrespective of the retrieval frequency or RI. The statistical significance may be partially due to the relatively large cell size (98; Kline, 2004; Norris & Ortega, 2000).
Table 65
Results of Simple Main Effect of Feedback Timing on the Receptive Posttest
            Strict scoring                                     Sensitive scoring
RI          F     p     partial η2  Δ     CI of diff           F     p     partial η2  Δ     CI of diff
Immediate   1.52  .221  .02         0.09  [-0.40, 0.10]        2.43  .123  .02         0.11  [-0.44, 0.05]
1 week      4.81  .031  .05         0.13  [0.03, 0.51]         1.61  .208  .02         0.07  [-0.08, 0.40]
4 weeks     0.05  .827  .00         0.02  [-0.30, 0.25]        0.71  .403  .01         0.06  [-0.37, 0.16]
Note. df = (1, 97). CI of diff = 95% confidence intervals of difference.
It should be noted that the immediate posttest scores of the Retrieval 5 and 7 groups neared the ceiling in both feedback conditions (Table 63 and Table 64). Hence, the lack of a significant feedback timing effect in these two groups on the immediate posttest may be partly ascribed to a possible ceiling effect. On the delayed posttests, however, neither a ceiling nor floor effect was observed. The lack of statistical significance on the delayed posttest, therefore, seems to indicate that feedback timing may have little effect on long-term retention.
Questionnaire
In the questionnaire given after the immediate posttest, the participants were asked to evaluate the effectiveness of immediate and delayed feedback on a 7-point scale, where 1 means ‘I learned more with immediate feedback’ and 7 means ‘I learned more with delayed feedback’ (see 5.5.8). The average rating (SDs in parentheses) was 3.17 (2.12), 4.25 (1.78), 4.54 (2.04), and 3.46 (1.90) out of 7 for the Retrieval 1, 3, 5, and 7 groups, respectively. When collapsed across the four groups, the average rating was 3.85 (SD = 2.01). No significant difference existed among the four groups in their responses, F (3, 96) = 2.63, p = .054, η2 = .01. The results indicate that (a) learners tended to believe that they learned equally well from the two types of feedback, and (b) the four groups did not differ significantly from each other in their responses. The results appear to be consistent with the posttest results, where feedback timing was found to have little effect on vocabulary learning irrespective of the retrieval frequency.
5.7 Discussion
Effects of retrieval frequency
The first purpose of this study was to identify the optimal retrieval frequency in flashcard learning. The posttest results showed that the Retrieval 5 and 7 groups significantly outperformed the Retrieval 1 and 3 groups regardless of the posttest, scoring system, and RI, producing large effect sizes (0.88 < d < 2.04). Although the Retrieval 7 group was as effective as the Retrieval 5 group, five retrievals may be preferable to seven because the latter increased the study time by 27% compared to the former (17.98 minutes vs. 22.82 minutes; see 5.6.1). The results of the questionnaire also indicated that five retrievals may be desirable in terms of learners’ motivation. The questionnaire showed that the participants perceived five retrievals to be significantly more effective than one and three. Five retrievals, therefore, may have a more positive effect on learners’ motivation than fewer retrievals, which may be considered ineffective by learners. At the same time, the efficiency scores showed that the Retrieval 1 group was the most efficient among the four groups. Hence, if learners do not have enough time, it may be efficient to practise retrieval only once. Of course, the retrieval frequency needed for acquisition to occur may be affected by a number of factors such as the difficulty of target words or learners’ memory capacity, a question that awaits future research.
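The trade-off between effectiveness and efficiency described above can be made concrete with a short calculation. The sketch below first reproduces the 27% increase in study time from the two reported study times, and then applies the efficiency score (posttest score divided by the study time) to two purely hypothetical learners. The scores and study times in the second part are invented for illustration only and are not data from this study. The function names are also hypothetical.

```python
def percent_increase(baseline, comparison):
    """Relative increase of comparison over baseline, in percent."""
    return (comparison - baseline) / baseline * 100


def efficiency(posttest_score, study_minutes):
    """Efficiency score: posttest score divided by study time."""
    return posttest_score / study_minutes


# Reported study times (minutes) for the Retrieval 5 and 7 groups.
extra_time = percent_increase(17.98, 22.82)
print(f"Seven retrievals took {extra_time:.0f}% longer than five")

# HYPOTHETICAL values, for illustration only: a brief treatment with
# a low score versus a long treatment with a high score.
brief = efficiency(posttest_score=3.0, study_minutes=5.0)
extended = efficiency(posttest_score=6.0, study_minutes=18.0)
print(f"Brief treatment:    {brief:.2f} items per minute")
print(f"Extended treatment: {extended:.2f} items per minute")
```

In this hypothetical case, the extended treatment yields the higher score but the lower efficiency, which mirrors the pattern reported for the Retrieval 1 group.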
The findings of this study appear to be inconsistent with Rohrer et al. (2005). In their Experiment 1, although the Retrieval 20 group significantly outperformed the Retrieval 5 group 1 week after the treatment, there was no statistically significant difference between the two groups on the 3- and 9-week delayed posttests. Similarly, in their second experiment, while the Retrieval 10 group fared significantly better than the Retrieval 5 group 1 week after the treatment, no statistically significant difference existed between the two groups on the 4-week delayed posttest (see 5.1). Based on these findings, Rohrer et al. argue that the benefits of repeated retrieval may be short lived. The present study, however, indicated that the advantage of repeated retrieval may persist at least 4 weeks after the treatment. Based on the results of this study, it may be useful to reconsider the value of repeated retrieval for flashcard learning.
The inconsistent findings between Rohrer et al. (2005) and the present study may be partially due to at least two differences in the methodology. First, while the present study looked into L2 vocabulary learning, Rohrer et al. investigated the learning of city-country pairs (e.g., Chiba - Japan; Experiment 1) or L1 word pairs (e.g., acrogen – fern; Experiment 2). Because of this difference, Rohrer et al.’s findings may not necessarily be applicable to this study. Second, although Rohrer et al. examined the
effects of only five or more retrievals (five and 20 in Experiment 1 and five and 10 in Experiment 2), the present study used frequency levels that are lower than five (one and three) as well as five or higher (five and seven). As a result, the current study might have allowed us to obtain a more comprehensive picture regarding the effects of retrieval frequency than Rohrer et al. (2005). If Rohrer et al. had also used frequency levels that are lower than five, they might have found a significant advantage of repeated retrieval at an RI greater than 1 week.
Next, let us consider how the posttest scores changed as a function of the RI. On the productive posttest, when collapsed across the four retrieval frequency groups, the immediate posttest scores were significantly higher than both the 1- and 4-week delayed posttest scores, producing large effect sizes (1.14 < Δ < 1.61). Yet, the differences between the 1-week and 4-week RIs were smaller, and only very small effect sizes were observed (0.06 < Δ < 0.18). On the receptive posttest, when collapsed across the four groups, the immediate test scores were also significantly higher than both the 1-week and 4-week delayed test scores, producing small effect sizes (0.26 < Δ < 0.42). The differences between the 1- and 4-week RIs, however, were not statistically significant (p > .052), and only very small effect sizes were found (0.12 < Δ < 0.13). The results appear to be consistent with the finding that most forgetting occurs immediately after learning (e.g., Anderson & Jordan, 1928; Bahrick, 1984; Cepeda et al., 2008; Rawson & Dunlosky, 2011; Rohrer et al., 2005, Experiment 1; Seibert, 1927, 1930, 1932). At the same time, they are at odds with
Rohrer et al.’s (2005) Experiment 2. In their experiment, the posttest scores dropped significantly between the 1-week (Retrieval 5: 38%; Retrieval 10: 64%) and 4-week RIs (Retrieval 5: 18%; Retrieval 10: 22%).
The inconsistent findings between Rohrer et al. (2005, Experiment 2) and the current study may be partly ascribed to a difference in the experimental design. In the present study, the RI was a within-participant variable, and each participant sat posttests at three RIs: immediately, 1 week, and 4 weeks after the treatment. As correct responses in the productive test were used as cues in the receptive test, and correct responses in the receptive test were used as cues in the productive test, earlier tests might have affected performance on later tests, possibly diminishing a potential difference between the 1-week and 4-week RIs. In Rohrer et al. (2005), in contrast, the RI was a between-participant variable, and participants took either the 1- or 4-week delayed posttest (see 5.1). Because participants were not exposed to the target materials during the period between the treatment and 4-week delayed posttest, the differences between the 1- and 4-week RIs might have been larger in Rohrer et al. than in the present study.
The present study also indicated that the RI may have differential effects on the productive and receptive test scores. On both productive and receptive posttests, the immediate posttest scores were significantly higher than the 1-week and 4-week delayed posttest scores. However, although large effect sizes were found on the
productive posttest between the immediate and both delayed posttests (1.14 < Δ < 1.61), only small effect sizes were observed on the receptive test (0.26 < Δ < 0.42). The results may be partially due to the test order. Because the productive posttest was given prior to the receptive test at each test administration (see 5.5.7), the productive test might have affected performance on the receptive test, possibly diminishing a potential difference among the three RIs on the receptive test.
The posttest results showed that regardless of the posttest, scoring system, and RI, the Retrieval 5 and 7 groups significantly outperformed the Retrieval 1 and 3 groups, producing large effect sizes (0.88 < d < 2.04). However, no statistically significant difference existed between the Retrieval 1 and 3 groups on one hand and between the Retrieval 5 and 7 groups on the other, yielding no or small effect sizes (d < 0.45). The results might suggest that learning may not occur incrementally but in an all-or-none fashion (the one-trial learning controversy; see Estes, 1960; Rock, 1957; Underwood & Keppel, 1962).
Effects of feedback timing
The second purpose of this study was to identify the optimal feedback timing in flashcard learning. Unlike some previous studies, the immediate and delayed feedback conditions in this study were controlled for lag to test. In order to examine a possible relationship between the DRE and the frequency of errors during the treatment, retrieval frequency was also manipulated. The results of this study suggested that when lag to test is controlled, feedback timing may have little effect on vocabulary learning regardless of the frequency of errors during the treatment. Although the experimental settings in this study were closer to those in laboratory studies, which are more likely to find the superiority of delayed over immediate feedback than classroom studies (Butler et al., 2007; Kulik & Kulik, 1988; Roediger et al., 2010), a significant DRE was not observed. Taken together, the results suggest that the benefits of delayed feedback at the intervals used in this study may be limited as far as L2 vocabulary learning is concerned. Pedagogically, the results imply that when learning from flashcards, learners may study with either immediate or delayed feedback. The finding may translate well to paper-based flashcard learning, where delayed feedback may be relatively hard to implement. Immediate feedback may be used in paper-based flashcard learning because it may be as effective as delayed feedback and easier to implement manually. The use of immediate feedback may also have a positive effect on learners’ motivation because learners seem to prefer immediate to delayed feedback (Karpicke, Smith, et al., 2009; see 5.2).
The present study also indicates that in flashcard learning, retrieval frequency might be a more important factor than the number of encounters with target words. Note that delayed feedback increased the number of encounters relative to immediate feedback at least for correctly recalled items. This is because successful retrieval followed by delayed feedback may be considered as two separate encounters with a target word, whereas in the immediate feedback condition, the retrieval and feedback opportunities are massed into one study event. Despite the increased number of encounters, however, delayed feedback failed to significantly outperform immediate feedback. The results, together with those of earlier studies (e.g., Karpicke & Roediger, 2007b, 2008; Karpicke, 2009), suggest that retrieval frequency may have a larger effect on learning than the number of encounters.
One caveat to be considered is that the retrieval frequency was confounded with the study time in the current study, whereas the number of encounters was not. In other words, although the Retrieval 5 and 7 groups significantly outperformed the Retrieval 1 and 3 groups, the study time for the former was also significantly longer (see 5.6.1). As a result, superior learning in the Retrieval 5 and 7 groups may be partly attributed to increased study time rather than repeated retrieval per se. In contrast, although delaying feedback increased the number of encounters, the study time was roughly equivalent in the immediate and delayed feedback conditions (see Table 61). If the delayed feedback condition had also involved a longer study time than the immediate feedback condition, the former might have outperformed the latter.
Let us now consider the theoretical implications of this study. As discussed in 5.2, there exist conflicting views about the effectiveness of immediate and delayed feedback. On one hand, delayed feedback is considered more effective because it may introduce larger spacing as well as cause less interference than immediate feedback. On the other hand, according to the theory of errorless learning, feedback needs to be
given immediately after retrievals because otherwise, learners’ errors might be consolidated.
The present study did not find any significant difference between immediate and delayed feedback. The lack of a significant feedback timing effect may have arisen because the beneficial effects of delaying feedback (larger spacing and less interference) were offset by the risk of not correcting an error immediately (see 5.2). It should be noted, however, that a significant DRE was not observed in this study although the proportion of errors (16.6% - 36.2%) was lower than in Metcalfe et al.'s (2009) experiments (40% - 61%). The interaction between feedback timing and the retrieval frequency group was not significant either although repeated retrieval was associated with fewer errors. The finding seems to be inconsistent with the observation that when the treatment induces only a few errors, delayed feedback may be more effective because of the distributed practice effect (Metcalfe et al., 2009; see 5.2). The results may suggest that the findings of Metcalfe et al.'s experiments may be at least partly attributed to the difference in the age of participants (i.e., grade school vs. college students) rather than the difference in the frequency of errors per se.
Alternatively, the conflicting results might be due in part to at least two methodological differences between the present study and Metcalfe et al. (2009). First, while Metcalfe et al. investigated the learning of L1 vocabulary, the present study looked into L2 vocabulary learning. Second, although delayed feedback was given
after a delay of 1 day or longer in Metcalfe et al., it was provided 92.77 seconds on average after the response in this study (see 5.6.2). Because larger spacing generally leads to better long-term retention than shorter spacing (lag effect; see 4.1.1), a significant DRE might have been observed if the delayed feedback condition had used a much longer ISI. Future research may provide delayed feedback after a longer delay in order to explore this possibility. It should be noted, however, that some previous studies have found a significant DRE using a much shorter delay than in this study (see Kulik & Kulik, 1988, for a review). Carpenter and Vul (2011), for instance, provided delayed feedback 3 seconds after the response and found the advantage of delayed over immediate feedback.
5.8 Limitations
One limitation of the present study may be a possible ceiling effect on the immediate posttests in the Retrieval 5 and 7 groups (see 5.6.1 and 5.6.2). The lack of statistical significance between the two groups as well as the nonsignificant feedback timing effect in these two groups on the immediate posttest may be partly ascribed to a possible ceiling effect. Second, the RI was a within-participant variable in this study, and each participant sat posttests at three RIs: immediately, 1 week, and 4 weeks after the treatment. Because correct responses in the productive test were used as cues in the receptive test, and correct responses in the receptive test were used as cues in the productive test, earlier tests might have affected performance on later tests. In future research, in order to reduce potential learning effects from posttests, the RI may be manipulated between participants (e.g., Karpicke & Roediger, 2007a; Logan & Balota, 2008; Rohrer et al., 2005). Third, the lack of a significant feedback timing effect in the current study may be partly ascribed to the rather short interval between the retrieval attempt and delayed feedback (92.77 seconds on average; see 5.6.2). A significant DRE might have been observed if the delayed feedback condition had used a much longer ISI. Future research may provide delayed feedback after a longer delay in order to explore this possibility.
Chapter 6. GENERAL DISCUSSION AND CONCLUSIONS
The present chapter will summarise the findings of Studies 1-4 and discuss the optimal way to learn from flashcards. The chapter will also consider the implications of this thesis when taken as a whole. Finally, it will present the limitations of the present thesis and discuss directions for further research.
6.1 Review of the Findings
6.1.1 Study 1
Study 1 (Chapter 2) examined the effects of block size (BS) on L2 vocabulary learning. Study 1 consisted of two experiments: Experiments 1A and 1B. Experiment 1A compared the effects of BSs of four, 10, and 20 words. Unlike previous studies, the three BSs were matched in spacing. Experiment 1A indicated that there may be little difference among the three BSs in their effectiveness. Experiment 1B compared the following three treatments: a BS of 20 words (BS 20 treatment), a BS of four words with spacing equivalent to that of the BS of 20 words (BS 4 treatment), and a BS of four words with shorter spacing than the BS of 20 words (control treatment). Experiment 1B demonstrated the superiority of the BS 4 and 20 treatments over the control.
Taken together, Experiments 1A and 1B indicate that (a) as long as spacing is equivalent, BS may have little effect on learning (hence, BS 4 = 10 = 20 in Experiment 1A and BS 4 = 20 in Experiment 1B), and (b) spacing may have a larger
effect on learning than BS (hence, BS 4 = 20 > control in Experiment 1B). The findings are significant because they suggest that the results of the earlier studies may be at least partly attributed to spacing rather than BS per se. Pedagogically, the findings indicate that (a) introducing a large amount of spacing between encounters may be more important than using a particular BS, and (b) there is no magic BS and learners should study with what they are comfortable with. Researchers, teachers, learners, and materials developers seem to believe that a small BS may enhance learning more than a large one (Joseph et al., 2009; Kornell, 2009; Salisbury & Klein, 1988; Van Bussel, 1994; Waring, 2004; Wissman et al., 2012; Woodworth & Schlosberg, 1954). Based on the results of Study 1, however, it may be useful to pay more attention to spacing rather than BS.
Although BS was found to have little effect on posttest performance, Study 1 nonetheless suggested possible advantages and disadvantages of using different BSs. One benefit of a small BS may be that it might increase learning phase performance and thus motivate learners. A disadvantage of using a small BS may be that it may possibly lead to under-learning. That is, a high probability of retrieval success caused by a small BS may create what Kornell (2009) refers to as ‘an illusion of effective learning’ (p. 1302), and learners may stop studying before lexical items are actually acquired. Ideal flashcard software would keep a record of the learner’s performance on individual items and ensure that under-learning would not occur.
6.1.2 Study 2
Study 2 (Chapter 3) investigated the effects of retrieval formats on flashcard learning. The following four conditions were compared: recognition, recall, combined, and highest difficulty. The pacing of feedback was also manipulated. For the computer-paced group, the feedback duration was fixed to 5 seconds per response. The self-paced group was allowed to close the feedback window before 5 seconds elapsed. The results of Study 2 suggested that (a) for the acquisition of knowledge of orthography, L2 words may need to be practised in a productive recall format at least twice, (b) for the acquisition of form-meaning connections, recognition formats may be more desirable than recall, and (c) computer-paced feedback may improve both learning phase and posttest performance compared with self-paced feedback.
One of the most interesting findings in Study 2 was that it failed to find any advantage of recall on the recognition posttest. The result may be surprising because it is at odds with the existing studies on reading, which showed the superiority of recall over recognition on a recognition posttest (Foos & Fisher, 1988; Glover, 1989; Kang et al., 2007; McDaniel et al., 2007). The finding may be particularly valuable because existing flashcard programs seem to offer limited capabilities regarding multiple-choice exercises (see 3.6). Based on the results of Study 2, it may be useful to reconsider the value of recognition for vocabulary learning when designing flashcard software or vocabulary exercises. Study 2 also suggested that computer-paced feedback might be more effective than self-paced feedback. The
result may have important pedagogical implications because self-paced feedback seems to be a common feature among existing flashcard programs (see 3.6). For paper-based flashcard learning, the finding indicates that it may be useful to raise awareness that spending several extra seconds viewing the correct response may potentially increase learning.
6.1.3 Study 3
Study 3 (Chapter 4) examined the effects of absolute and relative spacing on L2 vocabulary learning. Results suggested that relative spacing may have little effect on learning regardless of absolute spacing or the RI. The main effect of absolute spacing, however, was significant. Multiple comparisons showed that massed learning may be the least effective regardless of the posttest, scoring system, or RI. Contrary to the predictions of the lag effect (e.g., Bird, 2010; Cepeda et al., 2006, 2008, 2009; Pashler et al., 2007; Rohrer & Pashler, 2007; see 4.1.1), no statistically significant difference existed among the three experimental groups (short, medium, and long spacing) in their effectiveness. A unique contribution of Study 3 is that the results indicated that when feedback is provided, expanding spacing may not necessarily enhance L2 vocabulary learning even though (a) the treatment involves long absolute spacing, (b) the task difficulty is high, and (c) the RI is shorter than 24 hours, the conditions in which the expanding retrieval effect has traditionally been observed (e.g., Balota et al., 2007; Cull et al., 1996; Dobson, 2012; Karpicke & Roediger, 2007a; Logan & Balota, 2008; Maddox et al., 2011; Storm et al., 2010).
359
Although many applied linguists as well as psychologists regard expanding spacing as the most effective relative spacing schedule (e.g., Baddeley, 1997, pp. 112–114; Ellis, 1995; Hulstijn, 2001; Nation, 2001, pp. 76–77; Roediger & Karpicke, 2010; Schmitt & Schmitt, 1995; Schmitt, 2000, 2007), Study 3 found little difference between equal and expanding spacing in their effectiveness. The result seems to imply that when learning from flashcards, learners may study with either equal or expanding spacing. Equal spacing may be used in paper-based flashcard learning because it may be as effective as expanding spacing and easier to implement manually.
Another implication of Study 3 is that it may be useful to introduce spacing between encounters because massed learning turned out to be the least effective regardless of the posttest, scoring system, or RI. At the same time, no significant difference existed among the three experimental groups in their posttest scores. The result is incongruent with the lag effect (see 4.1.1), according to which larger absolute spacing (e.g., an ISI of 10 days) generally leads to better long-term retention than shorter spacing (e.g., an ISI of 1 day). The finding may suggest that depending on the range of ISIs, larger absolute spacing does not necessarily lead to better long-term retention than shorter spacing. In other words, although an ISI of 1 day may be significantly more effective than an ISI of 5 minutes (Cepeda et al., 2009, Experiment 1), for instance, there may be only a small difference between the effects of 5-minute and 10-minute ISIs (see Crothers & Suppes, 1967, Experiment 11; Logan & Balota, 2008; Pyc & Rawson,
2007, for similar findings).
Although no statistically significant difference existed among short, medium, and long spacing, medium spacing may be the most desirable among the three experimental treatments because (a) it may be slightly more effective than short spacing, and (b) it may increase learning phase performance compared with long spacing and thus motivate learners. If we assume that medium spacing was the most desirable, a prescriptive conclusion may be that learners should use a mean ISI of around 2 minutes (117.00 seconds in medium spacing) in flashcard learning. Together with the findings related to relative spacing, the results suggest that absolute spacing may have a larger effect on L2 vocabulary learning than relative spacing (Karpicke & Bauernschmidt, 2011). Pedagogically, the findings indicate that introducing a large amount of spacing may be more important than gradually increasing spacing.
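The distinction between absolute spacing (the mean ISI) and relative spacing (how the intervals are distributed around that mean) can be illustrated with a minimal Python sketch. It assumes three intervals between four encounters and a linearly expanding pattern (weights 1, 2, 3); both are illustrative assumptions rather than the schedules actually used in Study 3. Only the 117.00-second mean ISI is taken from the medium spacing treatment.

```python
def equal_schedule(mean_isi, n_intervals):
    """Equal spacing: every interval equals the mean ISI."""
    return [float(mean_isi)] * n_intervals


def expanding_schedule(mean_isi, n_intervals):
    """Expanding spacing with the same mean ISI.

    Intervals grow in proportion to 1, 2, ..., n (a hypothetical pattern)
    and are scaled so that their mean matches mean_isi.
    """
    weights = range(1, n_intervals + 1)
    total = mean_isi * n_intervals
    weight_sum = sum(weights)
    return [total * w / weight_sum for w in weights]


def mean(values):
    return sum(values) / len(values)


MEDIUM_MEAN_ISI = 117.00  # seconds, medium spacing in Study 3

equal = equal_schedule(MEDIUM_MEAN_ISI, 3)
expanding = expanding_schedule(MEDIUM_MEAN_ISI, 3)
print(equal)      # identical intervals
print(expanding)  # increasing intervals with the same mean
```

Under these assumptions the two schedules differ only in relative spacing, because their absolute spacing (mean ISI) is held constant, which is the kind of contrast that Study 3 was designed to isolate.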
6.1.4 Study 4
Study 4 (Chapter 5) investigated the effects of retrieval frequency and feedback timing on flashcard learning. The posttest results showed that the Retrieval 5 and 7 groups significantly outperformed the Retrieval 1 and 3 groups regardless of the posttest, scoring system, and RI. Although no significant difference existed between the Retrieval 5 and 7 groups, five retrievals may be preferable to seven because the latter increased the study time by 27% compared to the former (17.98 minutes vs. 22.82 minutes). At the same time, the efficiency scores showed that the Retrieval 1 group was the most efficient among the four groups.
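The 27% figure follows directly from the two mean study times reported above. A minimal Python check of the arithmetic (the function and constant names are illustrative only):

```python
RETRIEVAL_5_MINUTES = 17.98  # mean study time, Retrieval 5 group
RETRIEVAL_7_MINUTES = 22.82  # mean study time, Retrieval 7 group


def percent_increase(baseline, new):
    """Relative increase of new over baseline, in percent."""
    return (new - baseline) / baseline * 100


increase = percent_increase(RETRIEVAL_5_MINUTES, RETRIEVAL_7_MINUTES)
print(round(increase))  # 27
```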
Pedagogically, the findings indicate that (a) it may be most desirable to practise retrieval five times, and (b) if learners do not have enough time, it may be efficient to practise retrieval only once. One significant finding of Study 4 was that it demonstrated the advantage of repeated retrieval 4 weeks after the treatment. The finding is valuable because previous research has shown that the benefits of repeated retrieval may be short-lived (Rohrer et al., 2005). Based on the results of Study 4, it may be useful to reconsider the value of repeated retrieval for flashcard learning.
Although some non-language learning studies have found a significant DRE (delay-retention effect), or the advantage of delayed over immediate feedback (e.g., Butler et al., 2007; Carpenter & Vul, 2011; Kulik & Kulik, 1988; Mory, 2004; Roediger et al., 2010), delaying feedback did not significantly increase learning in Study 4. A significant DRE was not observed although the proportion of errors (16.6%–36.2%) in Study 4 was lower than in Metcalfe et al.'s (2009) experiments (40%–61%). The interaction between feedback timing and the retrieval frequency group was not significant either, although repeated retrieval was associated with fewer errors. The finding seems to be inconsistent with the observation that when the treatment induces only a few errors, delayed feedback may be more effective than immediate feedback (Metcalfe et al., 2009). The results seem to suggest that the benefits of delayed feedback at the intervals used in this study may be limited as far as L2 vocabulary learning is concerned.
Pedagogically, the results of Study 4 indicate that learners may study with either immediate or delayed feedback in flashcard learning. Immediate feedback may be used in paper-based flashcard learning because it may be as effective as delayed feedback and easier to implement manually. Another implication of Study 4 is that in flashcard learning, retrieval frequency might be a more important factor than the number of encounters with target words. As pointed out in 5.7, delayed feedback increased the number of encounters relative to immediate feedback at least for correctly recalled items. This is because successful retrieval followed by delayed feedback may be considered as two separate encounters with a target word, whereas in the immediate feedback condition, the retrieval and feedback opportunities are massed into one study event. Despite the increased number of encounters, however, delayed feedback failed to significantly outperform immediate feedback. The results, together with those of earlier studies (e.g., Karpicke & Roediger, 2007b, 2008; Karpicke, 2009), suggest that retrieval frequency may have a larger effect on learning than the number of encounters.
6.2 Overall Discussion
Now let us consider the implications of this thesis when taken as a whole. First, the present thesis demonstrated the importance of spacing in flashcard learning. The results of Study 1 suggested that spacing may have a larger effect on learning than BS.
In Study 3, massed learning was the least effective regardless of the posttest, scoring system, or RI. Medium and long spacing were more than twice as effective as massed learning on the delayed productive posttest. These findings underscore the importance of spacing in L2 vocabulary learning. The results may have important pedagogical implications because although research shows that spacing may have a large effect on learning (e.g., Cepeda et al., 2006, 2009; Dempster, 1989; Janiszewski et al., 2003), its benefits have not been exploited fully in traditional instructional settings (Cepeda et al., 2009; Dempster, 1988; Rohrer et al., 2005; Sobel et al., 2011). Furthermore, learners are often unaware that spacing increases learning (Hartwig & Dunlosky, 2011; Kornell, 2009; Son & Simon, 2012; Wissman et al., 2012). Based on the results of the present research, it may be useful to raise awareness of the importance of spacing.
At the same time, Studies 1 and 3 suggested that depending on the range of ISIs, larger absolute spacing does not necessarily lead to better long-term retention than shorter spacing. In Experiment 1B, spacing of 195.65 (BS 20 treatment) - 198.47 (BS 4 treatment) seconds significantly outperformed that of 32.42 seconds (control treatment). In Study 3, however, no statistically significant difference existed among the following three mean ISIs: 58.97 (short), 117.00 (medium), and 355.91 seconds (long). Taken together, the results may suggest that (a) there may be a threshold beyond which the benefits of increasing spacing diminish, and (b) the threshold may lie somewhere between 32.42 (control treatment in Experiment 1B) and 58.97 seconds
(short spacing in Study 3). Nonetheless, because there were some methodological differences between Studies 1 and 3 (e.g., BS, relative spacing, and the number of filler items), the findings of the two studies may not necessarily be directly comparable, and further research is warranted.
Second, the current thesis also suggested that practising retrieval in a difficult and effortful condition (e.g., increasing spacing between encounters and using a demanding format such as productive recall) may enhance learning. In Experiment 1B, the control treatment, which produced the largest number of correct responses during retrieval practice, turned out to be the least effective 1 week after the treatment. The results of Study 2 indicated that the highest difficulty treatment may enhance learning although it may decrease learning phase performance. Similarly, although the massed group in Study 3 led to the best learning phase performance, it was the least effective on both the immediate and delayed posttests. The results seem to be consistent with the desirable difficulty framework (Bjork, 1994, 1999; Schmidt & Bjork, 1992; see 3.1.2), according to which a treatment that increases initial rate of acquisition does not always enhance long-term retention. Pedagogically, the findings indicate that (a) practising retrieval in a difficult and effortful condition may increase learning, and (b) it may be useful to raise awareness that making mistakes during learning is not necessarily a sign of ineffective learning (e.g., Bjork, 1994, 1999; Pashler et al., 2003; Schmidt & Bjork, 1992).
Third, the present thesis has yielded mixed results regarding learners’ metacognitive knowledge. Metacognition literature has shown that learners tend to have misconceptions about what constitutes an effective learning technique (e.g., Hagemeier & Mason, 2011; Hartwig & Dunlosky, 2011; Karpicke, Butler, et al., 2009; Karpicke, 2009; Kornell & Bjork, 2008; Kornell, 2009; Wissman et al., 2012; see Chapter 1). Although this finding was supported by Study 2, it was incongruent with the results of Studies 3 and 4. In Study 2, the questionnaire showed that the participants perceived the four types of retrieval formats to be effective in the following order: receptive recognition = productive recognition > receptive recall > productive recall. The results are at odds with the posttest results because (a) the recall condition fared significantly better than the recognition condition on the productive recall posttest, and (b) no significant difference was detected between the recall and recognition conditions on the other three posttests.
In Studies 3 and 4, however, the questionnaire results were consistent with the posttest results. The questionnaire in Study 3 showed that the participants (a) perceived the small and medium spacing schedules to be more effective than the massed schedule and (b) considered equal and expanding spacing to be equally effective, both of which were supported by the posttest results. In Study 4, the questionnaire indicated that the participants (a) perceived five and seven retrievals to be more effective than one and three retrievals and (b) considered immediate and delayed feedback to be equally effective. Both of these perceptions were, once again, congruent with the posttest
results. Overall, the results suggest that contrary to the findings of earlier metacognition research, the participants in Studies 3 and 4 might not necessarily have had misconceptions about what constitutes an effective learning technique.
The inconsistent results may have been partly due to a difference in participants. Although most existing metacognition studies have been conducted with American university students (e.g., Hagemeier & Mason, 2011; Hartwig & Dunlosky, 2011; Karpicke, Butler, et al., 2009; Karpicke, 2009; Kornell & Bjork, 2008; Kornell, 2009; Wissman et al., 2012), Studies 3 and 4 in this thesis, where the questionnaire results were consistent with the posttest results, were conducted with Japanese technical college and university students. Because Japanese students tend to be more experienced in and proficient at rote memorisation than American students (Tinkham, 1989), the participants of Studies 3 and 4 might have been more familiar with effective learning strategies than suggested by earlier studies. This may be partly the reason why the results of Studies 3 and 4 were at odds with those of the existing metacognition literature. However, due to the paucity of metacognition research on Japanese students, further research is warranted.
Fourth, the present thesis suggested that it may be useful to survey the cognitive psychology literature when researching L2 vocabulary acquisition. Although the present study is an applied linguistics one, most studies reviewed in the thesis turned out to be those conducted in cognitive psychology: BS (Brown, 1924; Crothers &
Suppes, 1967; Kornell, 2009; McGeoch, 1931; Van Bussel, 1994), retrieval formats (e.g., Bjork & Whitten, 1974; Butler & Roediger, 2007; Carpenter & DeLosh, 2006; Kang et al., 2007; Van Bussel, 1994), absolute spacing (e.g., Bird, 2010; Cepeda et al., 2006, 2008, 2009; Pashler et al., 2007; Rohrer & Pashler, 2007), relative spacing (e.g., Balota et al., 2006; Carpenter & DeLosh, 2005; Cull et al., 1996; Cull, 2000; Karpicke & Roediger, 2007a; Landauer & Bjork, 1978; Logan & Balota, 2008; Shaughnessy & Zechmeister, 1992), retrieval frequency (Karpicke & Roediger, 2007a, 2007b; Logan & Balota, 2008; Rohrer et al., 2005; Tulving, 1967; Zaromb & Roediger, 2010), and feedback timing (e.g., Butler et al., 2007; Carpenter & Vul, 2011; Kulik & Kulik, 1988; Metcalfe et al., 2009). Although most of the above studies are rarely cited by L2 vocabulary researchers (exceptions may include Crothers & Suppes, 1967, Landauer & Bjork, 1978, and Van Bussel, 1994), they nonetheless provided useful starting points for this thesis. Future research may conduct a more comprehensive survey of the cognitive psychology literature because it may potentially advance our understanding of not only flashcard learning but also other forms of vocabulary learning.
At the same time, this thesis also suggested that the findings of the cognitive psychology studies may not necessarily be applicable to L2 vocabulary learning. For instance, although previous psychology research has found the superiority of recall over recognition on a recognition posttest (Foos & Fisher, 1988; Glover, 1989; Kang et al., 2007; McDaniel et al., 2007), Study 2 failed to find any advantage of recall on
the recognition posttests. Similarly, whereas the benefits of expanding over equal spacing (Cull et al., 1996; Dobson, 2011; Karpicke & Roediger, 2007a; Landauer & Bjork, 1978; Logan & Balota, 2008; Maddox et al., 2011; Storm et al., 2010) and delayed over immediate feedback (e.g., Butler et al., 2007; Kulik & Kulik, 1988; Metcalfe et al., 2009) have been observed by existing psychology studies, neither expanding spacing nor delayed feedback significantly increased learning in the present research.
The contradictory results between the present study and cognitive psychology research may be ascribed in part to at least two methodological differences. First, most psychology studies have investigated the learning of materials other than L2 vocabulary such as L1 vocabulary (e.g., Balota et al., 2006; Bjork & Whitten, 1974; Carpenter & DeLosh, 2006; Karpicke & Roediger, 2007a, 2007b; Logan & Balota, 2008; Maddox et al., 2011; Metcalfe et al., 2009; Rohrer et al., 2005), reading materials (e.g., Butler et al., 2007; Duchastel, 1981; Foos & Fisher, 1988; Glover, 1989; Kang et al., 2007; Karpicke & Roediger, 2010; McDaniel et al., 2007; Storm et al., 2010), general facts (Cepeda et al., 2008, 2009; Cull et al., 1996; Shaughnessy & Zechmeister, 1992), and face-name pairs (Carpenter & DeLosh, 2005; Carpenter & Vul, 2011; Cull et al., 1996; Landauer & Bjork, 1978). Because the learning of these materials may involve a different process compared with that of L2 vocabulary, some of the findings in the present thesis might have been inconsistent with those of existing psychology studies.
Second, as pointed out in Chapters 3-5, feedback has not been given in some psychology studies (e.g., Bjork & Whitten, 1974; Carpenter & DeLosh, 2006; Cull et al., 1996; Dobson, 2011; Karpicke & Roediger, 2007a; Landauer & Bjork, 1978; Logan & Balota, 2008; Maddox et al., 2011; Storm et al., 2010). In this thesis, however, feedback was provided after retrievals in order to increase learning and ecological validity. Because the effects of treatments may sometimes interact with the absence or presence of feedback (Balota et al., 2007; Cepeda et al., 2006; Cull et al., 1996; Kang et al., 2007; Storm et al., 2010; see Chapters 2 and 4), the findings of the previous psychology research might not have always been applicable to the present study. The above discussion suggests that although the cognitive psychology literature might provide some useful information, we may need to be cautious about making recommendations about L2 vocabulary learning based solely on psychology studies, which tend to employ different methodologies from SLA research. It may be useful for applied linguists to empirically test the findings of the cognitive psychology studies in the context of L2 vocabulary learning, which the present thesis has attempted to do (see Barcroft, 2007, and Bird, 2012, for examples of similar attempts).
Methodological implications
Next, let us consider the implications of this thesis for research methodology. First, the thesis has shown the importance of controlling for possible confounding variables
in empirical research. For instance, the results of Study 1 showed that although a large BS is more effective than a small one when spacing is confounded, there is no difference between the two when they have equivalent spacing. The findings seem to demonstrate the need to control for spacing when comparing the effects of different treatments. Otherwise, differential spacing could possibly offset the effects of manipulations that researchers are interested in, which seems to be the case with earlier BS studies (Brown, 1924; Crothers & Suppes, 1967; Kornell, 2009; McGeoch, 1931; Seibert, 1932). Similarly, this thesis has shown the value of isolating the effects of BS and lag to test (Study 1), relative and absolute spacing (Study 3), retrieval frequency and exposure frequency (Study 4), retrieval frequency and item difficulty (Study 4), and feedback timing and lag to test (Study 4), which were often confounded in existing research.
The present thesis also provided some useful information regarding dependent measures to be used in empirical research. First, it indicated that superior learning phase performance is not always associated with better posttest performance (see 6.1). The finding suggests that it might not be valid to use learning phase performance as an index of long-term retention (Bjork, 1994, 1999; Ellis, 1995; Schmidt & Bjork, 1992). For instance, some empirical studies have used trials to criterion, or the number of trials required for learners to successfully recall a target item a given number of times during learning, as dependent measures (e.g., Higa, 1963; Tinkham, 1993, 1997; Waring, 1997b). However, considering that a treatment that maximises
learning phase performance does not necessarily lead to superior long-term retention, other types of dependent variables such as posttest scores might be a more direct and valid measure of learning outcomes.
Second, when examining the effects of instructional interventions, it may be useful to give a posttest at a relatively long RI such as 4 weeks. In Study 4, posttests were administered at three RIs: immediately, 1 week, and 4 weeks after the treatment. By using a 4-week RI, Study 4 showed that the effects of repeated retrieval may be more durable than indicated by earlier research (Rohrer et al., 2005). Although the present thesis demonstrated the value of a long RI, most existing studies on paired-associate learning have used a relatively short RI: less than 24 hours (e.g., Balota et al., 2006; Carpenter & DeLosh, 2005; Cull et al., 1996; Landauer & Bjork, 1978; Maddox et al., 2011; Pyc & Rawson, 2007; Shaughnessy & Zechmeister, 1992; S. A. Webb, 2009a, 2009b), 1 day (Crothers & Suppes, 1967; Logan & Balota, 2008; McGeoch, 1931), 2 days (Karpicke & Roediger, 2007a), 1 week (e.g., Karpicke & Bauernschmidt, 2011; Karpicke & Roediger, 2008; Pashler et al., 2003; Pyc & Rawson, 2009, 2011; Van Bussel, 1994), 10 days (Cepeda et al., 2009, Experiment 1), 2 weeks (Mondria & Wiersma, 2004), and 3 weeks (Steinel et al., 2007). In future research, it may be useful to use a longer RI and investigate long-term effects (see Anderson & Jordan, 1928; Kang et al., 2013; Seibert, 1927, 1930, 1932; Waring, 1997a, for examples of studies using an RI of 4 weeks or greater).
Third, this thesis underscored the value of measuring both receptive and productive knowledge with different sensitivities (e.g., Barcroft & Rott, 2010; Thomas & Dieter, 1987; S. A. Webb, 2002, 2005, 2007a, 2007b, 2012). In Studies 1-4, learning was measured by both productive and receptive posttests, and responses were scored at two levels of sensitivity: strict and sensitive (see 2.2.11). In this thesis, receptive and productive tests often produced inconsistent results. In Experiment 1B, for instance, although the BS 4 and 20 groups significantly outperformed the control group on the productive posttest, no significant difference existed on the receptive test. In Study 2, the recall condition fared significantly better than the recognition and combined conditions on the productive recall test, but not on the receptive recall posttest. Different scoring methods also yielded contrasting results. In Study 2, although the combined condition was significantly more effective than the recognition condition with strict scoring on the productive recall test, there was no significant difference between the two with sensitive scoring. Similarly, the highest difficulty condition in Study 2 significantly outperformed the recognition and combined conditions with strict scoring on the productive recall test, but not with sensitive scoring. These findings indicate that it may be useful for empirical studies to measure both receptive and productive vocabulary knowledge with different sensitivities.
Fourth, the sensitive scoring method employed in this thesis for a productive recall test may be useful for future studies. The present thesis developed a new sensitive scoring method based on a lexical production scoring protocol (LPSP; see 2.2.11).
Our sensitive scoring system may be preferable to those used in previous research in at least four respects (see 2.2.11). First, our sensitive scoring method does not involve subjective judgement and is objective and replicable. Second, our sensitive scoring system takes account of differences in word length unlike other procedures (Thomas & Dieter, 1987; Waring, 1997a) and may be affected by word length to a lesser extent. Third, because it is less lenient than LPSP, our sensitive scoring method might have a smaller chance of giving credit for wild guessing. Fourth, unlike LPSP, our sensitive scoring procedure produces interval data, which is amenable to parametric tests such as ANOVA. Due to the above advantages, the sensitive scoring method proposed in this thesis may be used by future studies. The computer program developed for this thesis to automatically score responses on the productive test may also be made available for future research (see Rogers, Webb, & Nakata, 2013, for an example of other research using the sensitive scoring protocol and computer program developed in this thesis).
Lastly, the protocol developed in this thesis to determine the retrieval cues might be useful for future research. In Studies 1, 3, and 4, one letter in the target word and the number of letters in the word (e.g., _ _ n _ for mane) were given in the productive pretest in order to prevent learners from providing synonyms. Studies 1, 3, and 4 suggested that the retrieval cues used in this thesis might have been useful because they effectively prevented participants from providing synonyms for a target word. The protocol developed in this thesis to determine the retrieval cues, therefore, may be
used by future research (see Rogers et al., 2013, for an example of other research using the same protocol).
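The format of these retrieval cues (one revealed letter plus one blank per remaining letter) can be sketched in Python as follows. This is a minimal illustration rather than the program used in the thesis: the function name and parameters are hypothetical, and the position of the revealed letter is simply passed in, because the procedure for choosing which letter to reveal is not reproduced here.

```python
def make_retrieval_cue(word, revealed_index, blank="_", sep=""):
    """Build a cue that shows one letter and blanks out the rest.

    revealed_index is the zero-based position of the letter to show.
    How that position is chosen is outside the scope of this sketch.
    """
    if not 0 <= revealed_index < len(word):
        raise ValueError("revealed_index must point at a letter in the word")
    return sep.join(
        ch if i == revealed_index else blank
        for i, ch in enumerate(word)
    )


print(make_retrieval_cue("mane", 2, sep=" "))  # _ _ n _
print(make_retrieval_cue("apparition", 6))     # ______t___
```

Because the cue preserves the number of letters, a synonym of a different length, or one lacking the revealed letter in that position, does not fit the cue.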
6.3 Pedagogical Implications
The purpose of this thesis was to investigate how we can optimise L2 vocabulary learning from flashcards, a form of intentional vocabulary learning. We might be able to make the following recommendations based on the results from the four studies in this thesis:
1. Introducing a large amount of spacing between encounters may be more important than using a particular BS, especially for improving long-term retention (Study 1).
2. One advantage of using a small BS may be that it may increase learning phase performance and thus motivate learners. A disadvantage may be that it may result in under-learning because it may create an illusion of successful learning (Study 1).
3. For the acquisition of knowledge of orthography, L2 words may need to be practised in a productive recall format at least twice (Study 2).
4. For the acquisition of form-meaning connections, recognition formats may be more desirable than recall (Study 2).
5. Computer-paced feedback may be used in flashcard software because it might be more effective than self-paced feedback. In paper-based flashcard learning, spending several extra seconds viewing the correct response may potentially increase learning (Study 2).
6. Learners should introduce spacing between encounters of a given item because spacing increases learning (Study 3).
7. Learners should use a mean ISI of around 2 minutes in flashcard learning (Study 3).
8. Learners may study with either equal or expanding spacing. Equal spacing may be used in paper-based flashcard learning because it may be easier to implement manually (Study 3).
9. It may be most desirable to practise retrieval five times. The advantage of repeated retrieval may persist at least 4 weeks after the treatment (Study 4).
10. If learners do not have enough time, it may be efficient to practise retrieval only once (Study 4).
11. Learners may study with either immediate or delayed feedback. Immediate feedback may be used in paper-based flashcard learning because it may be easier to implement manually (Study 4).
12. Increasing retrieval frequency may enhance learning more than increasing the number of encounters with target words (Study 4).
13. Practising retrieval in a difficult and effortful condition (e.g., increasing spacing between encounters and using a demanding format such as productive recall) may enhance learning (Studies 1, 2, and 3).
14. Learning phase performance may not necessarily be a good index of long-term retention. It may be useful to raise awareness that making mistakes during learning is not necessarily a sign of ineffective learning (Studies 1, 2, and 3).
6.4 Limitations and Further Research
Although the results from the four studies in this thesis are useful, the current thesis also suffers from some limitations. One limitation may be the rather short duration of the treatments. The duration of the treatments in this thesis ranged from 6.48 (Retrieval 1 group in Study 4) to 48.79 minutes (self-paced group in Study 2) on average, and the number of target words ranged from 16 (Study 4) to 60 (Study 2). Furthermore, the study opportunities were massed into a single treatment session in all four studies. In a real-life study situation, however, study opportunities tend to be distributed over multiple sessions (Cepeda et al., 2008). In future research, it may be valuable to investigate the effects of factors such as BS, retrieval formats, spacing, retrieval frequency, and feedback timing on flashcard learning over a longer period of time.
Another limitation may be the type of spacing used in the present thesis. Previous studies have distinguished between two types of spacing: between-session spacing and within-session spacing (e.g., Kang et al., 2013; Kornell, 2009). The former refers to the amount of spacing between separate study sessions, whereas the latter refers to the amount of spacing between encounters of a given item in a single study session (see Chapter 4 for further details). Although Studies 1 and 3 manipulated spacing, both studies were concerned only with within-session spacing. Future research may manipulate between-session spacing (see Cepeda et al., 2009; Kang et al., 2013, for
examples).
A third limitation may be a possible ceiling effect. In Study 2, nearly or more than half of the scores were full marks on the recognition posttests. Similarly, in Study 4, the immediate posttest scores of the Retrieval 5 and 7 groups neared the ceiling. The lack of statistical significance on these posttests may be partly ascribed to a possible ceiling effect.
A fourth limitation may be that the participants took multiple posttests in all experiments. In Study 2, for instance, four kinds of posttests (productive recall, productive recognition, receptive recall, and receptive recognition) were given to each participant. Because correct responses in the productive test were used as cues in the receptive test, and correct responses in the receptive test were used as cues in the productive test, earlier tests might have affected performance on later tests. Similarly, in Study 4, each participant sat posttests at three RIs: immediately, 1 week, and 4 weeks after the treatment, potentially causing earlier tests to affect performance on later tests. Future research may administer fewer posttests per participant in order to reduce potential learning effects from posttests (e.g., Karpicke & Roediger, 2007a; Logan & Balota, 2008; Rohrer et al., 2005).
Fifth, SDs on the posttest scores were relatively large in the current thesis (see 2.2.13 and 4.7.1), indicating that there existed large variations between individuals. Individual differences, however, were not dealt with in the current thesis. Future research may investigate how individual learner factors such as L2 proficiency or
working memory capacity may affect flashcard learning.
Lastly, there were two major methodological differences between Study 2 and the other three studies in this thesis. While Study 2 investigated the learning of Swahili vocabulary by English native speakers, the other three looked into English vocabulary learning by Japanese learners. Furthermore, although the participants in Studies 1, 3, and 4 had prior knowledge of the target language, those in Study 2 did not. The findings of Study 2, therefore, may not necessarily be directly comparable to those of the other three studies. Moreover, because Japanese students tend to be proficient at flashcard learning (Tinkham, 1989), the results of Studies 1, 3, and 4 may not necessarily generalise to other populations. Future research may replicate these studies with participants from different backgrounds.
6.5 Conclusion
The purpose of this thesis was to investigate how we can optimise L2 vocabulary learning from flashcards. The effects of the following seven factors were investigated: BS (Study 1), retrieval formats (Study 2), feedback pacing (Study 2), absolute spacing (Studies 1 and 3), relative spacing (Study 3), retrieval frequency (Study 4), and feedback timing (Study 4). The results from the four studies in this thesis allowed us to propose guidelines regarding the optimal way to learn from flashcards (see 6.3). Although some of the guidelines are consistent with the findings of earlier research (e.g., spacing increases learning and learning phase performance may not necessarily be a good index of long-term retention), the present thesis also produced some non-obvious findings such as the following: (a) BS may have little effect on posttest performance as long as spacing is equivalent, (b) for the acquisition of form-meaning connections, recognition formats may be more desirable than recall, (c) learners may study with either equal or expanding spacing, (d) learners may study with either immediate or delayed feedback, (e) it may be most desirable to practice retrieval five times, and (f) the advantage of repeated retrieval may persist at least 4 weeks after the treatment. Although flashcard learning may be effective, efficient, useful, and common, previous studies suggest that many learners are not very proficient at it (see Chapter 1). The guidelines proposed in this thesis will hopefully contribute to improved performance using flashcards.
Appendix A: Example of the Pretest (Studies 1, 3, and 4)
Receptive pretest
No.  Cue          Correct response
1    apple        りんご
2    orange       オレンジ
3    banana       バナナ
4    urn          骨つぼ
5    toupee       かつら
6    dally        もてあそぶ
7    rue          後悔する
8    mirth        陽気
9    gouge        彫る
10   scowl        にらむ
11   loach        ドジョウ
12   mane         たてがみ
13   billow       うねり
14   quail        ウズラ
15   vestige      なごり
16   fawn         へつらう
17   cadge        ねだる
18   apparition   幽霊
19   warble       さえずる
20   fracas       けんか
21   citadel      砦
22   levee        堤防
23   grig         コオロギ
24   pique        怒らせる
25   nadir        どん底
26   promontory   岬
Productive pretest
No.  Cue             Retrieval cue  Correct response
1    りんご          __p__          apple
2    オレンジ        __a___         orange
3    バナナ          ___a__         banana
4    骨つぼ          _r_            urn
5    ドジョウ        __a__          loach
6    陽気            _i___          mirth
7    彫る            _o___          gouge
8    にらむ          _c___          scowl
9    ウズラ          __a__          quail
10   かつら          ____e_         toupee
11   もてあそぶ      _a___          dally
12   たてがみ        __n_           mane
13   後悔する        _u_            rue
14   うねり          ___l__         billow
15   なごり          _e_____        vestige
16   怒らせる        _i___          pique
17   堤防            ___e_          levee
18   砦              _____e_        citadel
19   けんか          __a___         fracas
20   コオロギ        __i_           grig
21   幽霊(ゆうれい)  ______t___     apparition
22   どん底          _a___          nadir
23   ねだる          _a___          cadge
24   へつらう        _a__           fawn
25   さえずる        _a____         warble
26   岬              ______t___     promontory
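The retrieval cues in the productive pretest reveal one letter of the target word in its position and replace every other letter with an underscore. As an illustration only, the following minimal Python sketch shows how a typed response could be checked for consistency with such a cue. The function name `matches_cue` and the normalisation (trimming and lower-casing) are assumptions made for this example; the thesis does not describe its scoring procedure in these terms.

```python
def matches_cue(response: str, cue: str) -> bool:
    """Return True if the response has the cue's length and agrees
    with every revealed (non-underscore) letter of the cue.

    Hypothetical helper for illustration; not the thesis's scoring code.
    """
    response = response.strip().lower()
    if len(response) != len(cue):
        return False
    return all(c == "_" or c == r for c, r in zip(cue.lower(), response))


# Cues and words taken from the productive pretest example.
assert matches_cue("loach", "__a__")
assert matches_cue("quail", "__a__")      # one cue can fit several words
assert not matches_cue("mane", "__a__")   # wrong length
assert not matches_cue("gorge", "_o__e_")  # wrong length for a 6-slot cue
```

Because a single cue pattern such as `__a__` fits more than one target (loach and quail above), the cue narrows the set of possible answers but does not identify the word without the L1 meaning shown alongside it.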
Appendix B: Example of the Posttest (Studies 1, 3, and 4)
Productive posttest
No.  Cue             Correct response
1    なごり          vestige
2    岬              promontory
3    骨つぼ          urn
4    彫る            gouge
5    にらむ          scowl
6    ウズラ          quail
7    怒らせる        pique
8    へつらう        fawn
9    どん底          nadir
10   うねり          billow
11   幽霊            apparition
12   砦              citadel
13   コオロギ        grig
14   ドジョウ        loach
15   さえずる        warble
16   後悔する        rue
17   堤防            levee
18   たてがみ        mane
19   もてあそぶ      dally
20   けんか          fracas
21   陽気            mirth
22   かつら          toupee
23   ねだる          cadge
Receptive posttest
No.  Cue          Correct response
1    promontory   岬
2    urn          骨つぼ
3    vestige      なごり
4    fawn         へつらう
5    grig         コオロギ
6    pique        怒らせる
7    quail        ウズラ
8    gouge        彫る
9    billow       うねり
10   scowl        にらむ
11   citadel      砦
12   apparition   幽霊
13   nadir        どん底
14   loach        ドジョウ
15   fracas       けんか
16   cadge        ねだる
17   warble       さえずる
18   dally        もてあそぶ
19   levee        堤防
20   toupee       かつら
21   mirth        陽気
22   mane         たてがみ
23   rue          後悔する
Appendix C: Retrieval Cues Used in the Productive Pretest (Experiment 1B)
English      Retrieval cue   English      Retrieval cue
Target items
apparition   ______t___      loach        __a__
billow       ___l__          mane         __n_
cadge        _a___           mirth        _i___
citadel      _____e_         nadir        _a___
dally        _a___           pique        _i___
fawn         _a__            quail        __a__
fracas       __a___          rue          _u_
gouge        _o___           scowl        _c___
grig         __i_            toupee       ____e_
levee        ___e_           warble       _a____
Filler items
promontory   ______t___      vestige      _e_____
urn          _r_
Appendix D: Swahili-English Word Pairs Used in Study 2
Item set A             Item set B             Item set C             Item set D
Swahili    English     Swahili    English     Swahili    English     Swahili    English
adhama     honour      adui       enemy       dalasini   cinnamon    ankra      invoice
bandari    harbour     baharia    sailor      duara      wheel       bustani    garden
chura      frog        chakula    food        embe       mango       chimbo     quarry
desturi    custom      fununu     rumour      farasi     horse       gutu       stump
elimu      science     gharika    flood       goti       knee        handaki    trench
fumbo      mystery     hamira     yeast       hariri     silk        iktisadi   economy
hadithi    story       jani       leaf        inda       spite       jibini     cheese
joko       kiln        jeraha     wound       kaputula   shorts      lozi       almond
maiti      corpse      kamba      rope        lango      gate        nira       yoke
mfupa      bone        leso       scarf       malkia     queen       rembo      ornament
nabii      prophet     rafiki     friend      nanga      anchor      sumu       poison
pipa       barrel      sanda      shroud      ruba       leech       utenzi     poem
samadi     manure      talaka     divorce     theluji    snow        vuke       steam
yamini     oath        yatima     orphan      ubini      forgery     vumbi      dust
zulia      carpet      ziwa       lake        wasaa      leisure     wakili     agent
Appendix E: Example of the Posttest (Study 2)
Productive recall posttest
No.  Cue        Correct response
1    poem       utenzi
2    enemy      adui
3    custom     desturi
4    mango      embe
5    stump      gutu
6    food       chakula
7    bone       mfupa
8    spite      inda
9    trench     handaki
10   orphan     yatima
11   carpet     zulia
12   snow       theluji
13   invoice    ankra
14   divorce    talaka
15   story      hadithi
16   horse      farasi
17   cheese     jibini
18   scarf      leso
19   mystery    fumbo
20   cinnamon   dalasini
21   dust       vumbi
22   friend     rafiki
23   frog       chura
24   leech      ruba
25   almond     lozi
26   sailor     baharia
27   harbour    bandari
28   anchor     nanga
29   poison     sumu
30   rumour     fununu
31   kiln       joko
32   silk       hariri
33   agent      wakili
34   lake       ziwa
35   science    elimu
36   leisure    wasaa
37   yoke       nira
38   rope       kamba
39   manure     samadi
40   forgery    ubini
41   economy    iktisadi
42   leaf       jani
43   honour     adhama
44   shorts     kaputula
45   ornament   rembo
46   flood      gharika
47   corpse     maiti
48   gate       lango
49   steam      vuke
50   yeast      hamira
51   oath       yamini
52   knee       goti
53   garden     bustani
54   wound      jeraha
55   prophet    nabii
56   queen      malkia
57   quarry     chimbo
58   shroud     sanda
59   barrel     pipa
60   wheel      duara
Productive recognition posttest
                Multiple-choice options
No.  Cue        1          2          3          4          Correct response
1    cheese     jibini     hamira     elimu      farasi     1. jibini
2    silk       gharika    hariri     talaka     sanda      2. hariri
3    frog       chura      fumbo      ubini      yamini     1. chura
4    snow       theluji    wakili     joko       wasaa      1. theluji
5    enemy      fununu     nabii      adui       vuke       3. adui
6    story      hadithi    handaki    sumu       kamba      1. hadithi
7    food       utenzi     lango      chakula    bustani    3. chakula
8    dust       embe       vumbi      samadi     inda       2. vumbi
9    scarf      bandari    jeraha     duara      leso       4. leso
10   carpet     gutu       zulia      goti       iktisadi   2. zulia
11   leech      adhama     maiti      baharia    ruba       4. ruba
12   friend     pipa       rafiki     lozi       nira       2. rafiki
13   bone       jani       chimbo     mfupa      desturi    3. mfupa
14   anchor     dalasini   nanga      malkia     ziwa       2. nanga
15   invoice    ankra      yatima     rembo      kaputula   1. ankra
16   mango      adui       embe       bustani    sanda      2. embe
17   rumour     chura      yamini     samadi     fununu     4. fununu
18   poison     hariri     elimu      sumu       nabii      3. sumu
19   divorce    chakula    lango      talaka     hamira     3. talaka
20   mystery    handaki    fumbo      ubini      hadithi    2. fumbo
21   poem       jibini     vuke       wasaa      utenzi     4. utenzi
22   kiln       wakili     gharika    joko       vumbi      3. joko
23   horse      theluji    inda       kamba      farasi     4. farasi
24   harbour    rembo      jeraha     mfupa      bandari    4. bandari
25   cinnamon   dalasini   nanga      adhama     iktisadi   1. dalasini
26   almond     duara      maiti      rafiki     lozi       4. lozi
27   orphan     pipa       zulia      nira       yatima     4. yatima
28   custom     goti       leso       kaputula   desturi    4. desturi
29   stump      ankra      malkia     jani       gutu       4. gutu
30   sailor     ziwa       ruba       chimbo     baharia    4. baharia
31   trench     hadithi    sumu       handaki    wakili     3. handaki
32   leisure    wasaa      kamba      fumbo      theluji    1. wasaa
33   prophet    embe       nabii      lango      adui       2. nabii
34   flood      elimu      gharika    farasi     hariri     2. gharika
35   oath       sanda      yamini     utenzi     chura      2. yamini
36   garden     joko       chakula    bustani    samadi     3. bustani
37   spite      talaka     ubini      inda       vumbi      3. inda
38   yeast      hamira     fununu     jibini     vuke       1. hamira
39   quarry     chimbo     mfupa      duara      yatima     1. chimbo
40   corpse     nira       ruba       maiti      desturi    3. maiti
41   wound      leso       jeraha     baharia    goti       2. jeraha
42   shorts     ziwa       ankra      kaputula   dalasini   3. kaputula
43   economy    iktisadi   jani       zulia      lozi       1. iktisadi
44   barrel     gutu       rafiki     rembo      pipa       4. pipa
45   queen      adhama     bandari    malkia     nanga      3. malkia
46   rope       chakula    kamba      wasaa      farasi     2. kamba
47   steam      nabii      vuke       hariri     utenzi     2. vuke
48   science    yamini     sumu       elimu      theluji    3. elimu
49   agent      joko       wakili     chura      gharika    2. wakili
50   gate       lango      inda       talaka     hadithi    1. lango
51   manure     samadi     bustani    vumbi      fununu     1. samadi
52   forgery    jibini     fumbo      hamira     ubini      4. ubini
53   shroud     sanda      adui       embe       handaki    1. sanda
54   knee       chimbo     desturi    goti       leso       3. goti
55   leaf       jani       nanga      malkia     gutu       1. jani
56   yoke       iktisadi   nira       rafiki     yatima     2. nira
57   wheel      lozi       maiti      ruba       duara      4. duara
58   lake       baharia    kaputula   ziwa       mfupa      3. ziwa
59   ornament   bandari    pipa       ankra      rembo      4. rembo
60   honour     adhama     zulia      jeraha     dalasini   1. adhama
Receptive recall posttest
No.  Cue        Correct response
1    inda       spite
2    sumu       poison
3    fununu     rumour
4    fumbo      mystery
5    hariri     silk
6    gutu       stump
7    rafiki     friend
8    joko       kiln
9    ruba       leech
10   handaki    trench
11   chakula    food
12   chura      frog
13   dalasini   cinnamon
14   utenzi     poem
15   baharia    sailor
16   hadithi    story
17   nanga      anchor
18   ankra      invoice
19   adui       enemy
20   mfupa      bone
21   theluji    snow
22   lozi       almond
23   talaka     divorce
24   bandari    harbour
25   farasi     horse
26   vumbi      dust
27   leso       scarf
28   desturi    custom
29   embe       mango
30   jibini     cheese
31   yatima     orphan
32   zulia      carpet
33   duara      wheel
34   iktisadi   economy
35   kamba      rope
36   nabii      prophet
37   goti       knee
38   vuke       steam
39   jeraha     wound
40   pipa       barrel
41   kaputula   shorts
42   bustani    garden
43   hamira     yeast
44   samadi     manure
45   lango      gate
46   wakili     agent
47   sanda      shroud
48   maiti      corpse
49   malkia     queen
50   rembo      ornament
51   jani       leaf
52   adhama     honour
53   ubini      forgery
54   nira       yoke
55   ziwa       lake
56   elimu      science
57   wasaa      leisure
58   chimbo     quarry
59   gharika    flood
60   yamini     oath
Receptive recognition posttest
                Multiple-choice options
No.  Cue        1          2          3          4          Correct response
1    adui       rumour     prophet    enemy      steam      3. enemy
2    hadithi    story      trench     poison     rope       1. story
3    hariri     shroud     silk       divorce    flood      2. silk
4    vumbi      dust       mango      manure     spite      1. dust
5    theluji    agent      kiln       leisure    snow       4. snow
6    chakula    garden     food       gate       poem       2. food
7    chura      forgery    mystery    oath       frog       4. frog
8    jibini     science    horse      cheese     yeast      3. cheese
9    rafiki     barrel     yoke       almond     friend     4. friend
10   ankra      shorts     orphan     invoice    ornament   3. invoice
11   zulia      carpet     stump      economy    knee       1. carpet
12   ruba       corpse     leech      honour     sailor     2. leech
13   leso       scarf      harbour    wound      wheel      1. scarf
14   mfupa      bone       leaf       custom     quarry     1. bone
15   nanga      queen      cinnamon   lake       anchor     4. anchor
16   fumbo      trench     mystery    forgery    story      2. mystery
17   utenzi     steam      leisure    cheese     poem       4. poem
18   fununu     rumour     manure     frog       oath       1. rumour
19   farasi     rope       spite      snow       horse      4. horse
20   talaka     gate       yeast      divorce    food       3. divorce
21   joko       agent      dust       flood      kiln       4. kiln
22   sumu       poison     science    silk       prophet    1. poison
23   embe       garden     shroud     mango      enemy      3. mango
24   desturi    shorts     scarf      knee       custom     4. custom
25   lozi       wheel      almond     corpse     friend     2. almond
26   yatima     carpet     yoke       barrel     orphan     4. orphan
27   bandari    harbour    ornament   wound      bone       1. harbour
28   gutu       queen      invoice    leaf       stump      4. stump
29   baharia    lake       sailor     leech      quarry     2. sailor
30   dalasini   anchor     economy    cinnamon   honour     3. cinnamon
31   hamira     yeast      cheese     steam      rumour     1. yeast
32   nabii      gate       prophet    enemy      mango      2. prophet
33   inda       forgery    dust       divorce    spite      4. spite
34   yamini     poem       oath       frog       shroud     2. oath
35   handaki    story      poison     trench     agent      3. trench
36   wasaa      snow       rope       leisure    mystery    3. leisure
37   bustani    kiln       food       garden     manure     3. garden
38   gharika    horse      flood      silk       science    2. flood
39   iktisadi   carpet     almond     leaf       economy    4. economy
40   jeraha     knee       sailor     scarf      wound      4. wound
41   malkia     harbour    queen      honour     anchor     2. queen
42   pipa       stump      ornament   friend     barrel     4. barrel
43   chimbo     orphan     quarry     bone       wheel      2. quarry
44   kaputula   shorts     cinnamon   lake       invoice    1. shorts
45   maiti      leech      yoke       corpse     custom     3. corpse
46   ubini      cheese     mystery    forgery    yeast      3. forgery
47   kamba      food       horse      rope       leisure    3. rope
48   vuke       poem       silk       prophet    steam      4. steam
49   samadi     dust       garden     manure     rumour     3. manure
50   wakili     frog       agent      kiln       flood      2. agent
51   lango      story      gate       divorce    spite      2. gate
52   sanda      mango      shroud     enemy      trench     2. shroud
53   elimu      poison     snow       science    oath       3. science
54   duara      wheel      leech      corpse     almond     1. wheel
55   adhama     carpet     honour     cinnamon   wound      2. honour
56   jani       anchor     stump      leaf       queen      3. leaf
57   nira       yoke       economy    friend     orphan     1. yoke
58   goti       knee       custom     scarf      quarry     1. knee
59   ziwa       lake       shorts     bone       sailor     1. lake
60   rembo      ornament   harbour    barrel     invoice    1. ornament
Appendix F: Retrieval Cues Used in the Productive Pretest (Study 3)
English      Retrieval cue   English      Retrieval cue
Target items
apparition   ______t___      grig         _r__
billow       ___l__          levee        ___e_
cadge        _a___           loach        __a__
citadel      _____e_         mane         __n_
dally        _a___           mirth        _i___
fawn         _a__            nadir        _a___
fracas       __a___          pique        _i___
gouge        _o___           quail        __a__
apparition   ______t___      rue          _u_
billow       ___l__          scowl        _c___
Filler items
abet         __e_            smudge       ____g_
banter       ____e_          tyro         __r_
exalt        __a__           urn          _r_
husk         _u__            usurp        __u__
jibe         _i__            valor        _a___
polemic      ___e___         vestige      _e_____
promontory   ______t___
Appendix G: Retrieval Cues Used in the Productive Pretest (Study 4)
English      Retrieval cue   English      Retrieval cue
Target items
apparition   ___a______      levee        ___e_
billow       ___l__          loach        _o___
citadel      _____e_         mane         __n_
dally        __l__           mirth        __r__
gouge        _o___           rue          _u_
grig         _r__            toupee       ____e_
husk         _u__            urn          _r_
jibe         _i__            warble       ____l_
Filler items
lava         _a__            valor        __l__
tyro         __r_
References
Anderson, J. P., & Jordan, A. M. (1928). Learning and retention of Latin words and phrases. Journal of Educational Psychology, 19, 485–496. doi:10.1037/h0073011
Baddeley, A. D. (1997). Human memory: Theory and practice (Revised ed.). East Sussex, UK: Psychology Press.
Baddeley, A. D., Thomson, N., & Buchanan, M. (1975). Word length and the structure of short-term memory. Journal of Verbal Learning and Verbal Behavior, 14, 575–589. doi:10.1016/S0022-5371(75)80045-4
Bahrick, H. P. (1984). Semantic memory content in permastore: Fifty years of memory for Spanish learned in school. Journal of Experimental Psychology: General, 113, 1–37. doi:10.1037/0096-3445.113.1.1
Bahrick, H. P., Bahrick, L. E., Bahrick, A. S., & Bahrick, P. E. (1993). Maintenance of foreign language vocabulary and the spacing effect. Psychological Science, 4, 316–321. doi:10.1111/j.1467-9280.1993.tb00571.x
Bahrick, H. P., & Hall, L. K. (2005). The importance of retrieval failures to long-term retention: A metacognitive explanation of the spacing effect. Journal of Memory and Language, 52, 566–577. doi:10.1016/j.jml.2005.01.012
Bahrick, H. P., & Phelps, E. (1987). Retention of Spanish vocabulary over 8 years. Journal of Experimental Psychology: Learning, Memory, & Cognition, 13, 344–349. doi:10.1037/0278-7393.13.2.344
Balota, D. A., Duchek, J. M., & Logan, J. M. (2007). Is expanded retrieval practice a superior form of spaced retrieval? A critical review of the extant literature. In J. S. Nairne (Ed.), The foundations of remembering: Essays in honor of Henry L. Roediger III (pp. 83–105). New York, NY: Psychology Press.
Balota, D. A., Duchek, J. M., & Paullin, R. (1989). Age-related differences in the impact of spacing, lag, and retention interval. Psychology and Aging, 4, 3–9. doi:10.1037/0882-7974.4.1.3
Balota, D. A., Duchek, J. M., Sergent-Marshall, S. D., & Roediger, H. L. (2006). Does expanded retrieval produce benefits over equal-interval spacing? Explorations of spacing effects in healthy aging and early stage Alzheimer’s disease. Psychology and Aging, 21, 19–31. doi:10.1037/0882-7974.21.1.19
Barcroft, J. (2002). Semantic and structural elaboration in L2 lexical acquisition. Language Learning, 52, 323–363. doi:10.1111/0023-8333.00186
Barcroft, J. (2003). Effects of questions about word meaning during L2 Spanish lexical learning. The Modern Language Journal, 87, 546–561. doi:10.1111/1540-4781.00207
Barcroft, J. (2004). Effects of sentence writing in second language lexical acquisition. Second Language Research, 20, 303–334. doi:10.1191/0267658304sr233oa
Barcroft, J. (2007). Effects of opportunities for word retrieval during second language vocabulary learning. Language Learning, 57, 35–56. doi:10.1111/j.1467-9922.2007.00398.x
Barcroft, J., & Rott, S. (2010). Partial word form learning in the written mode in L2 German and Spanish. Applied Linguistics, 31, 623–650. doi:10.1093/applin/amq017
Berlyne, D. E. (1966). Conditions of prequestioning and retention of meaningful material. Journal of Educational Psychology, 57, 128–132. doi:10.1037/h0023346
Bird, S. (2010). Effects of distributed practice on the acquisition of second language English syntax. Applied Psycholinguistics, 31, 635–650. doi:10.1017/S0142716410000172
Bird, S. (2012). Expert knowledge, distinctiveness, and levels of processing in language learning. Applied Psycholinguistics, 33, 665–689. doi:10.1017/S014271641100052X
Bjork, R. A. (1994). Memory and metamemory considerations in the training of human beings. In J. Metcalfe & A. Shimamura (Eds.), Metacognition: Knowing about knowing (pp. 185–205). Cambridge, MA: MIT Press.
Bjork, R. A. (1999). Assessing our own competence: Heuristics and illusions. In D. Gopher & A. Koriat (Eds.), Attention and performance XVII: Cognitive regulation of performance: Interaction of theory and application (pp. 435–459). Cambridge, MA: MIT Press.
Bjork, R. A., & Whitten, W. B. (1974). Recency-sensitive retrieval processes in long-term free recall. Cognitive Psychology, 6, 173–189. doi:10.1016/0010-0285(74)90009-7
Bloom, K. C., & Shuell, T. J. (1981). Effects of massed and distributed practice on the learning and retention of second-language vocabulary. The Journal of Educational Research, 74, 245–248.
Bonk, W. J., & Healy, A. F. (2010). Learning and memory for sequences of pictures, words, and spatial locations. American Journal of Psychology, 123, 137–168.
Bransford, J. D., Franks, J. J., Morris, C. D., & Stein, B. S. (1979). Some general constraints on learning and memory research. In L. S. Cermak & F. I. M. Craik (Eds.), Levels of processing in human memory (pp. 331–354). Mahwah, NJ: Erlbaum.
Brown, W. (1924). Whole and part methods in learning. Journal of Educational Psychology, 15, 229–233. doi:10.1037/h0072268
Butler, A. C., Karpicke, J. D., & Roediger, H. L. (2007). The effect of type and timing of feedback on learning from multiple-choice tests. Journal of Experimental Psychology: Applied, 13, 273–281. doi:10.1037/1076-898X.13.4.273
Butler, A. C., & Roediger, H. L. (2007). Testing improves long-term retention in a simulated classroom setting. European Journal of Cognitive Psychology, 19, 514–527. doi:10.1080/09541440701326097
Carpenter, S. K., & DeLosh, E. L. (2005). Application of the testing and spacing effects to name learning. Applied Cognitive Psychology, 19, 619–636. doi:10.1002/acp.1101
Carpenter, S. K., & DeLosh, E. L. (2006). Impoverished cue support enhances subsequent retention: Support for the elaborative retrieval explanation of the testing effect. Memory and Cognition, 34, 268–276. doi:10.3758/BF03193405
Carpenter, S. K., & Olson, K. M. (2012). Are pictures good for learning new vocabulary in a foreign language? Only if you think they are not. Journal of Experimental Psychology: Learning, Memory, and Cognition, 38, 92–101. doi:10.1037/a0024828
Carpenter, S. K., & Vul, E. (2011). Delaying feedback by three seconds benefits retention of face–name pairs: The role of active anticipatory processing. Memory & Cognition, 39, 1211–1221. doi:10.3758/s13421-011-0092-1
Cepeda, N. J., Coburn, N., Rohrer, D., Wixted, J. T., Mozer, M. C., & Pashler, H. (2009). Optimizing distributed practice: Theoretical analysis and practical implications. Experimental Psychology, 56, 236–246. doi:10.1027/1618-3169.56.4.236
Cepeda, N. J., Pashler, H., Vul, E., Wixted, J. T., & Rohrer, D. (2006). Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin, 132, 354–380. doi:10.1037/0033-2909.132.3.354
Cepeda, N. J., Vul, E., Rohrer, D., Wixted, J. T., & Pashler, H. (2008). Spacing effects in learning: A temporal ridgeline of optimal retention. Psychological Science, 19, 1095–1102. doi:10.1111/j.1467-9280.2008.02209.x
Chun, E., Choi, S., & Kim, J. (2012). The effect of extensive reading and paired-associate learning on long-term vocabulary retention: An event-related potential study. Neuroscience Letters, 521, 125–129. doi:10.1016/j.neulet.2012.05.069
Clariana, R. B., & Lee, D. (2001). The effects of recognition and recall study tasks with feedback in a computer-based vocabulary lesson. Educational Technology Research and Development, 49, 23–36. doi:10.1007/BF02504913
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159. doi:10.1037/0033-2909.112.1.155
Cowan, N. (2001). The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24, 87–114. doi:10.1017/S0140525X01003922
Crothers, E., & Suppes, P. (1967). Experiments in second language learning. New York, NY: Academic Press.
Cull, W. L. (2000). Untangling the benefits of multiple study opportunities and repeated testing for cued recall. Applied Cognitive Psychology, 14, 215–235. doi:10.1002/(SICI)1099-0720(200005/06)14:33.0.CO;2-1
Cull, W. L., Shaughnessy, J. J., & Zechmeister, E. B. (1996). Expanding understanding of the expanding-pattern-of-retrieval mnemonic: Toward confidence in applicability. Journal of Experimental Psychology: Applied, 2, 365–378. doi:10.1037/1076-898X.2.4.365
De Groot, A. M. B. (2006). Effects of stimulus characteristics and background music on foreign language vocabulary learning and forgetting. Language Learning, 56, 463–506. doi:10.1111/j.1467-9922.2006.00374.x
De Groot, A. M. B., & van Hell, J. G. (2005). The learning of foreign language vocabulary. In J. F. Kroll & A. M. B. de Groot (Eds.), Handbook of bilingualism: Psycholinguistic approaches (pp. 9–29). New York, NY: Oxford University Press.
Deconinck, J., Boers, F., & Eyckmans, J. (2010). Helping learners engage with L2 words: The form-meaning fit. AILA Review, 23, 95–114. doi:10.1075/aila.23.06dec
DeKeyser, R. (2007). Situating the concept of practice. In R. DeKeyser (Ed.), Practice in a second language: Perspectives from applied linguistics and cognitive psychology (pp. 1–18). New York, NY: Cambridge University Press.
Delaney, P. F., Verkoeijen, P. P. J. L., & Spirgel, A. (2010). Spacing and testing effects: A deeply critical, lengthy, and at times discursive review of the literature. Psychology of Learning and Motivation, 53, 63–147. doi:10.1016/S0079-7421(10)53003-2
Dempster, F. N. (1988). The spacing effect: A case study in the failure to apply the results of psychological research. American Psychologist, 43, 627–634. doi:10.1037//0003-066X.43.8.627
Dempster, F. N. (1989). Spacing effects and their implications for theory and practice. Educational Psychology Review, 1, 309–330. doi:10.1007/BF01320097
Dempster, F. N. (1996). Distributing and managing the conditions of encoding and practice. In E. L. Bjork & R. A. Bjork (Eds.), Human memory (pp. 197–236). San Diego, CA: Academic Press.
Deno, S. L. (1968). Effects of words and pictures as stimuli in learning language equivalents. Journal of Educational Psychology, 59, 202–206. doi:10.1037/h0025772
Dobson, J. L. (2011). Effect of selected “desirable difficulty” learning strategies on the retention of physiology information. Advances in Physiology Education, 35, 378–383. doi:10.1152/advan.00039.2011
Dobson, J. L. (2012). Effect of uniform versus expanding retrieval practice on the recall of physiology information. Advances in Physiology Education, 36, 6–12. doi:10.1152/advan.00090.2011
Donovan, J. J., & Radosevich, D. J. (1999). A meta-analytic review of the distribution of practice effect: Now you see it, now you don’t. Journal of Applied Psychology, 84, 795–805. doi:10.1037/0021-9010.84.5.795
Duchastel, P. C. (1981). Retention of prose following testing with different types of tests. Contemporary Educational Psychology, 6, 217–226. doi:10.1016/0361-476X(81)90002-3
Elgort, I. (2011). Deliberate learning and vocabulary acquisition in a second language. Language Learning, 61, 367–413. doi:10.1111/j.1467-9922.2010.00613.x
Ellis, N. C. (1995). The psychology of foreign language vocabulary acquisition: Implications for CALL. Computer Assisted Language Learning, 8, 103–128. doi:10.1080/0958822940080202
Ellis, N. C., & Beaton, A. (1993). Psycholinguistic determinants of foreign-language vocabulary learning. Language Learning, 43, 559–617. doi:10.1111/j.1467-1770.1993.tb00627.x
Erten, I. H., & Tekin, M. (2008). Effects on vocabulary acquisition of presenting new words in semantic sets versus semantically unrelated sets. System, 36, 407–422. doi:10.1016/j.system.2008.02.005
eSpindle Learning. (2013). Answers to common questions about our vocabulary and spelling program. LearnThatWord. Retrieved March 27, 2013, from http://www.learnthat.org/pages/view/faq.html
Estes, W. K. (1960). Learning theory and the new “mental chemistry.” Psychological Review, 67, 207–223. doi:10.1037/h0041624
Field, A. (2009). Discovering statistics using SPSS (3rd ed.). London, UK: SAGE Publications.
Finkbeiner, M., & Nicol, J. (2003). Semantic category effects in second language word learning. Applied Psycholinguistics, 24, 369–383. doi:10.1017/S0142716403000195
Finley, J. R., Benjamin, A. S., Hays, M. J., Bjork, R. A., & Kornell, N. (2011). Benefits of accumulating versus diminishing cues in recall. Journal of Memory and Language, 64, 289–298. doi:10.1016/j.jml.2011.01.006
Fitzpatrick, T., Al-Qarni, I., & Meara, P. (2008). Intensive vocabulary learning: A case study. Language Learning Journal, 36, 239–248. doi:10.1080/09571730802390759
Folse, K. S. (2004). Vocabulary myths: Applying second language research to classroom teaching. Ann Arbor, MI: University of Michigan Press.
Foos, P. W., & Fisher, R. P. (1988). Using tests as learning opportunities. Journal of Educational Psychology, 80, 179–183. doi:10.1037/0022-0663.80.2.179
Fritz, C. O. (2010). Testing, generation, and spacing applied to education: Past, present and future. In A. S. Benjamin (Ed.), Successful remembering and successful forgetting: A festschrift in honor of Robert A. Bjork (pp. 199–216). New York, NY: Psychology Press.
Gathercole, S. E., Frankish, C. R., Pickering, S. J., & Peaker, S. (1999). Phonotactic influences on short-term memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 84–95. doi:10.1037/0278-7393.25.1.84
Gay, L. R. (1973). Temporal position of reviews and its effect on the retention of mathematical rules. Journal of Educational Psychology, 64, 171–182. doi:10.1037/h0034595
Gerbier, E., & Koenig, O. (2012). Influence of multiple-day temporal distribution of repetitions on memory: A comparison of uniform, expanding, and contracting schedules. Quarterly Journal of Experimental Psychology, 65, 514–525. doi:10.1080/17470218.2011.600806
Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1–67. doi:10.1037/0033-295X.91.1.1
Glanzer, M., & Cunitz, A. R. (1966). Two storage mechanisms in free recall. Journal of Verbal Learning and Verbal Behavior, 5, 351–360. doi:10.1016/S0022-5371(66)80044-0
Glenberg, A. M. (1976). Monotonic and nonmonotonic lag effects in paired-associate and recognition memory paradigms. Journal of Verbal Learning and Verbal Behavior, 15, 1–16. doi:10.1016/S0022-5371(76)90002-5
Glenberg, A. M., & Lehmann, T. S. (1980). Spacing repetitions over 1 week. Memory and Cognition, 8, 528–538. doi:10.3758/BF03213772
Glover, J. A. (1989). The “testing” phenomenon: Not gone but nearly forgotten. Journal of Educational Psychology, 81, 392–399. doi:10.1037/0022-0663.81.3.392
Godwin-Jones, R. (2008). Emerging technologies mobile-computing trends: Lighter, faster, smarter. Language Learning & Technology, 12(3), 3–9.
Godwin-Jones, R. (2010). From memory palaces to spacing algorithms: Approaches to second-language vocabulary learning. Language Learning & Technology, 14(2), 4–11.
Griffin, G. F. (1992). Aspects of the psychology of second language vocabulary list learning (Unpublished doctoral dissertation). University of Warwick, UK.
Griffin, G. F., & Harley, T. A. (1996). List learning of second language vocabulary. Applied Psycholinguistics, 17, 443–460. doi:10.1017/S0142716400008195
Hagemeier, N. E., & Mason, H. L. (2011). Student pharmacists’ perceptions of testing and study strategies. American Journal of Pharmaceutical Education, 75, 35. doi:10.5688/ajpe75235
Hartwig, M. K., & Dunlosky, J. (2011). Study strategies of college students: Are self-testing and scheduling related to achievement? Psychonomic Bulletin & Review, 19, 126–134. doi:10.3758/s13423-011-0181-y
Haynes, C. R. (1974). Delayed feedback and perseveration of interference. Psychological Reports, 35, 246. doi:10.2466/pr0.1974.35.1.246
Hays, M. J., Kornell, N., & Bjork, R. A. (2010). The costs and benefits of providing feedback during learning. Psychonomic Bulletin & Review, 17, 797–801. doi:10.3758/PBR.17.6.797
Higa, M. (1963). Interference effects of intralist word relationships in verbal learning. Journal of Verbal Learning and Verbal Behavior, 2, 170–175. doi:10.1016/S0022-5371(63)80082-1
Hulstijn, J. H. (1997). Mnemonic methods in foreign language vocabulary learning: Theoretical considerations and pedagogical implications. In Second language vocabulary acquisition: A rationale for pedagogy (pp. 203–224). Cambridge, UK: Cambridge University Press.
Hulstijn, J. H. (2001). Intentional and incidental second language vocabulary learning: A reappraisal of elaboration, rehearsal, and automaticity. In P. Robinson (Ed.), Cognition and second language instruction (pp. 258–286). Cambridge, UK: Cambridge University Press.
Hunt, A., & Beglar, D. (2005). A framework for developing EFL reading vocabulary. Reading in a Foreign Language, 17, 23–59.
Jalbert, A., Neath, I., Bireta, T. J., & Surprenant, A. M. (2011). When does length cause the word length effect? Journal of Experimental Psychology: Learning, Memory, and Cognition, 37, 338–353. doi:10.1037/a0021804
Janiszewski, C., Noel, H., & Sawyer, A. G. (2003). A meta-analysis of the spacing effect in verbal learning: Implications for research on advertising repetition and consumer memory. Journal of Consumer Research, 30, 138–149. doi:10.1086/374692
Jefferies, E., Lambon Ralph, M. A., & Baddeley, A. D. (2004). Automatic and controlled processing in sentence recall: The role of long-term and working memory. Journal of Memory and Language, 51, 623–643. doi:10.1016/j.jml.2004.07.005
Joseph, S., Watanabe, Y., Shiung, Y.-J., Choi, B., & Robbins, C. (2009). Key aspects of computer assisted vocabulary learning (CAVL): Combined effects of media, sequencing and task type. Research and Practice in Technology Enhanced Learning, 4, 1–36. doi:10.1142/S1793206809000672
Kang, S. H. K., Lindsey, R. V., Mozer, M. C., & Pashler, H. (2013). Retrieval practice over the long term: Expanding or equal-interval spacing? Manuscript submitted for publication.
Kang, S. H. K., McDermott, K. B., & Roediger, H. L. (2007). Test format and corrective feedback modify the effect of testing on long-term retention. European Journal of Cognitive Psychology, 19, 528–558. doi:10.1080/09541440601056620
Karpicke, J. D. (2009). Metacognitive control and strategy selection: Deciding to practice retrieval during learning. Journal of Experimental Psychology: General, 138, 469–486. doi:10.1037/a0017341
Karpicke, J. D., & Bauernschmidt, A. (2011). Spaced retrieval: Absolute spacing enhances learning regardless of relative spacing. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37, 1250–1257. doi:10.1037/a0023436
Karpicke, J. D., Butler, A. C., & Roediger, H. L. (2009). Metacognitive strategies in student learning: Do students practice retrieval when they study on their own? Memory, 17, 471–479. doi:10.1080/09658210802647009
Karpicke, J. D., & Roediger, H. L. (2007a). Expanding retrieval practice promotes short-term retention, but equally spaced retrieval enhances long-term retention. Journal of Experimental Psychology: Learning, Memory, and Cognition, 33, 704–719. doi:10.1037/0278-7393.33.4.704
Karpicke, J. D., & Roediger, H. L. (2007b). Repeated retrieval during learning is the key to long-term retention. Journal of Memory and Language, 57, 151–162. doi:10.1016/j.jml.2006.09.004
Karpicke, J. D., & Roediger, H. L. (2008). The critical importance of retrieval for learning. Science, 319, 966–968. doi:10.1126/science.1152408
Karpicke, J. D., & Roediger, H. L. (2010). Is expanding retrieval a superior method for learning text materials? Memory and Cognition, 38, 116–124. doi:10.3758/MC.38.1.116
Karpicke, J. D., Smith, M. A., & Grimaldi, P. J. (2009). Learning with flashcards: You’re probably doing it wrong. Poster presented at the Eighty-first Annual Meeting of the Midwest Psychological Association, Chicago, IL.
Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746–759. doi:10.1177/0013164496056005002
Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.
Kopstein, F. F., & Roshal, S. M. (1954). Learning foreign vocabulary from pictures vs. words. American Psychologist, 9, 407–408.
Kornell, N. (2009). Optimising learning using flashcards: Spacing is more effective than cramming. Applied Cognitive Psychology, 23, 1297–1317. doi:10.1002/acp.1537
Kornell, N., & Bjork, R. A. (2008). Optimising self-regulated study: The benefits - and costs - of dropping flashcards. Memory, 16, 125–136. doi:10.1080/09658210701763899
Krashen, S. (1989). We acquire vocabulary and spelling by reading: Additional evidence for the Input Hypothesis. The Modern Language Journal, 73, 440–464. doi:10.1111/j.1540-4781.1989.tb05325.x
Kulhavy, R. W., & Anderson, R. C. (1972). Delay-retention effect with multiple-choice tests. Journal of Educational Psychology, 63, 505–512. doi:10.1037/h0033243
Kulik, J. A., & Kulik, C.-L. C. (1988). Timing of feedback and verbal learning. Review of Educational Research, 58, 79–97. doi:10.3102/00346543058001079
Lado, R., Baldwin, B., & Lobo, F. (1967). Massive vocabulary expansion in a foreign language beyond the basic course: The effects of stimuli, timing and order of presentation. Washington, DC: U.S. Department of Health, Education, and Welfare.
Landauer, T. K., & Bjork, R. A. (1978). Optimum rehearsal patterns and name learning. In M. M. Gruneberg, P. E. Morris, & R. N. Sykes (Eds.), Practical aspects of memory (pp. 625–632). London, UK: Academic Press.
Laufer, B. (1997). What’s in a word that makes it hard or easy: Some intralexical factors that affect the learning of words. In N. Schmitt & M. McCarthy (Eds.), Vocabulary: Description, acquisition and pedagogy (pp. 140–155). New York, NY: Cambridge University Press.
Laufer, B. (2003). Vocabulary acquisition in a second language: Do learners really acquire most vocabulary by reading? Some empirical evidence. Canadian Modern Language Review, 59, 567–587. doi:10.3138/cmlr.59.4.567
Laufer, B. (2005). Focus on Form in second language vocabulary learning. EUROSLA Yearbook, 5, 223–250. doi:10.1075/eurosla.5.11lau
Laufer, B., Elder, C., Hill, K., & Congdon, P. (2004). Size and strength: Do we need both to measure vocabulary knowledge? Language Testing, 21, 202–226. doi:10.1191/0265532204lt277oa
Laufer, B., & Goldstein, Z. (2004). Testing vocabulary knowledge: Size, strength, and computer adaptiveness. Language Learning, 54, 399–436. doi:10.1111/j.0023-8333.2004.00260.x
Laufer, B., & Shmueli, K. (1997). Memorizing new words: Does teaching have anything to do with it? RELC Journal, 28, 89–108. doi:10.1177/003368829702800106
Levin, J. R., McCormick, C. B., Miller, G. E., Berry, J. K., & Pressley, M. (1982). Mnemonic versus nonmnemonic vocabulary-learning strategies for children. American Educational Research Journal, 19, 121–136. doi:10.1037/0022-0663.74.5.693
Logan, J. M., & Balota, D. A. (2008). Expanded vs. equal interval spaced retrieval practice: Exploring different schedules of spacing and retention interval in younger and older adults. Aging, Neuropsychology, and Cognition, 15, 257–280. doi:10.1080/13825580701322171
Long, M. H., & Richards, J. C. (2007). Series editors’ preface. In R. DeKeyser (Ed.), Practice in a second language: Perspectives from applied linguistics and cognitive psychology (p. xi). New York, NY: Cambridge University Press.
Maddox, G. B., Balota, D. A., Coane, J. H., & Duchek, J. M. (2011). The role of forgetting rate in producing a benefit of expanded over equal spaced retrieval in young and older adults. Psychology and Aging, 26, 661–670. doi:10.1037/a0022942
Marsh, E. J., Agarwal, P. K., & Roediger, H. L. (2009). Memorial consequences of answering SAT II questions. Journal of Experimental Psychology: Applied, 15, 1–11. doi:10.1037/a0014721
Marsh, E. J., Roediger, H. L., Bjork, R. A., & Bjork, E. L. (2007). The memorial consequences of multiple-choice testing. Psychonomic Bulletin and Review, 14, 194–199. doi:10.3758/MC.38.4.407
McDaniel, M. A., Anderson, J. L., Derbish, M. H., & Morrisette, N. (2007). Testing the testing effect in the classroom. European Journal of Cognitive Psychology, 19, 494–513. doi:10.1080/09541440701326154
McDaniel, M. A., & Pressley, M. (1984). Putting the keyword method in context. Journal of Educational Psychology, 76, 598–609. doi:10.1037/0022-0663.76.4.598
McGeoch, G. O. (1931). The intelligence quotient as a factor in the whole-part problem. Journal of Experimental Psychology, 14, 333–358. doi:10.1037/h0075956
McNamara, D. S., & Healy, A. F. (1995). A generation advantage for multiplication skill training and nonword vocabulary acquisition. In A. F. Healy & J. L. E. Bourne (Eds.), Learning and memory of knowledge and skills: Durability and specificity (pp. 132–169). Thousand Oaks, CA: Sage.
Medler, D. A., & Binder, J. R. (2005). MCWord: An online orthographic database of the English language. Retrieved July 6, 2010, from http://www.neuro.mcw.edu/mcword/
Metcalfe, J., & Kornell, N. (2007). Principles of cognitive science in education: The effects of generation, errors, and feedback. Psychonomic Bulletin and Review, 14, 225–229.
Metcalfe, J., Kornell, N., & Finn, B. (2009). Delayed versus immediate feedback in children’s and adults’ vocabulary learning. Memory & Cognition, 37, 1077–1087. doi:10.3758/MC.37.8.1077
Miller, G. A. (1956). The magic number 7, plus or minus 2: Some limits on our capacity for processing information. Psychological Review, 63, 81–97. doi:10.1037/h0043158
Mishima, T. (1967). An experiment comparing five modalities of conveying meaning for the teaching of foreign language vocabulary. Dissertation Abstracts International, 27, 3030–3031A.
Modigliani, V. (1976). Effects on a later recall by delaying initial recall. Journal of Experimental Psychology: Human Learning and Memory, 2, 609–622. doi:10.1037/0278-7393.2.5.609
Mondria, J.-A. (2003). The effects of inferring, verifying, and memorizing on the retention of L2 word meanings: An experimental comparison of the “Meaning-Inferred Method” and the “Meaning-Given Method.” Studies in Second Language Acquisition, 25, 473–499. doi:10.1017/S0272263103000202
Mondria, J.-A., & Mondria-de Vries, S. (1994). Efficiently memorizing words with the help of word cards and “hand computer”: Theory and applications. System, 22, 47–57. doi:10.1016/0346-251X(94)90039-6
Mondria, J.-A., & Wiersma, B. (2004). Receptive, productive and receptive + productive L2 vocabulary learning: What difference does it make? In B. Laufer (Ed.), Vocabulary in a second language: Selection, acquisition, and testing (pp. 79–100). Amsterdam, Netherlands: John Benjamins.
More, A. J. (1969). Delay of feedback and the acquisition and retention of verbal materials in the classroom. Journal of Educational Psychology, 60, 339–342. doi:10.1037/h0028318
Morris, C. D., Bransford, J. D., & Franks, J. J. (1977). Levels of processing versus transfer appropriate processing. Journal of Verbal Learning and Verbal Behavior, 16, 519–533. doi:10.1016/S0022-5371(77)80016-9
Mory, E. H. (2004). Feedback research revisited. In D. H. Jonassen (Ed.), Handbook of research on educational communications and technology (2nd ed., pp. 745–783). Mahwah, NJ: Lawrence Erlbaum.
Nagy, W. E., Herman, P. A., & Anderson, R. C. (1985). Learning words from context. Reading Research Quarterly, 20, 233–253. doi:10.2307/747758
Nakata, T. (2008). English vocabulary learning with word lists, word cards and computers: Implications from cognitive psychology research for optimal spaced learning. ReCALL, 20, 3–20. doi:10.1017/S0958344008000219
Nakata, T. (2011). Computer-assisted second language vocabulary learning in a paired-associate paradigm: A critical investigation of flashcard software. Computer Assisted Language Learning, 24, 17–38. doi:10.1080/09588221.2010.520675
Nakata, T. (2012). Web-based lexical resources. In C. Chapelle (Ed.), Encyclopedia of Applied Linguistics (pp. 6166–6177). Oxford, UK: Wiley-Blackwell.
Nation, I. S. P. (1980). Strategies for receptive vocabulary learning. RELC Guidelines, 3, 18–23.
Nation, I. S. P. (1982). Beginning to learn foreign vocabulary: A review of the research. RELC Journal, 13, 14–36. doi:10.1177/003368828201300102
Nation, I. S. P. (2001). Learning vocabulary in another language. Cambridge, UK: Cambridge University Press.
Nation, I. S. P. (2004). A study of the most frequent word families in the British National Corpus. In P. Bogaards & B. Laufer (Eds.), Vocabulary in a second language: Selection, acquisition, and testing (pp. 3–13). Amsterdam, Netherlands: John Benjamins.
Nation, I. S. P. (2006). How large a vocabulary is needed for reading and listening? The Canadian Modern Language Review, 63, 59–82. doi:10.3138/cmlr.63.1.59
Nation, I. S. P. (2011). Research into practice: Vocabulary. Language Teaching, 44, 529–539. doi:10.1017/S0261444811000267
Nation, I. S. P., & Beglar, D. (2007). A vocabulary size test. The Language Teacher, 31(7), 9–13.
Nation, I. S. P., & Webb, S. A. (2011). Researching and analyzing vocabulary. Boston, MA: Heinle, Cengage Learning.
Nelson, T. O., & Dunlosky, J. (1994). Norms of paired-associate recall during multitrial learning of Swahili-English translation equivalents. Memory, 2, 325–335. doi:10.1080/09658219408258951
Norris, J., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and quantitative meta-analysis. Language Learning, 50, 417–528. doi:10.1111/0023-8333.00136
Nunan, D. (1992). Research methods in language learning. Cambridge, UK: Cambridge University Press.
O’Neill, M., Rasor, R. A., & Bartz, W. R. (1976). Immediate retention of objective test answers as a function of feedback complexity. The Journal of Educational Research, 70, 72–75.
Paige, D. D. (1966). Learning while testing. The Journal of Educational Research, 59, 276–277.
Pashler, H., Rohrer, D., Cepeda, N. J., & Carpenter, S. K. (2007). Enhancing learning and retarding forgetting: Choices and consequences. Psychonomic Bulletin and Review, 14, 187–193. doi:10.3758/BF03194050
Pashler, H., Zarow, G., & Triplett, B. (2003). Is temporal spacing of tests helpful even when it inflates error rates? Journal of Experimental Psychology: Learning, Memory, and Cognition, 29, 1051–1057. doi:10.1037/0278-7393.29.6.1051
Pavlik, P. I., & Anderson, J. R. (2008). Using a model to compute the optimal schedule of practice. Journal of Experimental Psychology: Applied, 14, 101–117. doi:10.1037/1076-898X.14.2.101
Phye, G., & Andre, T. (1989). Delayed retention effect: Attention, perseveration, or both? Contemporary Educational Psychology, 14, 173–185. doi:10.1016/0361-476X(89)90035-0
Phye, G., & Baller, W. (1970). Verbal retention as a function of the informativeness and delay of informative feedback: A replication. Journal of Educational Psychology, 61, 380–381. doi:10.1037/h0029798
Phye, G., Gugliemella, J., & Sola, J. (1976). Effects of delayed retention on multiple-choice test performance. Contemporary Educational Psychology, 1, 26–36. doi:10.1016/0361-476X(76)90004-7
Pickering, M. (1982). Context-free and context-dependent vocabulary learning: An experiment. System, 10, 79–83. doi:10.1016/0346-251X(81)90070-1
Pimsleur, P. (1967). A memory schedule. Modern Language Journal, 51, 73–75. doi:10.1111/j.1540-4781.1967.tb06700.x
Prince, P. (1996). Second language vocabulary learning: The role of context versus translations as a function of proficiency. Modern Language Journal, 80, 478–493. doi:10.1111/j.1540-4781.1996.tb05468.x
Pyc, M. A., & Rawson, K. A. (2007). Examining the efficiency of schedules of distributed retrieval practice. Memory & Cognition, 35, 1917–1927. doi:10.3758/BF03192925
Pyc, M. A., & Rawson, K. A. (2009). Testing the retrieval effort hypothesis: Does greater difficulty correctly recalling information lead to higher levels of memory? Journal of Memory and Language, 60, 437–447. doi:10.1016/j.jml.2009.01.004
Pyc, M. A., & Rawson, K. A. (2011). Costs and benefits of dropout schedules of test-restudy practice: Implications for student learning. Applied Cognitive Psychology, 25, 87–95. doi:10.1002/acp.1646
Quizlet LLC. (2013). Our Mission. Quizlet. Retrieved March 8, 2013, from http://quizlet.com/mission
Rädle, P. (2008). VTrain (Vocabulary Trainer) --- Awards. VTrain.net. Retrieved April 4, 2013, from http://www.vtrain.net/aw.htm
Rawson, K. A., & Dunlosky, J. (2011). Optimizing schedules of retrieval practice for durable and efficient learning: How much is enough? Journal of Experimental Psychology: General, 140, 283–302. doi:10.1037/a0023956
Rock, I. (1957). The role of repetition in associative learning. The American Journal of Psychology, 70, 186–193.
Rodriguez, M., & Sadoski, M. (2000). Effects of rote, context, keyword, and context/keyword methods on retention of vocabulary in EFL classrooms. Language Learning, 50, 385–412. doi:10.1111/0023-8333.00121
Roediger, H. L., Agarwal, P. K., Kang, S. H. K., & Marsh, E. J. (2010). Benefits of testing memory: Best practices and boundary conditions. In G. M. Davies & D. B. Wright (Eds.), New frontiers in applied memory (pp. 13–49). Brighton, UK: Psychology Press.
Roediger, H. L., & Karpicke, J. D. (2010). Intricacies of spaced retrieval: A resolution. In A. S. Benjamin (Ed.), Successful remembering and successful forgetting: A festschrift in honor of Robert A. Bjork (pp. 23–47). New York, NY: Psychology Press.
Roediger, H. L., Putnam, A. L., & Smith, M. A. (2011). Ten benefits of testing and their applications to educational practice. In J. Mestre & B. Ross (Eds.), Psychology of learning and motivation: Cognition in education (pp. 1–36). Oxford, UK: Elsevier.
Rogers, J., Webb, S. A., & Nakata, T. (2013). Can explicitly teaching cognates increase the rate of second language vocabulary learning? Manuscript submitted for publication.
Rohrer, D., & Pashler, H. (2007). Increasing retention without increasing study time. Current Directions in Psychological Science, 16, 183–186. doi:10.1111/j.1467-8721.2007.00500.x
Rohrer, D., Taylor, K., Pashler, H., Wixted, J. T., & Cepeda, N. J. (2005). The effect of overlearning on long-term retention. Applied Cognitive Psychology, 19, 361–374. doi:10.1002/acp.1083
Roodenrys, S., & Hinton, M. (2002). Sublexical or lexical effects on serial recall of nonwords? Journal of Experimental Psychology: Learning, Memory, and Cognition, 28, 29–33. doi:10.1037/0278-7393.28.1.29
Royer, J. M. (1973). Memory effects for test-like-events during acquisition of foreign language vocabulary. Psychological Reports, 32, 195–198. doi:10.2466/pr0.1973.32.1.195
Salisbury, D. F., & Klein, J. D. (1988). A comparison of a microcomputer progressive state drill and flashcards for learning paired associates. Journal of Computer-Based Instruction, 15, 136–143.
Sassenrath, J. M., & Yonge, G. D. (1969). Effects of delayed information feedback and feedback cues in learning on delayed retention. Journal of Educational Psychology, 60, 174–177. doi:10.1037/h0027618
Schmidt, R. A., & Bjork, R. A. (1992). New conceptualizations of practice: Common principles in three paradigms suggest new concepts for training. Psychological Science, 3, 207–217. doi:10.1111/j.1467-9280.1992.tb00029.x
Schmitt, N. (1997). Vocabulary learning strategies. In N. Schmitt & M. McCarthy (Eds.), Vocabulary: Description, acquisition and pedagogy (pp. 199–227). Cambridge, UK: Cambridge University Press.
Schmitt, N. (1998). Tracking the incremental acquisition of second language vocabulary: A longitudinal study. Language Learning, 48, 281–317. doi:10.1111/1467-9922.00042
Schmitt, N. (2000). Vocabulary in language teaching. Cambridge, UK: Cambridge University Press.
Schmitt, N. (2007). Current trends in vocabulary learning and teaching. In J. Cummins & C. Davison (Eds.), The international handbook of English language teaching (pp. 827–842). Norwell, MA: Springer.
Schmitt, N. (2008). Review article: Instructed second language vocabulary learning. Language Teaching Research, 12, 329–363. doi:10.1177/1362168808089921
Schmitt, N., & Schmitt, D. (1995). Vocabulary notebooks: Theoretical underpinnings and practical suggestions. ELT Journal, 49, 133–143. doi:10.1093/elt/49.2.133
Schneider, V. I., Healy, A. F., & Bourne, L. E. (2002). What is learned under difficult conditions is hard to forget: Contextual interference effects in foreign vocabulary acquisition, retention, and transfer. Journal of Memory and Language, 46, 419–440. doi:10.1006/jmla.2001.2813
Schuetze, U., & Weimer-Stuckmann, G. (2010). Virtual Vocabulary: Research and learning in lexical processing. CALICO Journal, 27, 517–528.
Schuetze, U., & Weimer-Stuckmann, G. (2011). Retention in SLA lexical processing. CALICO Journal, 28, 460–472.
Seibert, L. C. (1927). An experiment in learning French vocabulary. Journal of Educational Psychology, 18, 294–309. doi:10.1037/h0074206
Seibert, L. C. (1930). An experiment on the relative efficiency of studying French vocabulary in associated pairs versus studying French vocabulary in context. Journal of Educational Psychology, 21, 297–314. doi:10.1037/h0070517
Seibert, L. C. (1932). A series of experiments on the learning of French vocabulary. Baltimore, MD: The Johns Hopkins Press.
Shaughnessy, J. J., & Zechmeister, E. B. (1992). Memory-monitoring accuracy as influenced by the distribution of retrieval practice. Bulletin of the Psychonomic Society, 30, 125–128.
Siegel, M. A., & Misselt, A. L. (1984). Adaptive feedback and review paradigm for computer-based drills. Journal of Educational Psychology, 76, 310–317. doi:10.1037/0022-0663.76.2.310
Skinner, B. F. (1954). The science of learning and the art of teaching. Harvard Educational Review, 24, 86–97.
Slamecka, N. J., & Graf, P. (1978). The generation effect: Delineation of a phenomenon. Journal of Experimental Psychology: Human Learning and Memory, 4, 592–604. doi:10.1037/0278-7393.4.6.592
Sobel, H. S., Cepeda, N. J., & Kapler, I. V. (2011). Spacing effects in real-world classroom vocabulary learning. Applied Cognitive Psychology, 25, 763–767. doi:10.1002/acp.1747
Son, L. K., & Simon, D. A. (2012). Distributed learning: Data, metacognition, and educational implications. Educational Psychology Review, 24, 379–399. doi:10.1007/s10648-012-9206-y
Steinel, M. P., Hulstijn, J. H., & Steinel, W. (2007). Second language idiom learning in a paired-associate paradigm: Effects of direction of learning, direction of testing, idiom imageability, and idiom transparency. Studies in Second Language Acquisition, 29, 449–484. doi:10.1017/S0272263107070271
Stoddard, G. D. (1929). An experiment in verbal learning. Journal of Educational Psychology, 20, 452–457. doi:10.1037/h0073293
Storm, B. C., Bjork, R. A., & Storm, J. C. (2010). Optimizing retrieval as a learning event: When and why expanding retrieval practice enhances long-term retention. Memory & Cognition, 38, 244–253. doi:10.3758/MC.38.2.244
Sturges, P. T. (1969). Verbal retention as a function of the informativeness and delay of informative feedback. Journal of Educational Psychology, 60, 11–14. doi:10.1037/h0026638
Sturges, P. T. (1972). Information delay and retention: Effect of information in feedback and tests. Journal of Educational Psychology, 63, 32–43. doi:10.1037/h0032158
Sturges, P. T., Sarafino, E. P., & Donaldson, P. L. (1968). Delay-retention effect and informative feedback. Journal of Experimental Psychology, 78, 357–358. doi:10.1037/h0026377
Surber, J. R., & Anderson, R. C. (1975). Delay-retention effect in natural classroom settings. Journal of Educational Psychology, 67, 170–173. doi:10.1037/h0077003
Swindell, L. K., & Walls, W. F. (1993). Response confidence and the delay retention effect. Contemporary Educational Psychology, 18, 363–375. doi:10.1006/ceps.1993.1026
Tamaki, K. (2007). Incorporating Nintendo DS into the curriculum leads to marked improvement in English vocabulary: Yawata City Board of Education, Kyoto. Mainichi Newspaper, p. 28. The Mainichi Newspapers.
Thomas, M. H., & Dieter, J. N. (1987). The positive effects of writing practice on integration of foreign words in memory. Journal of Educational Psychology, 79, 249–253. doi:10.1037/0022-0663.79.3.249
Thorndike, E. L. (1908). Memory for paired associates. Psychological Review, 15, 122–138. doi:10.1037/h0073570
Tinkham, T. (1989). Rote learning, attitudes, and abilities: A comparison of Japanese and American students. TESOL Quarterly, 23, 695–698. doi:10.2307/3587547
Tinkham, T. (1993). The effect of semantic clustering on the learning of second language vocabulary. System, 21, 371–380. doi:10.1191/026765897672376469
Tinkham, T. (1997). The effects of semantic and thematic clustering on the learning of second language vocabulary. Second Language Research, 13, 138–163. doi:10.1191/026765897672376469
Tsai, L. S. (1927). The relation of retention to the distribution of relearning. Journal of Experimental Psychology, 10, 30–39. doi:10.1037/h0071614
Tulving, E. (1967). The effects of presentation and recall of material in free-recall learning. Journal of Verbal Learning and Verbal Behavior, 6, 175–184. doi:10.1016/S0022-5371(67)80092-6
Underwood, B. J., & Keppel, G. (1962). One-trial learning? Journal of Verbal Learning & Verbal Behavior, 1, 1–13. doi:10.1016/S0022-5371(62)80013-9
Van Bussel, F. J. J. (1994). Design rules for computer-aided learning of vocabulary items in a second language. Computers in Human Behavior, 10, 63–76. doi:10.1016/0747-5632(94)90029-9
Wang, A., & Thomas, M. (1995). Effect of keywords on long-term retention: Help or hindrance? Journal of Educational Psychology, 87, 468–475. doi:10.1037/0022-0663.87.3.468
Waring, R. (1997a). A study of receptive and productive learning from word cards. Studies in Foreign Languages and Literature (Notre Dame Seishin University, Okayama), 21, 94–114.
Waring, R. (1997b). The negative effects of learning words in semantic sets: A replication. System, 25, 261–274. doi:10.1016/S0346-251X(97)00013-4
Waring, R. (2004). In defence of learning words in word pairs: But only when doing it the “right” way! Retrieved April 30, 2013, from http://www.robwaring.org/vocab/principles/systematic_learning.htm
Waring, R., & Takaki, M. (2003). At what rate do learners learn and retain new vocabulary from a graded reader? Reading in a Foreign Language, 15, 130–163.
Watkins, M. J. (1972). Locus of the modality effect in free recall. Journal of Verbal Learning and Verbal Behavior, 11, 644–648. doi:10.1016/S0022-5371(72)80048-3
Webb, J. M., Stock, W. A., & McCarthy, M. T. (1994). The effects of feedback timing on learning facts: The role of response confidence. Contemporary Educational Psychology, 19, 251–265. doi:10.1006/ceps.1994.1020
Webb, S. A. (2002). Investigating the effects of learning tasks on vocabulary knowledge (Unpublished doctoral dissertation). Victoria University of Wellington, New Zealand.
Webb, S. A. (2005). Receptive and productive vocabulary learning: The effects of reading and writing on word knowledge. Studies in Second Language Acquisition, 27, 33–52. doi:10.1017/S0272263105050023
Webb, S. A. (2007a). Learning word pairs and glossed sentences: The effects of a single context on vocabulary knowledge. Language Teaching Research, 11, 63–81. doi:10.1177/1362168806072463
Webb, S. A. (2007b). The effects of repetition on vocabulary knowledge. Applied Linguistics, 28, 46–65. doi:10.1093/applin/aml048
Webb, S. A. (2008). Receptive and productive vocabulary sizes of L2 learners. Studies in Second Language Acquisition, 30, 79–95. doi:10.1017/S0272263108080042
Webb, S. A. (2009a). The effects of pre-learning vocabulary on reading comprehension and writing. Canadian Modern Language Review, 65, 441–470. doi:10.1353/cml.0.0046
Webb, S. A. (2009b). The effects of receptive and productive learning of word pairs on vocabulary knowledge. RELC Journal, 40, 360–376. doi:10.1177/0033688209343854
Webb, S. A. (2012). Depth of vocabulary knowledge. In C. Chapelle (Ed.), Encyclopedia of Applied Linguistics (pp. 1656–1663). Oxford, UK: Wiley-Blackwell.
Webb, W. B. (1962). The effects of prolonged learning on learning. Journal of Verbal Learning and Verbal Behavior, 1, 173–182. doi:10.1016/S0022-5371(62)80026-7
Webber, N. E. (1978). Pictures and words as stimuli in learning foreign language responses. The Journal of Psychology, 98, 57–63. doi:10.1080/00223980.1978.9915946
Wissman, K. T., Rawson, K. A., & Pyc, M. A. (2012). How and when do students use flashcards? Memory, 20, 568–579. doi:10.1080/09658211.2012.687052
Woodworth, R. S., & Schlosberg, H. (1954). Experimental Psychology (3rd ed.). London, UK: Methuen & Co.
Zaromb, F. M., & Roediger, H. L. (2010). The testing effect in free recall is associated with enhanced organizational processes. Memory & Cognition, 38, 995–1008. doi:10.3758/MC.38.8.995