Abstract
Taking a memory test after an initial study phase produces better long-term retention than restudying the items, a phenomenon known as the testing effect. We propose that this effect emerges because testing strengthens semantic features of items’ memory traces, whereas restudying strengthens surface features of items’ memory traces. This novel account predicts that a testing effect should be observed even after a short retention interval when a language switch occurs between the learning phase and the final test phase. We assessed this prediction with Dutch-English bilinguals who learned Dutch Deese-Roediger-McDermott word lists through restudying or through testing (retrieval practice). Five minutes after this learning phase, they took a recognition test in Dutch (within-language condition) or in English (across-language condition). We observed a testing effect in the across-language condition, but not in the within-language condition. These findings corroborate our novel account of the testing effect.
In recent years, there has been an explosion of research on the beneficial effect of retrieval practice on memory. A robust finding that has emerged from a large number of studies is that taking one or more intervening tests after an initial encoding (study) episode produces better retention of the to-be-remembered material than does restudying the same material for an equivalent amount of time. This phenomenon is known as the testing effect (for recent reviews, see Delaney, Verkoeijen, & Spirgel, 2010; Roediger & Butler, 2011; and Roediger & Karpicke, 2006a).
A typical experiment on this effect compares retention of restudied information and tested information at multiple retention intervals (RIs). For short RIs of several minutes, these experiments usually show either that restudying has a memory advantage over testing or that restudied and previously tested items are remembered equally well. However, after an RI of multiple days, previously tested items are remembered better than restudied items. Furthermore, a significant interaction between RI and learning procedure (restudy vs. testing) emerges because forgetting is slower for tested than for restudied items. This interaction has been replicated many times with different stimulus materials and with a variety of test types (e.g., Carpenter & Pashler, 2007; Coppens, Verkoeijen, & Rikers, 2011; Hogan & Kintsch, 1971; Roediger & Karpicke, 2006b; Thompson, Wenger, & Bartling, 1978; Toppino & Cohen, 2009; Wheeler, Ewers, & Buonanno, 2003).
Although the testing effect has been replicated repeatedly, the majority of the studies in the literature have had an empirical rather than a theoretical focus. Only in the past few years have researchers started to systematically investigate the mechanism (or mechanisms) underlying the testing effect (e.g., Carpenter, 2009; Carpenter, 2011; Pyc & Rawson, 2010). However, despite these efforts, researchers still know little about why testing (retrieval practice) enhances retention. To contribute to the development of a theoretical explanation of the testing effect, we (Bouwmeester & Verkoeijen, 2011) proposed that testing and restudying are operations that tap into different types of memory traces. Our account is rooted in fuzzy-trace theory (see Brainerd & Reyna, 2004; Brainerd, Reyna, & Ceci, 2008; Reyna & Brainerd, 1995), but it also fits well within other memory models. The goal of the present study was to assess an important prediction of our novel account of the testing effect.
In memory models such as Search of Associative Memory (SAM; Raaijmakers & Shiffrin, 1981), Retrieving Effectively From Memory (REM; Shiffrin, 2003), Processing Implicit and Explicit Representations (PIER; Nelson, Schreiber, & McEvoy, 1992), and fuzzy-trace theory, episodic memory traces are represented as collections of both surface and semantic features. Consider a participant who is presented with the word tree during a verbal learning task. The surface features of this word are its visual (orthographic) and phonological characteristics. By contrast, its semantic features are representations of its meaning. These semantic features include the concept “tree,” to which the word tree refers, but also the concepts that are semantically related to the concept “tree” (see Griffiths, Steyvers, & Tenenbaum, 2007, for an overview of different types of semantic representations). We propose that retrieval practice strengthens semantic features of memory traces because participants use semantic cues, such as word meanings or semantic associates, to recover information from memory. By contrast, participants receive more exposure to the surface features of stimuli during restudy, and as a result, restudy strengthens surface features of memory traces more than retrieval practice does.
Our account predicts that testing will produce better memory performance than restudying whenever participants have to rely primarily on semantic cues during a final test. This occurs when the final test is administered after a multiday RI. Indeed, a typical finding from the testing literature is that the testing effect emerges only after a long delay between the learning phase and the final test. However, our account predicts that under certain conditions, a testing effect will also be found after a short RI of several minutes. Specifically, a short-term testing effect should emerge if the final test prevents participants from using surface retrieval cues, and instead requires them to rely on semantic retrieval cues. We conducted the present study to test this prediction.
In our study, Dutch-English bilingual adults learned a number of Dutch Deese-Roediger-McDermott (DRM; Deese, 1959; Roediger & McDermott, 1995) lists through restudying or through retrieval practice. A DRM list contains target words that are all semantically related to a particular nonpresented distractor word (lure). Subsequently, after a 5-min RI, half of the participants received a final yes/no recognition test in English (across-language condition), whereas the other half of the participants received the test in Dutch (within-language condition). In the across-language condition, the final test provided only partial information about the targets, because participants were cued with word meanings (note that we assume Dutch words and their English translations have a common semantic representation; e.g., Sahlin, Harding, & Seamon, 2005; Zeelenberg & Pecher, 2003) but not with the surface features of the studied words (cf. Howe, Gagnon, & Thouas, 2008). By contrast, in the within-language condition, the final test cued participants with both surface information (the visual presentation of the target words) and semantic information.
We expected to find a testing effect in the across-language condition, because participants in this condition had to rely on semantic cues, and according to our account, retrieval practice strengthens semantic features of memory traces more than restudying does. In the within-language condition, we expected to obtain one of the results commonly obtained in studies on the testing effect, namely, either an advantage for restudying or no difference between restudying and testing. Furthermore, if semantic cues drive the recognition of items learned through testing, we would find the same final-test recognition performance for these items in the across-language condition and in the within-language condition. And if recognition of restudied items depends more strongly on surface cues, performance on restudied items would be worse in the across-language condition than in the within-language condition.
Finally, according to our account of the testing effect, retrieval practice should also strengthen concepts semantically related to the studied items more than restudying does. Hence, compared with restudy, retrieval practice should yield a higher false alarm rate to related distractors on the final recognition test. However, we expected that it might be difficult to detect this predicted difference because with adult participants, DRM lists tend to produce very high levels of false alarms to related distractors.
Method
Participants and design
Participants were 64 individually tested Dutch psychology undergraduates at the Erasmus University Rotterdam. They took part in this experiment to fulfill a course requirement. These participants can be considered bilinguals with respect to the words presented at the final test (see Zeelenberg & Pecher, 2003, for relevant arguments). We used a 2 (learning procedure: restudy vs. testing) × 2 (final test: across language vs. within language) mixed design with repeated measures on the first factor. Participants were randomly assigned to the final-test condition.
Materials
For the learning phase of this study, we selected 12 DRM study lists from Zeelenberg and Pecher’s (2002) Dutch translations of Stadler, Roediger, and McDermott’s (1999) word lists. Each list comprised eight words, each of which had a strong backward association with the same semantically related distractor. The 12 lists were randomly split into two sets of 6 lists. We counterbalanced order of learning procedure (i.e., restudy first vs. testing first) and presentation order of the two sets of lists (Set 1 first vs. Set 2 first), so that there were four study sequences for both the across-language condition and the within-language condition.
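The counterbalancing scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' materials: the string labels and the function names `study_sequences` and `block_plan` are hypothetical, and nothing here says how participants were distributed over the sequences.

```python
from itertools import product

# Hypothetical labels; the source names the factors but not these identifiers.
PROCEDURE_ORDERS = ("restudy_first", "testing_first")
SET_ORDERS = ("set1_first", "set2_first")
FINAL_TESTS = ("across_language", "within_language")


def study_sequences():
    """Cross the two order factors to obtain the four study sequences."""
    return list(product(PROCEDURE_ORDERS, SET_ORDERS))


def block_plan(procedure_order, set_order):
    """Expand one sequence into two blocks of (learning procedure, list set)."""
    procedures = ["restudy", "testing"]
    if procedure_order == "testing_first":
        procedures.reverse()
    sets = ["Set 1", "Set 2"]
    if set_order == "set2_first":
        sets.reverse()
    return list(zip(procedures, sets))


if __name__ == "__main__":
    # The same four sequences are used in both final-test conditions.
    for final_test in FINAL_TESTS:
        for sequence in study_sequences():
            print(final_test, sequence, block_plan(*sequence))
```

Crossing the two order factors yields the four sequences, and each sequence fixes which set of six lists is restudied and which is tested.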
For the final recognition test, we created a list of 72 words: 36 target words (3 words from each list), the 12 semantically related distractors, and 24 unrelated distractors. In the across-language condition, the final test was administered in English, whereas in the within-language condition, the final test was administered in Dutch. We made sure that the surface features of the English test items and their Dutch counterparts did not overlap.
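As a quick arithmetic check on the composition of the final test, the counts given above can be recomputed. The sketch below is illustrative; the function name `final_test_composition` and the dictionary keys are hypothetical, and the one-related-distractor-per-list rule follows from the DRM list structure described in the text.

```python
def final_test_composition(n_lists=12, targets_per_list=3, n_unrelated=24):
    """Recompute the item counts of the final recognition test."""
    targets = n_lists * targets_per_list   # 3 studied words from each list
    related = n_lists                      # one related distractor per list
    return {
        "targets": targets,
        "related_distractors": related,
        "unrelated_distractors": n_unrelated,
        "total": targets + related + n_unrelated,
    }


if __name__ == "__main__":
    print(final_test_composition())
    # Per learning procedure: 6 lists were restudied and 6 were tested.
    print(final_test_composition(n_lists=6, n_unrelated=0))
```

With the default arguments the total is 72 items, matching the text. Restricting the count to the six lists of one learning procedure gives 18 targets and 6 related distractors per procedure.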
Procedure
At the start of the experiment, participants were informed that they would be presented with word lists of eight words each, and that they should try to remember as many words as possible for an upcoming, unspecified memory test. The learning phase for each list began with an initial study phase in which the words were presented in the center of a computer screen at a rate of 4 s per word, with a 1-s interstimulus interval. Participants were instructed to type each word and to memorize it. After this initial study phase, either the list was presented again using the same procedure (restudy) or a free-recall test was administered (testing). During the free-recall test, participants were asked to type as many words as they could remember from the previously studied list. Participants were given 4 s to type each word, and there was a 1-s interval between each of the eight response opportunities. Participants first completed either a block of six restudy lists or a block of six tested lists. After a short break, they completed a block in which they learned the other six lists using the other learning procedure.
Following the second block, participants engaged in a 5-min distractor task in which they counted backward by 3s from a given number. After the distractor task, the final yes/no recognition task was administered either in Dutch or in English. Words were presented one by one on the computer screen, in random order, and participants had to indicate for each word whether it had been presented during the learning phase (i.e., an “old” judgment) or was new. The final test was self-paced, and a new test item appeared after a participant made a response.
Results and Discussion
An alpha level of .05 was used for all statistical tests. The reported t tests were two-tailed.
Test performance during the learning phase
The mean proportion of correctly retrieved words from the tested lists during the learning phase did not differ between the across-language condition (M = .80, SD = .13) and the within-language condition (M = .83, SD = .10), t(62) = 0.83, p = .41, d = 0.18.
Final-test performance
For each participant, we calculated the proportion of unrelated distractors correctly classified as “new,” the proportion of targets correctly classified as “old,” and the proportion of related distractors incorrectly classified as “old” (i.e., the proportion of false memories). An independent-samples t test revealed that the mean proportion of correctly classified unrelated distractors was comparable for the across-language condition (M = .91, SD = .10) and the within-language condition (M = .94, SD = .11), t(62) = 1.22, p = .22, d = 0.32. This finding suggests that participants in the two conditions employed a similar response criterion.
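The three per-participant scores defined above could be computed along the following lines. This is a minimal sketch with an assumed data layout: the trial dictionaries, the keys `item_type` and `response`, and the function names are hypothetical, and the toy responses are invented purely to exercise the code (they are not data from the study).

```python
def proportion(trials, item_type, response):
    """Proportion of trials of one item type that received a given response."""
    relevant = [t for t in trials if t["item_type"] == item_type]
    if not relevant:
        raise ValueError(f"no trials of type {item_type!r}")
    return sum(t["response"] == response for t in relevant) / len(relevant)


def score_participant(trials):
    """Compute the three recognition scores for one participant."""
    return {
        # Unrelated distractors correctly classified as "new"
        "unrelated_correct_new": proportion(trials, "unrelated", "new"),
        # Targets correctly classified as "old"
        "target_hits": proportion(trials, "target", "old"),
        # Related distractors incorrectly classified as "old" (false memories)
        "false_memories": proportion(trials, "related", "old"),
    }


if __name__ == "__main__":
    # Invented toy responses for illustration only.
    toy_trials = (
        [{"item_type": "target", "response": r} for r in ("old", "old", "old", "new")]
        + [{"item_type": "related", "response": r} for r in ("old", "new")]
        + [{"item_type": "unrelated", "response": r} for r in ("new", "new", "new", "old")]
    )
    print(score_participant(toy_trials))
```

For the analyses of targets and related distractors, the same proportions would be computed separately for restudied and tested lists, for example by filtering the trials on learning procedure before scoring.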
A 2 (learning procedure: restudy vs. testing) × 2 (final test: across language vs. within language) mixed-model analysis of variance (ANOVA) on the proportion of correctly classified targets (see Table 1) did not reveal significant main effects of learning procedure, F(1, 62) = 2.73, MSE = 0.02, p = .10,
Table 1. Mean Proportion of Correct Recognition
Note: Standard errors are given in parentheses.
A 2 (learning procedure) × 2 (final test) mixed-model ANOVA on the proportion of incorrectly classified related distractors (i.e., the proportion of false memories; see Table 2) did not reveal significant main effects or a significant interaction (all Fs < 1, all
Table 2. Mean Proportion of False Memories
Note: Standard errors are given in parentheses.
Discussion
The most important finding in this study was that a short-term testing effect emerged in the across-language condition, in which participants were prevented from using surface cues and instead had to rely exclusively on semantic cues provided by the English test words. By contrast, in the within-language condition, in which participants could base their recognition decisions on both surface and semantic cues, recognition was comparable for previously tested items and restudied items. The latter finding replicates the results typically reported for short RIs. Moreover, recognition memory for restudied items declined when the final-test language was different from the learning-phase language. However, this language switch did not influence recognition performance for items learned through testing. Taken together, these findings suggest that there are qualitative differences in how restudying and testing strengthen memory. Specifically, restudying strengthens surface features of items’ memory traces more than testing does, whereas testing strengthens semantic features of items’ memory traces more than restudying does.
Our findings for the target words seem to be inconsistent with the recently proposed distribution-based framework (e.g., Halamish & Bjork, 2011). According to this framework, a short-term testing effect will emerge when the final test is sufficiently difficult, that is, when the minimum memory strength required for retrieval of the items is sufficiently high. In the present study, final-test performance for previously tested items was similar in the across-language condition and the within-language condition. This indicates that the final tests in the two conditions were of comparable difficulty. Nevertheless, the language-switch manipulation gave rise to a testing effect. These results suggest that the type of final-test cue, rather than final-test difficulty, is a crucial factor in the emergence of the testing effect.
We also predicted that, compared with restudying, testing would increase the proportion of falsely recognized related distractors (i.e., the proportion of false memories). Contrary to this prediction, the proportion of false memories was similar for restudied and previously tested DRM lists. Perhaps this unexpected outcome arose because we employed relatively short, eight-word lists. Hence, during the immediate free-recall test, participants might have relied primarily on word meanings for item retrieval and much less on semantic associates, such as related distractors. Consequently, relative to restudying, retrieval practice might have resulted in only a small increase in the related distractors’ memory strength (see Sugrue, Strange, & Hayne, 2009, for a study showing that false recall increases with DRM list length). This, in turn, might have made it hard for us to detect differences in false memory between restudied and tested lists on the final recognition test. However, this account of our false-memory findings is post hoc, and further research is required to determine the conditions under which testing strengthens memory traces of related distractors more than restudying does.
To conclude, we note that our findings might inform classroom practice because they suggest that testing enhances meaningful, semantic processing of to-be-learned materials.
Acknowledgements
We thank Valerie Burken, Kirtie Dharampal, and Sanne van Roon for their assistance in collecting the data.
The authors declared that they had no conflicts of interest with respect to their authorship or the publication of this article.
