Abstract
Human memory is imperfect; thus, periodic review is required for the long-term preservation of knowledge and skills. However, students at every educational level are challenged by an ever-growing amount of material to review and an ongoing imperative to master new material. We developed a method for efficient, systematic, personalized review that combines statistical techniques for inferring individual differences with a psychological theory of memory. The method was integrated into a semester-long middle-school foreign-language course via retrieval-practice software. Using a cumulative exam administered after the semester’s end, we compared time-matched review strategies and found that personalized review yielded a 16.5% boost in course retention over current educational practice (massed study) and a 10.0% improvement over a one-size-fits-all strategy for spaced study.
Forgetting is ubiquitous. Regardless of the nature of the skills or material being taught, regardless of the age or background of the learner, forgetting happens. Teachers rightfully focus their efforts on helping students acquire new knowledge and skills, but newly acquired information is vulnerable and easily slips away. Even highly motivated learners are not immune: Medical students forget roughly 25% to 35% of basic science knowledge after 1 year, more than 50% by the next year (Custers, 2010), and 80% to 85% after 25 years (Custers & ten Cate, 2011).
Forgetting is influenced by the temporal distribution of study. For more than a century, psychologists have noted that temporally spaced practice leads to more robust and durable learning than massed practice (Cepeda, Pashler, Vul, Wixted, & Rohrer, 2006). Although spaced practice is beneficial in many tasks beyond rote memorization (Kerfoot et al., 2010) and shows promise in improving educational outcomes (Dunlosky, Rawson, Marsh, Nathan, & Willingham, 2013), the reward structure of academic programs seldom provides an incentive to methodically revisit previously learned material. Teachers commonly introduce material in sections and evaluate students at the completion of each section; consequently, students’ grades are well served by focusing study exclusively on the current section. Although optimal in terms of students’ short-term goals, this strategy is costly for the long-term goal of maintaining accessibility of knowledge and skills. Other obstacles also stand in the way of incorporating distributed practice into the curriculum. Students who are in principle willing to commit time to review can be overwhelmed by the amount of material, and their metacognitive judgments about what they should study may be unreliable (Nelson & Dunlosky, 1991). Moreover, though teachers recognize the need for review, the time demands of restudying old material compete with the imperative to regularly introduce new material.
Method
We incorporated systematic, temporally distributed review into third-semester, eighth-grade Spanish foreign-language instruction using a Web-based flash-card tutoring system, the Colorado Optimized Language Tutor (COLT). Throughout the semester, 179 students used COLT to drill on 10 chapters of material, which were introduced at approximately 1-week intervals. COLT presented vocabulary words and short sentences in English and required students to type the Spanish translations, after which corrective feedback was provided. The software was used both to practice newly introduced material and to review previously studied material. More information about the software and semester schedule can be found in the Experimental Methods section of Additional Methods and Results in the Supplemental Material available online.
For each chapter of course material, students engaged in three 20- to 30-min sessions with COLT during class time. The first two sessions began with a study-to-proficiency phase for the current chapter and then proceeded to a review phase. In the third session, these activities were preceded by a quiz on the current chapter, which counted toward the course grade. During the review phase of each session, study items from all chapters covered so far in the course were eligible for presentation. Selection of items for the review phase was handled by three different schedulers.
The massed scheduler continued to select material from the current chapter. It presented the item in the current chapter that students had least recently studied. This scheduler corresponds to recent educational practice: Prior to the introduction of COLT, the educational software used by these students allowed them to select the chapter they wished to study. Not surprisingly, given a choice, students focused their effort on preparing for the imminent end-of-chapter quiz, which is consistent with the preference for massed study found by Cohen, Yan, Halamish, and Bjork (2013).
The generic spaced scheduler selected one previous chapter to review at a spacing deemed to be optimal for a range of students and a variety of material, according to both empirical studies (Cepeda et al., 2006; Cepeda, Vul, Rohrer, Wixted, & Pashler, 2008) and computational models (Khajah, Lindsey, & Mozer, 2013; Mozer, Pashler, Cepeda, Lindsey, & Vul, 2009). Given the time frame of a semester—during which material must be retained for 1 to 3 months—a 1-week lag between initial study and review results in near-peak performance for a range of declarative materials. To achieve this lag, the generic spaced scheduler selected review items from the previous chapter, giving priority to the least recently studied items (Fig. 1).
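The selection rules of the two nonadaptive schedulers can be sketched in a few lines. This is an illustrative sketch, not COLT's code: the `Item` fields, the function names, and the example vocabulary are all assumptions made for the example, and the only logic taken from the text is "pick the least recently studied item from the current chapter (massed) or from the previous chapter (generic spaced)."

```python
from dataclasses import dataclass


@dataclass
class Item:
    """A study item; all field names here are illustrative assumptions."""
    name: str
    chapter: int          # chapter in which the item was introduced
    last_studied: float   # time of most recent study, e.g., in days


def least_recently_studied(items, chapter):
    """Return the item from `chapter` with the oldest last_studied time."""
    candidates = [it for it in items if it.chapter == chapter]
    if not candidates:
        return None  # e.g., no previous chapter exists yet
    return min(candidates, key=lambda it: it.last_studied)


def massed_scheduler(items, current_chapter):
    """Keep reviewing material from the current chapter."""
    return least_recently_studied(items, current_chapter)


def generic_spaced_scheduler(items, current_chapter):
    """Review the previous chapter (roughly a 1-week lag)."""
    return least_recently_studied(items, current_chapter - 1)


# Made-up example data, not taken from the course materials.
items = [
    Item("el perro", chapter=3, last_studied=20.0),
    Item("la casa", chapter=3, last_studied=18.5),
    Item("el gato", chapter=2, last_studied=12.0),
    Item("la mesa", chapter=2, last_studied=14.0),
]
print(massed_scheduler(items, current_chapter=3).name)          # la casa
print(generic_spaced_scheduler(items, current_chapter=3).name)  # el gato
```

In this sketch the generic spaced scheduler returns `None` during the first chapter, when there is nothing older to review; how the actual software handled that case is not stated here.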

Fig. 1. Trial allocation of the three review schedulers. Course material was introduced one chapter at a time, generally at 1-week intervals. Each vertical slice indicates the across-student average proportion of trials spent in a given week studying each of the chapters introduced up to that point. (Each slice includes trials from both the study-to-proficiency and the review phases.) Each chapter is indicated by a unique color.
The personalized spaced scheduler used a latent-state Bayesian model to predict what specific material a particular student would most benefit from reviewing. This model infers the instantaneous memory strength of each item the student has studied. The inference problem is difficult because past observations of a particular student studying a particular item provide only a weak source of evidence concerning memory strength. For example, suppose that a student has practiced an item twice, failing to get the correct answer 15 days ago but succeeding 9 days ago. Given these sparse observations, it would seem that one cannot reliably predict the student’s current ability regarding the item. However, data from the population of students studying the population of items over time can provide constraints helpful in characterizing the performance of a specific student for a specific item at a given moment. Our model-based approach is related to that used by e-commerce sites that leverage their entire database of past purchases to make individualized recommendations, even when customers have sparse purchase histories.
The model we used defines memory strength as being jointly dependent on factors relating to (a) an item’s latent difficulty, (b) a student’s latent ability, and (c) the amount, timing, and outcome of past study. We refer to the model with the acronym DASH (i.e., difficulty, ability, and study history).
The scheduler was varied within participants by randomly assigning one third of a chapter’s items to each scheduler, with assignment counterbalanced across participants. During review, the schedulers alternated in selecting items for retrieval practice. Each scheduler selected from among the items assigned to it, ensuring that all items had equal opportunity. All schedulers administered an equal number of review trials. Figure 1 and Table 1 present statistics of how often and when individual items were studied by individual students for each scheduler over the time course of the experiment. More information about the experimental procedure, subject pool, and study materials can be found in Materials, Procedure, and Participants in the Supplemental Material available online.
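One way to realize this within-participants design is sketched below. The text states only that one third of each chapter's items went to each scheduler, that assignment was counterbalanced across participants, and that schedulers alternated during review. The rotation scheme, the fixed shuffle seed, and all names in the code are assumptions for illustration.

```python
import random

SCHEDULERS = ("massed", "generic_spaced", "personalized_spaced")


def assign_items(chapter_items, participant_index, seed=0):
    """Split one chapter's items into thirds, one third per scheduler.

    Assumed scheme: every participant sees the same shuffled order (fixed
    seed), and the scheduler labels are rotated by participant_index so
    that each item is paired with each scheduler equally often.
    """
    shuffled = list(chapter_items)
    random.Random(seed).shuffle(shuffled)
    assignment = {}
    for position, item in enumerate(shuffled):
        scheduler = SCHEDULERS[(position + participant_index) % len(SCHEDULERS)]
        assignment[item] = scheduler
    return assignment


def review_order(n_trials):
    """Schedulers take turns choosing the next review item."""
    return [SCHEDULERS[t % len(SCHEDULERS)] for t in range(n_trials)]


chapter_items = [f"item{i}" for i in range(9)]
for participant in range(3):
    print(participant, assign_items(chapter_items, participant))
print(review_order(6))
```

With a chapter whose item count is a multiple of three, each scheduler receives exactly one third of the items, and across any three consecutive participant indices every item is handled once by each scheduler.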
Table 1. Presentation Statistics of the Three Schedulers for Individual Students on Individual Items
Results
Two proctored cumulative exams were administered to assess retention, one at the semester’s end and one 28 days later, at the beginning of the following semester. Each exam tested half of the course material, with items randomly selected for each student and balanced across chapters and schedulers; no corrective feedback was provided. On the first exam, retention for items assigned to the personalized spaced scheduler was 12.4% higher than retention for items assigned to the massed scheduler, t(169) = 10.1, p < .001, Cohen’s d = 1.38, and 8.3% higher than retention for items assigned to the generic spaced scheduler, t(169) = 8.2, p < .001, Cohen’s d = 1.05 (Fig. 2a). Over the 28-day intersemester break, the forgetting rate was 18.1%, 17.1%, and 15.7% for the massed, generic spaced, and personalized spaced conditions, respectively, so that the advantage of personalized review became even larger. On the second exam, personalized review boosted retention by 16.5% over massed review, t(175) = 11.1, p < .001, Cohen’s d = 1.42, and by 10.0% over generic review, t(175) = 6.59, p < .001, Cohen’s d = 0.88 (Fig. 2a).

Fig. 2. Scores on the cumulative end-of-semester exams. The bar graph (a) presents mean score as a function of condition for each of the two exams separately. The line graph (b) presents mean score across the two exams as a function of the chapter in which the material was introduced, separately for each condition. Error bars indicate ±1 SE, calculated within subjects (Masson & Loftus, 2003).
The schedulers had their primary impact for material introduced earlier in the semester (Fig. 2b), which makes sense because memory for that material had the most opportunity to be manipulated via review. The personalized spaced scheduler produced a large benefit for early chapters in the semester without sacrificing efficacy for later chapters. Among students who took both exams, only 22.3% and 13.5% scored better in the generic spaced and massed conditions, respectively, than in the personalized spaced condition.
Note that massed review was spaced by usual laboratory standards, being spread out over at least 6 days (new material was introduced on a Friday and practiced until Wednesday or Thursday the following week). This fact may explain both the small benefit of the generic spaced over the massed scheduler and the absence of a spacing effect (generic and personalized spaced schedulers outperforming the massed scheduler) for the final chapters (see Fig. 2).
DASH infers three factors contributing to recall success: an item’s difficulty, a student’s ability, and the study history of the specific student on the specific item. Histograms of these inferred contributions showed substantial variability (Fig. 3), so decisions about what items to review were markedly different across individual students and items.

Fig. 3. Histograms of three inferred factors, expressed in terms of their additive contribution to predicted log odds of recall. Each factor varies over 3 log units, which corresponds to a possible modulation of .65 in recall probability.
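To make the log-odds scale concrete, the sketch below sums three additive contributions and passes the total through a logistic function. Only the additive log-odds form comes from the caption; the function names, the argument names, and the choice of a range centered on zero with the other two contributions held at zero are assumptions for illustration.

```python
import math


def sigmoid(log_odds):
    """Convert log odds of recall into a recall probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def recall_probability(difficulty_term, ability_term, history_term):
    """Sum the three additive log-odds contributions, then squash."""
    return sigmoid(difficulty_term + ability_term + history_term)


# One factor sweeping a 3-log-unit range centered on zero,
# with the other two contributions held at zero (an assumption).
low = recall_probability(-1.5, 0.0, 0.0)
high = recall_probability(+1.5, 0.0, 0.0)
print(f"{low:.3f} -> {high:.3f}, modulation {high - low:.3f}")
```

Under these assumptions the recall probability moves from about .18 to about .82, a swing of roughly .64, which is in line with the approximate figure quoted in the caption. The swing is smaller when the other two factors push the total log odds away from zero, because the logistic function flattens toward its extremes.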
DASH predicts a student’s response accuracy for an item at a point in time given the response history of all students and items to that point. To evaluate the quality of DASH’s predictions, we compared DASH against alternative models by dividing the 597,990 retrieval practice trials recorded over the semester into 100 temporally contiguous disjoint sets; we then used the models to predict the data for each set given the preceding sets. The accumulative prediction error (Wagenmakers, Grünwald, & Steyvers, 2006) was computed using the mean deviation between the model’s predicted recall probability and the actual binary outcome, normalized such that each student was weighted equally. Figure 4 compares DASH against five alternatives: a baseline model that predicted a student’s future performance to be the proportion of correct responses the student had made in the past, a Bayesian form of item-response theory (IRT; De Boeck & Wilson, 2004), a model of spacing effects based on the memory component of ACT-R (Pavlik & Anderson, 2005), and two variants of DASH that incorporate alternative representations of study history motivated by models of spacing effects (ACT-R, multiscale context model). Details of the alternative models, model evaluations, and additional analyses of the experimental results are available in Additional Methods and Results in the Supplemental Material.
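The evaluation procedure can be sketched as follows, using the baseline model as the predictor. This is a simplified illustration rather than the analysis code: the trial format `(student, item, correct)`, the use of absolute deviation, the 0.5 default for students with no history, and all function names are assumptions.

```python
from collections import defaultdict


def split_contiguous(trials, n_sets):
    """Split time-ordered trials into n_sets contiguous, disjoint chunks."""
    size, extra = divmod(len(trials), n_sets)
    chunks, start = [], 0
    for k in range(n_sets):
        end = start + size + (1 if k < extra else 0)
        chunks.append(trials[start:end])
        start = end
    return chunks


def baseline_predict(history, student, item):
    """Past proportion correct for this student (0.5 if unseen: an assumption)."""
    outcomes = [correct for s, _item, correct in history if s == student]
    return sum(outcomes) / len(outcomes) if outcomes else 0.5


def accumulative_prediction_error(trials, n_sets, predict=baseline_predict):
    """Predict each chunk from all preceding chunks; weight students equally."""
    if n_sets < 2:
        raise ValueError("need at least two sets: one to fit, one to predict")
    chunks = split_contiguous(trials, n_sets)
    per_student = defaultdict(list)
    history = list(chunks[0])        # the first chunk is used only for fitting
    for chunk in chunks[1:]:
        for student, item, correct in chunk:
            p = predict(history, student, item)
            per_student[student].append(abs(p - correct))
        history.extend(chunk)        # reveal a chunk only after predicting it
    student_means = [sum(errs) / len(errs) for errs in per_student.values()]
    return sum(student_means) / len(student_means)


# Tiny made-up data set in time order: (student, item, correct).
trials = [
    ("s1", "a", 1), ("s2", "a", 0),
    ("s1", "b", 1), ("s2", "b", 1),
    ("s1", "a", 0), ("s2", "a", 0),
]
print(accumulative_prediction_error(trials, n_sets=3))  # 0.625
```

Averaging within each student before averaging across students is what keeps a student with many trials from dominating the score. Any other model can be compared on the same footing by passing a different `predict` function with the same signature.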

Fig. 4. Accumulative prediction error of six models using the data from the semester-long experiment. The models are as follows: a baseline model that predicts performance from the proportion of correct responses made by each student, a model based on item-response theory (IRT), a model based on Pavlik and Anderson’s (2005) ACT-R model, DASH, and variants of DASH including components of ACT-R and the multiscale context model (MCM). Error bars indicate ±1 SEM.
The three variants of DASH performed better than the alternatives. Each variant had two key components: (a) a dynamic representation of study history that characterized learning and forgetting and (b) a Bayesian approach to inferring latent difficulty and ability factors. Models that omitted the first component (baseline and IRT) or the second component (baseline and ACT-R) did not fare as well. The DASH variants all performed similarly. Because these variants differed only in the manner in which the temporal distribution of study and recall outcomes was represented, this distinction does not appear to be critical.
Discussion
Our work builds on the rich history of applied human-learning research by integrating two distinct threads: classroom-based studies that compare massed with spaced presentation of material (Carpenter, Pashler, & Cepeda, 2009; Seabrook, Brown, & Solity, 2005; Sobel, Cepeda, & Kapler, 2011) and laboratory-based investigations of adaptive scheduling techniques, which are used to select material for an individual to study on the basis of that individual’s past study history and performance (e.g., Atkinson, 1972).
Previous explorations of temporally distributed study in real-world educational settings targeted a relatively narrow body of course material to which participants were unlikely to be exposed outside the experimental context. Further, these studies compared just a few spacing conditions, and the spacing was the same for all participants and materials, as in our generic spaced condition.
Previous evaluations of adaptive scheduling have demonstrated the advantage of one algorithm over another or over nonadaptive algorithms (Metzler-Baddeley & Baddeley, 2009; Pavlik & Anderson, 2008; van Rijn, van Maanen, & van Woudenberg, 2009), but these evaluations have been confined to the laboratory and have spanned a relatively short time scale. The most ambitious previous experiment (Pavlik & Anderson, 2008) involved three study sessions in 1 week and a test the following week. This compressed time scale limited the opportunity to manipulate spacing in a manner that would influence long-term retention (Cepeda et al., 2008). Further, brief laboratory studies do not deal with the complex issues that arise in a classroom, such as the staggered introduction of material and the certainty of exposure to the material outside the experimental context.
Whereas previous studies offer in-principle evidence that human learning can be improved by the timing of review, our results demonstrate in practice that integrating personalized-review software into the classroom yields appreciable improvements in long-term educational outcomes. Our experiment went beyond past efforts in its scope: It spanned the time frame of a semester, covered the content of an entire course, and introduced material in a staggered fashion and in coordination with other course activities. We find it remarkable that the review manipulation had as large an effect as it did, considering that the duration of roughly 30 min a week was only about 10% of the time students were engaged with the course. The additional, uncontrolled exposure to material from classroom instruction, homework, and the textbook might well have washed out the effect of the experimental manipulation.
Personalization
Consistent with the adaptive-scheduling literature, our experiment shows that a one-size-fits-all variety of review is significantly less effective than personalized review. The traditional means of encouraging systematic review in classroom settings—cumulative exams and assignments—is therefore unlikely to be ideal.
We acknowledge that our design confounded personalization and the coarse temporal distribution of review (Fig. 1, Table 1). However, indiscriminate review of older material is unlikely to be beneficial because it comes at the expense of newer material, and because time limitations permit the selection of only a small fraction of the ever-growing collection of candidate material.
Any form of personalization requires estimates of an individual’s memory strength for specific knowledge. Previously proposed adaptive-scheduling algorithms based their estimates on observations from only the given individual, whereas the approach taken here is fundamentally data driven, leveraging the large volume of quantitative data that can be collected in a digital learning environment to perform statistical inference on the knowledge states of individuals at an atomic level. This leverage is critical to obtaining accurate predictions (Fig. 4).
Outside the academic literature, two traditional adaptive-scheduling techniques have attracted a degree of popular interest: the Leitner (1972) system and SuperMemo (Wozniak & Gorzelanczyk, 1994). Both aim to present material for review when it is on the verge of being forgotten. As long as each retrieval attempt succeeds, both techniques yield a schedule in which the interpresentation interval expands with each successive presentation. These techniques underlie many flash-card-type Web sites and mobile applications, which are marketed with the claim of optimizing retention. Though one might expect that any form of review would show some benefit, the claims have not yet undergone formal evaluation in actual usage, and given our comparison of techniques for modeling memory strength, we suspect that there is room for improving these two traditional techniques.
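The expanding-interval behavior of such schemes can be illustrated with a simplified Leitner-style update. The promote-on-success, reset-on-failure rule is the usual description of the Leitner system. The number of boxes and the doubling review intervals are assumptions chosen for the example, and the sketch does not attempt to reproduce SuperMemo.

```python
def leitner_update(box, correct, n_boxes=5):
    """Promote a card one box on success; send it back to box 0 on failure."""
    if correct:
        return min(box + 1, n_boxes - 1)
    return 0


def review_interval_days(box):
    """Assumed interval for each box: 1, 2, 4, 8, ... days."""
    return 2 ** box


box, day = 0, 0
schedule = []
for correct in [True, True, True, False, True]:
    day += review_interval_days(box)   # wait out the current box's interval
    schedule.append(day)               # the card is reviewed on this day
    box = leitner_update(box, correct)
print(schedule)  # [1, 3, 7, 15, 16]
```

The gaps between reviews grow (1, 2, 4, 8 days) while retrieval succeeds and collapse back to 1 day after the failure on day 15. The schedule depends only on that one learner's history with that one card, in contrast to the population-level inference described above.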
Beyond fact learning
Our approach to personalization depends only on the notion that understanding and skill can be cast in terms of collections of primitive knowledge components, or KCs (VanLehn, Jordan, & Litman, 2007), and that observed student behavior permits inferences about the state of these KCs. The approach is flexible, allowing for any problem posed to a student to depend on arbitrary combinations of KCs. The approach is also general, having application beyond declarative learning to domains focused on conceptual, procedural, and skill learning.
Educational failure at all levels often involves knowledge and skills that were once mastered but cease to be accessible because of lack of appropriately timed rehearsal. Although it is common to pay lip service to the benefits of review, comprehensive and appropriately timed review is beyond what any teacher or student can reasonably arrange. Our results suggest that a digital tool that solves this problem in a practical, time-efficient manner will yield major payoffs for formal education at all levels.
Acknowledgements
We thank F. Craik, A. Glass, J. L. McClelland, H. L. Roediger, III, and P. Wozniak for valuable feedback on the manuscript.
Declaration of Conflicting Interests
The authors declared that they had no conflicts of interest with respect to their authorship or the publication of this article.
Funding
This research was supported by a National Science Foundation (NSF) Graduate Research Fellowship, by NSF Grants SBE-0542013 and SMA-1041755, and by a collaborative-activity award from the McDonnell Foundation.
