Abstract
Background:
Findings from the testing effect literature suggest several ways to achieve testing effects in an authentic classroom, but few consider instructor workload, equity, and resources that determine feasibility and sustainability of testing effect methods in practice.
Objective:
To determine elements and procedures from the testing effect literature for practical application, devise a method for feasibly and sustainably implementing testing effect methods in practice, and determine if a simple way to incorporate retrieval practice into an existing introduction to psychology course was sufficient to observe testing effects.
Method:
Quiz scores of Introductory Psychology sections with and without retrieval practice were compared. Sections with retrieval practice also compared the effects of repeated and new questions on quiz performance.
Results:
Students with retrieval practice performed significantly better on quizzes than those without. Repeated and new retrieval practice were equally superior.
Conclusion:
Retrieval practices can successfully be implemented, feasibly and sustainably, in an authentic classroom environment. Retrieval practice questions can be related to delayed practice questions, rather than exact repeats, to achieve a testing effect.
Teaching Implications:
Distributing low-stakes multiple-choice questions throughout lectures is effective for increasing test performance. The current method was not burdensome to workload, content, or resources.
Findings from the testing effect literature indicate that active retrieval of information through testing strengthens underlying connections and memory for that information (for reviews see Adesope et al., 2017; Roediger & Karpicke, 2006; Rowland, 2014; Schwieren et al., 2017; Tolentino Moreira et al., 2019). The basic testing effect method involves presentation of information, followed by an initial test (retrieval practice), restudy or an unrelated task, and then a delayed test. A testing effect occurs when delayed test performance is higher following retrieval practice than other conditions. In college courses, the addition of retrieval practice should enhance exam performance (e.g., McDaniel et al., 2007). The present study provides a practical method to achieve testing effects in an authentic college classroom that aims to facilitate learning while also considering fairness, instructor workload, and resources.
The current goals were to review findings from laboratory and classroom studies on testing effects for methodological adaptation in an authentic introduction to psychology course. The first goal was to discern critical elements of retrieval practice and delayed testing that result in testing effects, emphasizing the value of retrieval practice accuracy and the relationship between retrieval practice and delayed questions. The second goal was to modify experimental methodology for adaptation into an ecologically valid classroom setting, emphasizing management of instructor workload, class time, and resources that accompany adding retrieval practice. The third goal was to test whether a simple method incorporating retrieval practice into an existing introduction to psychology course was sufficient to observe testing effects. Modifying experimental methods to fit an authentic classroom enables instructors to feasibly implement retrieval practice to benefit student performance across semesters.
Critical Elements for Testing Effects
Prior research indicates that retrieval practice accuracy facilitates delayed test performance (Jang et al., 2012; Kang et al., 2007; Pashler et al., 2005; Rowland & DeLosh, 2014; Yang & Shanks, 2018). It follows, then, that feedback at practice retrieval should benefit delayed test performance. Some have reported no clear advantage for feedback over no feedback (Butler & Roediger, 2007), but others have confirmed the feedback advantage (Agarwal et al., 2008; Cull, 2000; Kang, 2010; Vojdanoska et al., 2010), including two meta-analyses (Rowland, 2014; Schwieren et al., 2017). Applied to teaching, incorporating retrieval practice with feedback is likely to produce testing effects. Providing feedback is also more practical than ignoring student responses or perpetuating misunderstandings, which is antithetical to authentic teaching. Feedback on retrieval practice is likely to be beneficial, and assuredly does no harm (Adesope et al., 2017).
Additionally, similarity between retrieval practice and delayed tests affects strength of the testing effect. Similarity has been explored from three main theoretical perspectives. The Transfer Appropriate Processing theory (TAP) suggests that testing effects result from practice using a specific retrieval strategy, predicting testing effects when the initial and delayed testing questions match exactly (Nguyen & McDaniel, 2015; Wooldridge et al., 2014). However, sometimes a mismatch between initial and delayed tests results in a testing effect greater than or equivalent to that of an exact match (Carpenter & DeLosh, 2006; Kang et al., 2007; McDermott et al., 2014; Rowland, 2014). To account for this, the retrieval effort perspective suggests that processing depth or effort has greater impact on delayed test performance than precise retrieval strategy. For example, McDaniel et al. (2007) compared weekly short answer to multiple-choice retrieval practices and reported highest testing effect gains in the short answer condition on a delayed multiple-choice test. Others have reported larger or comparable effects for free recall (Endres & Renkl, 2015; Glass & Sinha, 2013; Kang et al., 2007; McDaniel et al., 2012; Smith & Karpicke, 2014). Relevant to teaching is that retrieval practice and delayed test questions can be different, affording flexibility to instructors who prefer not to repeat retrieval practice on a delayed test.
Expanding on retrieval effort, retrieval-induced facilitation (Chan, 2009; Chan et al., 2006) predicts testing effects for material related to, but not tested during, retrieval practice, due to spread of activation in a semantic network. Several researchers have observed testing effects for retrieval practice questions that covered the same or different concepts from those used in the mid-term and final exams (e.g., Chan et al., 2006; Johnson & Mayer, 2009; Mayer et al., 2009). The precise questions asked during retrieval practice are less relevant than their ability to increase student attention, organize knowledge, and practice metacognitive skills (Mayer et al., 2009). However, the extent to which retrieval practice facilitates testing effects for related information rests on two caveats: relatedness between tested and untested material, and ease by which material can be integrated into semantic networks.
The testing effect for initially untested material weakens as material becomes more distantly related to that which was tested (Butler, 2010). Batsell et al. (2016) compared testing effects for identical, similar, or new practice retrieval questions and observed an advantage for identical and similar conditions over the new condition. Foss and Pirozzolo (2017) observed a similar pattern, as did Bjork et al. (2014) with multiple choice questions and Carpenter (2011) with cued recall. Chan and Langley (2011; Chan et al., 2009) also suggest that retrieval practice enhances learning new information and modifies existing networks of related information. Retrieval practice on a portion of material results in testing effects for closely related material and may further benefit students by facilitating learning of subsequent material.
The second caveat of similarity is that integration matters. Testing effects extend to initially untested material if the studied material is coherent, logically ordered, and closely related to retrieval practice questions. Compelling evidence for this comes from Chan (2009; Chan et al., 2006, 2018). Half the participants read an article with the text naturally unfolding (high integration) and the other half read the same material but sentences in each paragraph were randomly ordered, akin to fact lists (low integration). Delayed test questions were repeated from retrieval practice, closely related to retrieval practice (same paragraph), and from different paragraphs. Testing effects for repeated questions occurred in both integration conditions, but the effect for closely related questions was restricted to high integration, and no advantage for distantly related information was observed in either integration condition. These findings suggest that coherent, logically ordered, and closely related material is more easily integrated into semantic networks and includes more elaborations, thus enabling activation of more information. Low integration conditions may reflect smaller or more loosely organized networks that fail to distinguish relatedness. Several others reiterate the suggestion that retrieval practice facilitates knowledge organization (Bishara et al., 2015; Carpenter, 2009; Endres & Renkl, 2015; Grimaldi & Karpicke, 2012; Jensen et al., 2014; Karpicke & Grimaldi, 2012; Pyc & Rawson, 2009). Chan’s (2009, 2018) work clarifies that the benefits of retrieval practice for related material are dependent upon the ease by which the information can be integrated into knowledge organization structures.
Chan’s (2009; Chan et al., 2006) findings also suggest that testing effects may be more elusive for academic material perceived by learners to be less cohesive and logically ordered (i.e., fact lists). In introductory psychology, topics such as biopsychology, sensation and perception, and health may seem more akin to fact lists due to high quantities of unfamiliar anatomical terms (i.e., low integration), whereas other topics, such as memory, psychopathology, and states of consciousness, unfold more coherently, more easily integrating into existing semantic networks. In my experience, and as reported by Peck et al. (2006), introductory students have the greatest self-reported difficulty and the lowest test performance for health, sensation and perception, and biopsychology topics. In a laboratory study, Wooldridge et al. (2014) found a testing effect for repeated, but not related, questions following a 45-minute reading of an evolutionary biology chapter, indicating that the material may not have lent itself to high integration. Jensen et al. (2014) observed testing effects on a final exam in a biology course (non-majors) for related material only when retrieval practice questions directly facilitated elaborative thinking, but not when they focused on simply remembering and understanding the material. These two studies provide converging support that topics not inherently well-integrated may require retrieval practice questions that directly facilitate elaboration.
Although Wooldridge et al. (2014) was conducted in a laboratory, their method has several similarities to the teaching context that highlight general challenges to finding testing effects in an authentic classroom. Aside from the chapter topic and length of presentation, their retrieval practice consisted of one question for each chapter section, followed by 10 minutes of an unrelated task and then a 20-minute delayed test. Because class sessions often include a variety of topically related concepts presented for an hour (or two) with instructors posing few questions per topic to gauge student comprehension, the quantity and organization are probably quite like Wooldridge et al.’s (2014) chapter presentation and retrieval practice. Perhaps long sessions with complex, novel material and sparse retrieval practice are unlikely to facilitate testing effects on related, initially untested material.
Two other avenues of research support this conclusion and warrant consideration for achieving testing effects in teaching. First, memory performance decreases under prolonged (about 17 minutes) high load conditions, even for simple stimuli (Krimsky et al., 2017). Further, mind wandering can be significant during a lecture (Risko et al., 2012), even though live lectures result in higher memory performance than recorded lectures and textbook reading (Varao-Sousa & Kingstone, 2015; Varao-Sousa et al., 2013). The prolonged high load nature of a college textbook chapter may have been an obstacle to testing effects in Wooldridge et al. (2014) and may also be in the classroom lecture. Second, sparse questioning in a high load context may be insufficient for achieving testing effects for related material. Jensen et al. (2013) suggest that more retrieval practice questions lead to better delayed test performance. Of priority, then, is to alter classroom procedures to enable instructors to use different retrieval practice and delayed test questions, while achieving testing effects for related concepts. One possible solution for teaching is to incorporate a sufficient quantity of retrieval practice questions that cover the same and closely related concepts on the delayed test at intervals that minimize memory load.
To summarize, testing effect research offers several methodological insights to enhance performance in an authentic class. Misunderstandings at retrieval practice should be corrected using feedback. Also, testing effects are likely for repeated retrieval practice questions for loosely or highly integrated material, but can extend to information that is closely, but not distantly, related to retrieval practice for highly integrated material. Implementing retrieval practice does not appear to have a general effect on memory for concepts that are topically related, such as in the same article, chapter, or lecture. Rather, testing effects may be more likely for information from the same paragraphs or short subsections of an article, chapter, or lecture when the material is coherent and logically ordered for the learner and there is a sufficient quantity of retrieval practice questions. Lastly, retrieval practice for loosely integrated or incoherent material may require retrieval practice questions explicitly designed to facilitate elaborative processing. Given these findings, highly integrated information followed by adequate amounts of retrieval practice questions should increase performance on repeated and closely related delayed test questions.
Practical Application to the Classroom
At issue is how to take advantage of testing effects in an authentic course, using methods that are feasible and sustainable across multiple courses and semesters. The studies cited thus far were mostly conducted in laboratories or used classrooms as pseudo-laboratories to maintain internal validity. This limits the direct application of their methods in an authentic classroom because the workload, resource demands, and disruption to routine that are acceptable in an experimental study lasting one semester or class session become deterrents in practice. Consider that experimental methods are inappropriate for long-term use in a classroom because control conditions (e.g., restudy or nothing) of between-groups designs give a learning advantage to some students over others (e.g., Bangert-Drowns et al., 1991; Foss & Pirozzolo, 2017) and within-subjects designs place all students at a learning disadvantage for some material (e.g., Bjork et al., 2014; McDaniel et al., 2012; Roediger et al., 2011). Because of the inequitable treatment inherent to experiments, their methods are an unfair practice to sustain in a classroom.
Vojdanoska et al. (2010) is an example of achieving good control in class that is neither feasible nor sustainable in an authentic classroom for all the reasons mentioned above. Students in developmental psychology read a 10-minute PowerPoint presentation, followed by retrieval practice (16 questions) and feedback on half the questions. Students who directly asked about non-feedback items were told to find the answers in the material posted on the course website later. Students were instructed not to take notes during the presentation and to avoid discussing it with other students, and the material for study was available only after the unexpected delayed test. While achieving good control, these elements cannot be feasibly or sustainably replicated in an authentic classroom. A multi-question paper quiz after 10 minutes of presentation consumes too much time in class (reducing content) and is burdensome to create and manage (adding workload and resources). There are also diminished benefits to students when their questions go unanswered, notetaking is discouraged, and material is available days or weeks later, after the test. The value of studies with good control to transitioning from laboratory to classroom cannot be overstated, but their application in practice requires methodological adjustments.
Beyond experimental control, other studies conducted in the context of a class preclude sustainable and feasible application in practice. For example, Batsell et al. (2016) and Leeming (2002) achieved testing effects using a test-per-day procedure. Batsell et al.’s (2016) students took 21 tests and Leeming’s (2002) took 22–24 tests. Reserving 15–20 minutes per class meeting (Leeming, 2002) may be difficult in a course already constrained by lectures, activities, and exams. Correcting and providing individual feedback adds to workload, as does tending to copious handouts which may not be at all feasible in large classes. Although fruitful, Batsell et al.’s (2016) and Leeming’s (2002) frequent retrieval practices seem overly burdensome, and therefore, unsustainable in practice.
One solution to burdensome time use for in-class retrieval practices is to move them online. McDaniel et al. (2012) used online quizzes to successfully demonstrate a testing effect for repeated and related information in a small (n = 14) biopsychology course. Trumbo et al. (2016) observed a testing effect on a final exam following weekly online quizzes in a large section of introductory psychology (n = 715), but Downs (2015) did not find a testing effect with daily online quizzes in introductory psychology (n = 67). Because of the myriad differences between their courses, strong conclusions cannot be drawn about the effectiveness of online retrieval practice. Further, online test administration may be one solution for instructors and students who have the time, skill, and consistently equal access to technical resources, but this is not the case for many instructors or their students. Moreover, frequent online retrieval practice adds to student workload, and an appropriate balance of online retrieval practices to other course requirements has not been empirically determined.
Another practical consideration is that class sessions can be unpredictable. Planned lessons run into the next class session or start earlier than anticipated. In such cases, retrieval practices prepared by the instructor for the start or end of a session, or even for weekly online delivery, require repeated modifications that become burdensome to instructor workload and are therefore unsustainable across semesters. One solution is to use an adjunct question approach, distributing retrieval practice questions throughout a lecture. Neither Healy et al. (2017) nor Weinstein et al. (2016) observed testing effect differences between retrieval practices distributed after short blocks or at the end of a session. They used written responses, but Shapiro and Gordon (2012) observed testing effects using an iClicker system to collect responses to multiple choice questions presented throughout PowerPoint presentations across 11 introductory psychology chapters. The methods used in these studies may not solve issues of workload or technical resources, but they indicate that distributing retrieval practice questions throughout a lecture, instead of at the end or beginning, preserves testing effects.
Distributing questions throughout a lecture also addresses the memory load and mind-wandering issue discussed earlier in reference to Wooldridge et al. (2014). Healy et al. (2017) observed progressive gains on items tested after short blocks of information over those tested after presentation of all blocks. While not resulting in differences on the delayed test, they suggest that questioning after each block enhanced motivation and decreased boredom. This complements Burke and Ray (2008), who observed that questioning at approximately 15-minute lecture intervals sustains attention throughout the class duration. They and Pastotter et al. (2011) suggest that questioning shifts attention to a new task and returning to lecture introduces a new shift, thereby repeatedly resetting attentional focus throughout the duration of the lecture. Refocusing attention to lecture can occur after only a 90-second alternate activity (Rosegard & Wilson, 2013). Schacter and Szpunar (2015) also observed that distributed questioning reduced mind wandering during pre-recorded lectures and improved test performance. Students who are engaged for too long with the same material are at an attentional (and learning) disadvantage compared to students who took a short break from the material (Szpunar et al., 2013). Mulligan and Picklesimer (2016) and Buchin and Mulligan (2017) provide compelling evidence for benefits of retrieval practice on attention. They compared retrieval practice to restudy under conditions of full or divided attention. Delayed test performance following retrieval practice was unaffected by divided attention, but was significantly reduced following restudy. They concluded that retrieval practice protects against distractors and sustains attention. Thus, it seems as though testing effects could be achieved by introducing retrieval practice questions after short lecture blocks, with the added benefit of resetting or sustaining student attention and limiting memory load.
To summarize, adapting experimental methods that achieve testing effects to classroom practice requires procedures that are thorough enough to achieve the desired testing effect, equally available to students, easy enough to implement across semesters, and not onerous to create or monitor. Frequent and distributed retrieval practice questions that adequately cover most of the material throughout a presentation have successfully resulted in testing effects. The challenge is for instructors to find ways to adapt and integrate findings from the literature to be less time consuming, burdensome, or artificial and more flexible than experimental methods. To be sustainable, retrieval practice and delayed tests must occur at natural punctuations that complement class sessions and course structure, adding minimal burden to instructor and student workload.
The Current Study
The current study was designed to determine if findings from the testing effect literature could be replicated in an authentic classroom using feasible methodology that could be sustained across semesters. The literature has provided several elements that seem critical for achieving testing effects, including feedback at retrieval practice, close or exact relationships between retrieval practice and delayed test questions, and highly integrated material. Moreover, distributing questions throughout a lecture achieves testing effects, minimizes instructor modifications, and may have the added benefits of reducing memory load and sustaining attention. Use of online retrieval practice or iClicker systems in class works, too, but requires skill, resources, alterations, or training that many are unable or unwilling to obtain. The current study was designed as a straightforward solution to add retrieval practice using tools already in place in an existing lecture-style course, in a manner consistent with the boundaries of testing effects suggested by those cited herein.
Hypothesis 1
Students with retrieval practice distributed throughout a lecture will demonstrate higher performance on a delayed test than those in previous classes who did not have retrieval practice. This conceptually replicates the findings of several studies (e.g., Healy et al., 2017; Shapiro & Gordon, 2012; Weinstein et al., 2016) by incorporating retrieval practice after short blocks of information.
Hypothesis 2
Testing effects will be consistent across multiple sections of Introduction to Psychology but may not be observed for biopsychology, sensation and perception, or health topics. While testing effects have been observed with jargon-heavy courses (e.g., McDaniel et al., 2012), introductory psychology students are not immersed in that content for the semester, have varying degrees of interest and minimal background in cellular or functional neuroanatomy, and struggle the most with these topics (Peck et al., 2006). Moreover, the work of Wooldridge et al. (2014) and Jensen et al. (2014) suggests that retrieval practice questions may need to directly facilitate elaboration to enable testing effects for these topics in introductory psychology.
Hypothesis 3
No testing effect differences will be observed between retrieval practice questions that are identical to and closely related to questions on the delayed test. Adequate coverage of concepts has resulted in testing effects for identical and closely related questions (Batsell et al., 2016; Bjork et al., 2014; Chan, 2009; Chan et al., 2006; Jensen et al., 2013) and this should extend to an authentic classroom.
Method
Participants
Three hundred and eighty-four students enrolled across four class sections of Introductory Psychology participated for no compensation. The classes were from Fall 2018 (F18) (n = 96), Spring 2019 (S19) (n = 96), Fall 2019 (F19a) (n = 97), and Fall 2019 (F19b) (n = 95). Mean age was 20 years, and 75% of the sample was female. The sample was 7% Asian, 8.5% Black or African American, 65% Caucasian, 0.25% Pacific Islander, 13% Hispanic or Latino, and 6% more than one or unspecified.
Materials
Delayed test
Seven multiple choice unit quizzes of varying lengths spanning 15 topics in introductory psychology were used to define delayed test performance. Length of quizzes depended on content quantity as follows: Quiz 1 (Q1): Psychopathology & Psychotherapy (33 questions); Q2: Health & Social Psychology (44 questions); Q3: Innate Behavior & Learning (27 questions); Q4: Memory & Cognition (23 questions); Q5: Sleep & Drugs (36 questions); Q6: Sensation, Perception, & Biopsychology (41 questions); Q7: Scientific Methods & Psychological Perspectives (26 questions). Quizzes were printed on 8.5 × 11 in. plain paper, double-sided, and ordered similarly to the lecture. A machine-scored form was used by students for responses.
Retrieval practice questions
Short blocks of four to seven multiple choice questions were added to all PowerPoint presentations (PP) for F19a and F19b sections. Neither the precise timing nor number of preceding slides was manipulated. Rather, the blocks were inserted after major concepts or short sections spanning approximately three to five lecture slides. Two question types were used. Repeated questions were identical to those on the unit quiz. New questions were different from the quiz, and were not paraphrased, flipped, or reworded quiz questions. New questions focused on a different or related aspect of a concept than the delayed test. For example, a question on Quiz 2 was, “Which of the following is considered to be a chronic stressor?” (Answer: Living in a high crime area), and the corresponding new retrieval practice question was “What kind of stressor is difficulty at work or starting college?” (Answer: Life Changes and Strains). New questions were intended to be equal in difficulty to repeated questions, but this was not independently verified. The number of retrieval practice questions in a presentation was equivalent to the number of questions on the corresponding quiz, which could not be altered because the quizzes were previously used in control condition sections. Repeated questions were all questions used on quizzes 1, 3, and 6; new questions corresponded to the same concepts on unit quizzes 2, 4, and 5, and were identical in quantity. This parsing ensured an even split across quizzes and avoided predictability.
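The assignment of question types to quizzes can be tallied in a short sketch. The variable and function names are hypothetical and are not part of the study materials; the question counts are those listed for the delayed test, and the split into repeated and new quizzes is the one described above. Python is used only for illustration.

```python
# Illustrative tally of the retrieval practice design; all names are hypothetical.
# Question counts per unit quiz, as listed in the Delayed test section.
QUIZ_LENGTHS = {1: 33, 2: 44, 3: 27, 4: 23, 5: 36, 6: 41, 7: 26}

# Quizzes whose retrieval practice questions were repeated vs. new.
REPEATED_QUIZZES = (1, 3, 6)
NEW_QUIZZES = (2, 4, 5)


def total_questions(quizzes):
    """Sum the number of questions across the given quiz numbers."""
    return sum(QUIZ_LENGTHS[q] for q in quizzes)


repeated_total = total_questions(REPEATED_QUIZZES)  # 33 + 27 + 41
new_total = total_questions(NEW_QUIZZES)            # 44 + 23 + 36

if __name__ == "__main__":
    print(f"Repeated questions: {repeated_total}")
    print(f"New questions: {new_total}")
```

With these counts, the two question types cover nearly equal numbers of items (101 repeated and 103 new). Quiz 7 is left out of both tallies because it is not assigned to either question type.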
Procedure
Informed consent was obtained from F19 students after the sixth quiz to avoid participant reactance, and no indications were given to students that their performance was being studied. Comparison classes (F18 and S19) were determined by the Institutional Review Board (IRB) to be archival, sufficiently ordinary, confidential, and without risk.
Students in all four sections were required to attend class twice per week for 110 minutes per session (4 credits). Each session included a lecture accompanied by a PP. PP with retrieval practice were used for both F19 sections, whereas F18 and S19 had PP without retrieval practice. Most sessions also included one or more written in-class activities, none of which duplicated the retrieval practice, but no attempt was made to avoid covering the same concept. Activities occurred after retrieval practice for F19 or at the conclusion of the relevant topic for sections without retrieval practice. Units 2 and 6 used five lecture sessions due to high content quantity and/or complexity; all others used four. The lecture schedule was based on my experience and included in the syllabus.
Every attempt was made to avoid experimenter bias. Retrieval practice was the only manipulated difference between the F19 sections and the baseline S19 and F18 sections. Lectures, PP, and classroom activities were otherwise identical, including the same jokes (bad ones, too), and rare typographical errors or awkward wordings were not corrected. Data were compiled after the end of the F19 semester to avoid potential bias of insider knowledge of student performance.
Retrieval practice blocks occurred after approximately 15 to 25 minutes of lecture (three to five lecture slides). Questions were presented and answered sequentially using slideshow animation. The instructor (author) read each question aloud and asked students to orally volunteer answers, followed by a quick hand poll of agreement among the students. It is unlikely that oral responses would diminish testing effects, as Putnam and Roediger (2013) observed no differences in strength of the testing effect between oral and written responses. The instructor orally provided confirmatory or corrective feedback for each question before presenting the next question or resuming lecture. The instructor explained answer rationale only on the rare occasions that students had difficulty, asked for explanation, or offered multiple answers. Each block used 2–4 minutes, and total time used per class session was within the range of 8–15 minutes. The precise duration of lecture and retrieval practice blocks were intentionally loosely controlled to fit genuinely into the natural discourse and topic coverage of an authentic class. All students had access to PP without retrieval practice questions through the learning management system (Blackboard) for the entire semester. Retrieval practice questions were not accessible outside of class or on personal devices, other than what students wrote down at time of presentation. No electronic devices were allowed during class sessions as a course policy.
Students completed the delayed test (unit quiz) in class at the start of the class session that followed completion of unit lectures. Quiz duration depended on length, varying from 30 to 40 minutes. No makeup quizzes were allowed, but only six quiz scores were included in the final grade. As such, not all students took all quizzes, and that was their prerogative. All units were accompanied by extensive study guides. Optional written critical thinking assignments collected before the start of the quiz were combined with the quiz scores to create the unit grade. Study guides and optional assignments were identical across course sections and available to students for the duration of the semester on Blackboard.
An attempt was made to address the confound introduced by non-random assignment. PP without retrieval practice was used for all four sections in the final unit of the course (Unit 7) to enable comparisons of performance when the two groups were treated equally. Unit 7 was selected for this comparison because it seemed like the only ethical way to address the confound. Unit 7 scores were inherently lower stakes than those of previous units: because the lowest unit score was dropped from calculations, they could only increase, not decrease, final grades. It was more prudent to avail students of retrieval practice benefits on higher impact units. Also, historically, only about one-third of students participate in Unit 7, which would introduce a selective attrition confound if compared to other units. Moreover, Unit 7 provides one last opportunity for A- to F students to increase their final averages; it is exceptionally rare that students who already have high or satisfactory grades take this quiz. For these reasons, this unit is naturally separable and would be inappropriate to include in comparative analyses with other units. While addressing the semester confound by comparing the classes on Unit 7 has limited value, investigations in the context of an authentic classroom require methodological adjustments that prioritize ecological over internal validity.
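The claim that a Unit 7 score can raise but never lower a final grade follows from the drop-lowest rule. The sketch below illustrates this under two assumptions: that the final average is the mean of the six highest of seven unit scores, and that an untaken unit counts as 0. The function name and the example scores are hypothetical and are not taken from course records.

```python
def final_average(unit_scores, keep=6):
    """Mean of the `keep` highest unit scores (assumed grading rule)."""
    best = sorted(unit_scores, reverse=True)[:keep]
    return sum(best) / keep

# Hypothetical scores for Units 1-6; Unit 7 is skipped (counted as 0) or taken.
first_six = [72, 65, 80, 58, 77, 69]
skipped = final_average(first_six + [0])
low_q7 = final_average(first_six + [40])   # below every earlier score, so it is dropped
high_q7 = final_average(first_six + [75])  # replaces the lowest earlier score (58)
```

Under these assumptions, a weak Unit 7 score leaves the average unchanged, and a stronger one raises it.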
Results
Sensitivity Power Analyses
Three confirmatory hypotheses were tested. (1) Students with retrieval practice will show higher performance than those without (between groups). A sensitivity power analysis using G*Power (Faul et al., 2007), with α = .05 and 1 − β = .95, yielded d = 0.34, indicating that n = 384 was sufficient to detect a small to medium effect (Cohen, 1988; Cumming & Calin-Jageman, 2016) in an independent t-test of this hypothesis. (2) Testing effects will be observed across multiple units in an introduction to psychology course. These analyses exclude (listwise) students who did not take the first six quizzes. A sensitivity power analysis using G*Power (Faul et al., 2007), with α = .05 and 1 − β = .80, yielded f² = 0.05, indicating that n = 277 was sufficient to detect a small effect (Cohen, 1988; Selya et al., 2012) in a one-way MANOVA with six dependent variables. (3) No difference will be observed between repeated and new retrieval practice questions (repeated measures). A sensitivity power analysis using G*Power (Faul et al., 2007), with α = .05 and 1 − β = .95, yielded dz = .26, indicating that n = 192 was sufficient to detect a small effect in the nondirectional paired-samples t-test reported below (Cohen, 1988; Cumming & Calin-Jageman, 2016).
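The sensitivity values for the two t-tests can be approximated by hand. The sketch below is not the G*Power procedure, which uses the noncentral t distribution. It substitutes a normal approximation, assumes two equal groups for the independent test, and assumes a one-tailed test for the first hypothesis; the one-tailed setting is an assumption made here, not a setting reported in the text. The function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z(p):
    """Standard normal quantile."""
    return NormalDist().inv_cdf(p)

def min_d_independent(n_total, alpha=0.05, power=0.95, tails=1):
    """Smallest detectable Cohen's d for two equal groups (normal approximation)."""
    n_per_group = n_total / 2
    return (z(1 - alpha / tails) + z(power)) * sqrt(2 / n_per_group)

def min_dz_paired(n_pairs, alpha=0.05, power=0.95, tails=2):
    """Smallest detectable dz for a paired-samples t-test (normal approximation)."""
    return (z(1 - alpha / tails) + z(power)) / sqrt(n_pairs)

d_h1 = min_d_independent(384)   # about 0.34 under these assumptions
dz_h3 = min_dz_paired(192)      # about 0.26 under these assumptions
```

Both approximations land close to the reported sensitivity values, which suggests that the stated sample sizes and effect sizes are mutually consistent.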
Hypothesis 1
Students with retrieval practice distributed throughout a lecture will demonstrate higher performance on delayed tests than those who did not have retrieval practice. The mean scores (percent correct) on Q1 through Q6 for classes with and without retrieval practice were submitted to an independent t-test, t(382) = 3.94, p < .001, 95% CI [2.27, 6.81], d = 0.40. Levene’s test, F(1, 382) = 0.92, p = .34, indicated equal variances. Students in the retrieval practice condition (M = 67.29, SE = 0.83) scored significantly higher than those without (M = 62.66, SE = 0.80), supporting this hypothesis. This is a medium effect size (Cohen, 1988; Goulet-Pelletier & Cousineau, 2018), and the true population mean difference between groups is likely to lie between 2.27 and 6.81 quiz points, which could be an important increase in the assigned letter grade for many students.
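As a rough consistency check, the reported effect size and confidence interval can be related back to the t statistic. The sketch below assumes equal group sizes of 192, which is an assumption because only the total implied by the degrees of freedom is reported here. It also uses 1.966 as an approximate two-tailed critical t for df = 382. The helper names are illustrative.

```python
from math import sqrt

def cohens_d_from_t(t, n1, n2):
    """Cohen's d implied by an independent-samples t statistic."""
    return t * sqrt(1 / n1 + 1 / n2)

def ci_midpoint_and_halfwidth(lower, upper):
    """Point estimate and margin of error implied by a symmetric CI."""
    return (lower + upper) / 2, (upper - lower) / 2

# Reported: t(382) = 3.94, 95% CI [2.27, 6.81]; equal groups of 192 are assumed.
d = cohens_d_from_t(3.94, 192, 192)
midpoint, margin = ci_midpoint_and_halfwidth(2.27, 6.81)
implied_t = midpoint / (margin / 1.966)  # 1.966 approximates the critical t for df = 382
```

Under these assumptions the implied d is about 0.40 and the implied t is about 3.93, both in line with the reported values.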
Hypothesis 2
The testing effect will be consistent across multiple sections of Introduction to Psychology but may not be observed for biopsychology, sensation and perception, or health topics. Quiz 1–6 scores (percent correct) were submitted to a one-way MANOVA using Retrieval Practice (With, Without) as the independent variable, Wilks’ Λ = .66, F(6, 270) = 23.51, p < .001,
Retrieval Practice Conditions Means (M), Standard Errors (SE) for Six Quizzes.

Point advantage for retrieval practice across six quizzes. Note. Figure shows the average difference between students with and without retrieval practice for each quiz (abbreviated as Q). * p < .05.
Because students with 0 grades were excluded, listwise, from this MANOVA, the loss of power may have hindered detection of a small effect for Q2 or Q6 (i.e., Type II error). This is unlikely because the sensitivity power analyses confirmed sufficient sample sizes to detect small effects, and the lack of significant difference between retrieval practice groups for these two quizzes was confirmed by higher-powered independent t-tests for Q2, t(362) = 0.95, p = .34, 95% CI [−1.47, 4.29], d = 0.10; and Q6, t(362) = 1.22, p = .22, 95% CI [−1.21, 5.17], d = 0.13. These analyses suggest that retrieval practice during lectures generally improves quiz performance to a degree that can impact letter grades, but the retrieval practice used here was not sufficient to reveal testing effects for health (Q2) or perception and biopsychology (Q6). Table 2 shows percentile ranks for grade ranges used by the instructor, which provides a more tangible understanding of the effects of retrieval practice on grades.
Retrieval Practice Comparison of Percentile Rank for Each Grade Range.
Note. Shown are the grade ranges and percentile ranks with and without retrieval practice, excluding any optional credit students completed. By this metric, more students achieved higher grade averages with retrieval practice.
Hypothesis 3
No testing effect differences will be observed between retrieval practice questions that are identical to or closely related to questions on the delayed test. The means across quizzes with repeated retrieval practice questions (Q1, Q4, Q6) (M = 67.10, SE = 0.85) and new retrieval practice questions (Q2, Q3, Q5) (M = 67.37, SE = 0.91) were compared using a paired-samples t-test, revealing no significant difference, t(190) = −0.48, p = .63, 95% CI [−.84, 1.38], d = .03. To assess whether the lack of difference between these quiz groupings was comparable for students without retrieval practice, the mean quiz scores for the two quiz groupings (Q1, Q4, Q6 and Q2, Q3, Q5) were submitted to a 2 (Retrieval Practice: With versus Without) × 2 (Quiz Grouping) mixed factor ANOVA. This revealed no interaction effect, F(1, 381) = 2.56, p = .11,

Comparisons of repeated and new retrieval practice questions to no retrieval practice. Note. Figure shows no significant difference between repeated and new retrieval practice questions; both yielded equally and significantly better quiz performance than no retrieval practice. In the retrieval practice conditions, Quizzes 1, 4, and 6 used repeated questions and Quizzes 2, 3, and 5 used new questions. * p < .01.
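For the paired comparison of repeated and new questions under Hypothesis 3, the standard error, effect size, and confidence interval can be recovered approximately from the reported means and t statistic. The sketch below assumes the difference is taken as repeated minus new and uses 1.9725 as a table value for the two-tailed critical t at df = 190. The variable names are illustrative.

```python
from math import sqrt

# Reported summary values for Hypothesis 3 (repeated vs. new questions).
mean_repeated, mean_new = 67.10, 67.37
t_stat, df = -0.48, 190
t_crit = 1.9725  # assumed two-tailed .05 critical t for df = 190 (table value)

n_pairs = df + 1                       # paired test: df = n - 1
mean_diff = mean_repeated - mean_new   # repeated minus new
se_diff = mean_diff / t_stat           # since t = mean_diff / se_diff
dz = abs(t_stat) / sqrt(n_pairs)       # standardized paired effect size
ci = (mean_diff - t_crit * se_diff, mean_diff + t_crit * se_diff)
```

Under these assumptions the interval runs from about −1.38 to 0.84. Its bounds match the reported magnitudes and span zero, as a nonsignificant difference requires, and dz is about .03.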
Post Hoc Tests
The following analysis addresses the confound of non-random assignment to classes and, therefore, conditions. To determine whether the classes with retrieval practice contained naturally higher performing students than those without, scores on Q7 were submitted to an independent-samples t-test, t(124) = 0.55, p = .58, 95% CI [−3.35, 5.93], d = .10, suggesting no difference between the sections with (M = 59.67, SE = 1.92) or without (M = 60.95, SE = 1.42) retrieval practice when both were treated equally. Q7 was taken by students who wanted to boost their final grade (n = 75 without, n = 51 with), and fewer students did so in the retrieval practice condition. These findings suggest that classes with retrieval practice were not generally higher performing than those without and that more students with retrieval practice were satisfied with their potential final grade.
Discussion
The results supported the three hypotheses. Distributed retrieval practice throughout lectures consistently enhanced student performance on delayed tests, with the exception of quizzes focused on health (Q2) and on biopsychology and sensation and perception (Q6); no differences were observed between repeated and new questions. The current methodology incorporated suggestions from the literature in a way that was feasible and sustainable in an authentic classroom. The only addition to workload in my existing introductory psychology course was creating new multiple-choice retrieval practice questions. This was generally easy to do and will require minimal modification over time, and many instructors maintain a test bank where alternate questions already exist. Determining the minimally sufficient quantity of retrieval practice questions was not attempted here, but a future study could empirically test that boundary. While findings of others indicate that the time between retrieval practices should approximate 15 minutes (Burke & Ray, 2008), the procedure used here suggests that more flexible intervals (15–25 minutes) can be sufficient for testing effects.
There are several possible reasons for not finding testing effects on Q2 (health) and Q6 (biopsychology and perception). One possibility, suggested in the introduction, is that students have inherent difficulty integrating these topics into existing knowledge because of a high amount of jargon, anatomical terms, and complexity (Van Gog & Sweller, 2015). In addition, the current failure to replicate Chan’s (2009) and Woolridge et al.’s (2014) testing effects with repeated questions on Q6 may be due to this study’s longer interval (many days) between retrieval practice and delayed test. Forgetting is more likely for loosely connected material when the interval between retrieval practice and delayed test spans days (Abel & Bauml, 2012; Van den Broek et al., 2016), but that may not be true for more highly integrated material (Adesope et al., 2017). The complexity and low integration of Q2 and Q6 topics, coupled with the long delay, may have reduced the likelihood of a retrieval practice effect on these quizzes. Perhaps retrieval practice questions that facilitated more elaboration or deeper processing would have revealed testing effects for these topics (Jenson et al., 2014, 2020; Pyc & Rawson, 2009).
Future Considerations
While the purpose of this study was to take a simple and broad approach comparing overall performance of groups with and without retrieval practice, future studies can incorporate higher control while preserving external validity. For example, the retrieval effort hypothesis suggests that students who answered retrieval practice questions (aloud or mentally) may have benefited more than those who only listened, but individual student effort was not recorded and therefore could not be directly tested. The dependent variable of quiz scores included students who were vocally and mentally engaged, students who were present but less engaged, and students who were absent. The lack of control over individual student responses makes these data somewhat noisy, with effect sizes similar to other studies that did not monitor individuals, but lower than is typical for experimental studies (Adesope et al., 2017). On the one hand, that is the authentic classroom, but it is also difficult to draw strong conclusions from noisy data. A future study can replicate the current method using participation and attendance covariates (or factors), tracked by the instructor in small classes or by teaching assistants (graduate or undergraduate) in large classes. Where tracking individual student engagement in retrieval practice poses a problem for instructor workload, consider that a capstone course consisting of a small group of undergraduate assistants can monitor individual students’ contributions, benefitting students, student assistants (Hauhert & Grahe, 2015), and the instructor.
Two additional considerations for conducting authentic classroom research are that it is less conducive to open source practices and that control conditions have a limited shelf life. Open source materials are not appropriate when instructors prefer their test questions to remain private, and open source data reduce participant confidentiality. Anyone with access to student records at my institution could determine the names of students in the courses used here, which is especially problematic for small classes. This is unlike laboratory studies that maintain informed consent behind locked barriers and use codes for participant responses. In addition, the current study utilized a control condition from the immediately preceding semesters, but doing so required that lectures, tests, and assignments not be updated. By the following semester (Spring 2020), any new manipulations were not comparable to that control condition due to course updates and the history confound of the pandemic. Even without historical events, demographics change over time. While new control condition data could be collected, doing so returns to the ethical issue of preserving internal validity by deliberately avoiding teaching methods that benefit students. As such, I cannot replicate my own study. While there may not be a way to address the exclusion from open source, the best recourse for scholarship of teaching in an authentic class may be to rely heavily on a well-replicated experimental literature, use only the most recent semesters as controls, and ultimately rely heavily on others to replicate findings in their authentic classrooms.
Instructional Implications
The current study observed testing effects in an authentic classroom using a simple methodology that added minimal workload, required no additional resources, and is easily sustained and updated across semesters. The methodology was informed by findings from laboratory and classroom studies suggesting a testing effect when feedback follows retrieval practice, when retrieval practice and delayed test questions cover the same or closely related concepts, and when retrieval practices are introduced after short intervals of material. In the current study, short multiple-choice retrieval practice blocks were presented after 15- to 25-minute lecture intervals as part of the lecture PP. Students volunteered the answers and the instructor provided feedback aloud. Approximately half of the retrieval practice questions were repeated on unit quizzes and half were new and closely related to unit quiz questions. No difference was observed between repeated and new questions, both having an equivalent and significant advantage over classes that had no retrieval practice. Any lecturer, face-to-face or online synchronous, can use the current retrieval practice procedure (read the question aloud, solicit volunteer responses, and provide feedback) at naturally occurring shifts between concepts.
Conclusion
This study is one contribution to an empirical conversation about applying testing effects in an authentic classroom. Internally valid laboratory research, and research that uses the classroom as a pseudo-laboratory, is valuable for informing instructors what to do and how to do it. Practitioners, however, have myriad other concerns, including workload, equity, and resources. These concerns warrant more studies that prioritize ecological over internal validity, whether by replicating the present study or by creatively utilizing resources unique to an authentic teaching context.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
