Abstract
Summer learning loss is a perennial concern for educators and parents alike. However, researchers have recently questioned whether summer learning loss is just a statistical artifact driven by how achievement is measured across the school year. In this study, we empirically investigated a plausible critique of summer learning loss research, namely that students do not put forth their best effort on the fall test compared with the spring test. While we cannot conclude based on our findings that students do in fact lose ground during the summer, we did not find evidence that seasonal differences in test effort are a main driver of summer learning patterns estimated with MAP Growth assessments.
Since Cooper et al.’s (1996) landmark study, a common understanding among educators and the public alike is that students routinely experience summer learning loss (SLL; sometimes also referred to as “summer slide” or “summer setback”) and that the phenomenon helps explain some of the most vexing sources of educational inequality in the United States (e.g., Alexander et al., 2007; Entwisle et al., 2000). More recently, nationally representative data collected by the Early Childhood Longitudinal Study (ECLS-K), 2010–2011 provide our most comprehensive look at SLL in the early grades. On average, this work found near-zero levels of growth during the summers following kindergarten and first grade, which reflects more of a slowdown than learning loss (von Hippel et al., 2018). To understand patterns of SLL beyond first grade, researchers have used NWEA’s MAP Growth assessments, which are administered multiple times a year to millions of U.S. K–8 students. For example, Atteberry and McEachin (2021) used data from almost 18 million students in Grades 1 to 8 and found that the average student lost 17% to 34% of the prior year’s learning gains during summer break. In addition, Kuhfeld et al. (2021) documented racial/ethnic gaps in SLL using test scores from 2.5 million K–8 students. They found that SLL varies greatly across students, and that race/ethnicity and school-level poverty do not explain much of the variance in summer learning patterns. Based in part on this body of research, policymakers and educators have implemented a host of policies to mitigate SLL (National Summer Learning Association, 2019; von Hippel, 2016).
However, there are long-standing questions about whether fall-to-spring growth (and correspondingly, SLL) is actually a statistical artifact driven by differences in fall and spring testing conditions (Borman & D’Agostino, 1996; Keesling, 1984; Slavin et al., 1989; von Hippel, 2019). Examples of these critiques include the use of nonlinked test forms between spring and fall of adjacent grade levels in early SLL research (von Hippel & Hamrock, 2019), the role of teachers’ coaching to potentially boost spring performance and depress fall test scores (Slavin et al., 1989), and other potential differences in students’ motivation and test effort across the fall and spring assessments (Baird & Pane, 2018; Slavin, 2020). Researchers have also pointed to a lack of replication in findings about the magnitude of SLL as evidence that SLL estimates may be very sensitive to how measures are constructed (von Hippel, 2019; von Hippel & Hamrock, 2019). Given the high cost of providing summer learning programs designed to provide enriching summer experiences, typically with a goal of mitigating SLL (Schwartz et al., 2021), it is important to understand whether our concerns about SLL are distorted or even unfounded.
In this study, we assess the relationship between SLL patterns and differential test effort using measures of test effort and score quality. For test effort, we use a test effort metric supported by decades of validity evidence: the proportion of items on which a student responded so quickly that they could not have understood the item’s content (a behavior referred to as “rapid guessing” [Wise, 2017]). By score quality, we refer to metrics associated with whether individual scores are trustworthy, such as overall test duration, students’ standard error of measurement (SEM), and percentage of items answered correctly. If, say, the SEM is very high, or the individual is getting far more items wrong than would be expected on an adaptive test, then the validity of the scores for their intended uses may be called into question. Note that, whereas our effort metrics are supported by extensive research as being valid for quantifying test effort (summarized by Wise, 2015), deviations in the score quality metrics could be caused by factors other than low effort and are therefore used as evidence to corroborate effort results. We use MAP Growth assessment data that are often the basis for analysis in the SLL literature (e.g., Atteberry & McEachin, 2021; Kuhfeld et al., 2021; von Hippel & Hamrock, 2019) to address two research questions:
Background
SLL and Test Effort
Differences in summer learning patterns have been widely assumed to reflect disparities in family resources and access to enriching summer activities, such as summer camps, libraries, or other summer learning programs. However, family background explains a very small percentage of variation in SLL patterns (Burkam et al., 2004; von Hippel et al., 2018). Given the lack of documented evidence connecting student/community factors and SLL, researchers have speculated about whether SLL reflects construct-irrelevant variance. For example, Baird and Pane (2018) suggested summer test score declines may actually reflect differences in testing conditions, including implicit or explicit pressures on students or educators to do well on spring assessments. In particular, interim achievement tests given in the spring are often used as benchmarks for how students will do on state summative tests a few weeks later, which could give teachers (and perhaps students themselves) an incentive to push for greater effort on the spring administrations given the scores’ use as an accountability benchmark. Alternatively, interim tests given in the spring are likely given lower priority than summative tests, which could actually reduce effort on the interim measure if it is more of an afterthought than in the fall. Slavin (2020) expanded on this theory, noting that fall-winter gains on MAP Growth assessments are much larger than winter-spring gains, which he takes as evidence that the fall score is downwardly biased.
Metrics Relevant to Understanding Test Effort
The validity of an inference from a given test score assumes students’ responses to all items reflect their knowledge of the domain of interest. Low effort violates this assumption. Currently, several metrics can be used to help understand a given student’s test effort, most of which rely on metadata from computer-based tests (CBTs) and, in particular, computer-adaptive tests (CATs). While some metrics have been designed specifically to quantify test effort (e.g., response time effort, described below), others are more general indicators of the quality of a test score (e.g., percentage of correct responses to a computer-adaptive assessment and the SEM). Regardless, both types of metrics can be used to substantiate conclusions related to differential effort across testing periods (e.g., fall vs. spring). In the following sections, we review the body of literature on test effort measures and other metrics of score quality that can be used to evaluate whether estimates of SLL are sensitive to construct-irrelevant factors (such as differential effort) on fall tests.
Response-Time Effort
Schnipke and Scrams (2002) divided test examinees into two categories: those exhibiting “solution behavior” and those exhibiting “rapid-guessing behavior.” Students in the latter category, who respond to a test item without sufficient time to have understood the question, are not engaged with the test during that item (Wise & Kong, 2005). Wise and Kong (2005) used an empirical approach based on the response time distribution for a given item to identify rapid-guessing behavior and generate an overall measure of a student’s test-taking effort, which they term response time effort (RTE). RTE scores range from 0 to 1 and represent the proportion of test items on which the student exhibited solution behavior. Supplemental Appendix A (in the online version of the journal) provides a description of how individual items are flagged as disengaged and the validity evidence supporting the use of this metric to flag disengaged test-takers. While RTE is supported by extensive validity evidence for its intended use in CBT/CAT settings (see Wise, 2015), the methods on which it relies nonetheless have limitations. For example, individuals could be disengaged in ways unrelated to effort, an issue we discuss more in the study’s conclusion. In addition, RTE was specifically designed to avoid overidentifying low-effort test-takers and therefore can be a conservative measure of disengagement (Wise & Kuhfeld, 2021). As a result, Type II errors (failing to flag a student who was in fact disengaged) are a definite possibility when identifying disengaged students.
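As a rough illustration of the RTE definition, the sketch below computes the proportion of items answered with solution behavior. The per-item time thresholds, the example response times, and the function name `response_time_effort` are hypothetical; the study's actual flagging rules are described in its Supplemental Appendix A and are not reproduced here.

```python
from typing import Sequence


def response_time_effort(
    response_times: Sequence[float],
    thresholds: Sequence[float],
) -> float:
    """Proportion of items showing solution behavior (RTE, between 0 and 1).

    An item counts as a rapid guess when its response time falls below that
    item's threshold. The thresholds used here are illustrative assumptions.
    """
    if len(response_times) != len(thresholds):
        raise ValueError("need one threshold per item")
    if not response_times:
        raise ValueError("need at least one item")
    effortful = sum(
        1 for rt, cut in zip(response_times, thresholds) if rt >= cut
    )
    return effortful / len(response_times)


# Hypothetical student: five items, response times in seconds.
times = [12.0, 1.5, 30.0, 2.0, 18.0]
cuts = [3.0, 3.0, 5.0, 3.0, 5.0]
rte = response_time_effort(times, cuts)  # items 2 and 4 fall below their cutoffs
```

Because a response is flagged only when it falls below the cutoff, a student who answers slowly but carelessly would still receive a high RTE in this sketch, which mirrors the conservative character of the metric noted above.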
Test Duration
Overall test duration, or the number of minutes that students took to complete the test, is another measure of score quality (e.g., Kuhfeld & Soland, 2020; Soland, 2018). Prior research suggests students who complete a test much faster than is typical are not fully engaged with the material (e.g., Wise, 2015). Furthermore, there is evidence that students who spend longer on items are more motivated and conscientious, suggesting that test duration is likely related to test effort (Soland et al., 2019).
Percent Correct
On a CAT like MAP Growth, items are optimally targeted to a student’s estimated achievement (i.e., items are selected to be maximally informative based on the student’s prior correct/incorrect responses). As a result, students should get questions correct about 50% of the time. When students have a proportion correct on the test that is far lower than 50%, one might worry that the student is giving less than complete effort and randomly guessing across the test items. That is, a proportion correct well below 50% indicates that students frequently responded incorrectly to items matched to their estimated achievement level.
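The 50% benchmark can be traced to the Rasch model that underlies MAP Growth scoring. A brief derivation follows, writing \theta for a student's achievement and b for an item's difficulty (symbols introduced here for illustration, not taken from the article):

```latex
P(X = 1 \mid \theta, b) = \frac{\exp(\theta - b)}{1 + \exp(\theta - b)},
\qquad
b = \theta \;\Rightarrow\;
P(X = 1 \mid \theta, b) = \frac{\exp(0)}{1 + \exp(0)} = \frac{1}{2}.
```

In practice an adaptive test can only approximately match b to \theta, so an engaged student's observed percent correct would be expected to fall near 50% rather than exactly on it.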
Standard Error of Measurement
When tests are scored using item response theory (IRT), each student’s score is accompanied by an SEM that helps quantify the precision of that score. SEMs may also help identify instances when a student did not provide full effort on the test. On MAP Growth in particular, students often have an SEM of around 3 scale score points. While deviations from that typical SEM can occur for a variety of reasons, low effort is one possible explanation. Thus, while an anomalously high SEM is not necessarily a sign of low effort, it could raise suspicions if other metrics such as RTE also indicate low motivation.
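To make the link between item targeting and SEM concrete, the sketch below computes a Rasch-based SEM as the inverse square root of test information. It shows one possible mechanism only: if erratic responses lead an adaptive test to administer poorly targeted items, information falls and the SEM rises. The test length of 40 items and the logit-to-RIT slope of 10 are assumptions made for illustration; neither value is stated in this article.

```python
import math
from typing import Sequence


def rasch_probability(theta: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def rasch_sem(theta: float, difficulties: Sequence[float]) -> float:
    """SEM in logits: 1 / sqrt(test information) evaluated at theta."""
    information = sum(
        p * (1.0 - p)
        for p in (rasch_probability(theta, b) for b in difficulties)
    )
    return 1.0 / math.sqrt(information)


# Assumed for illustration: 40 items and a logit-to-RIT slope of 10.
LOGIT_TO_RIT_SLOPE = 10.0

# Items matched to the student's achievement (difficulty equals theta).
well_targeted = rasch_sem(0.0, [0.0] * 40) * LOGIT_TO_RIT_SLOPE

# Items three logits too hard for the student.
poorly_targeted = rasch_sem(0.0, [3.0] * 40) * LOGIT_TO_RIT_SLOPE
```

Under these assumptions the well-targeted test yields an SEM a little above 3 RIT points, in line with the typical value cited above, while the poorly targeted test yields an SEM more than twice as large.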
Lingering Questions Related to Test Effort and SLL
The question of whether test effort explains SLL patterns has received little empirical investigation so far. Kuhfeld and Soland (2020) found that SLL estimates were not significantly different under two approaches for adjusting for rapid guessing, including (a) filtering out students with low RTE and (b) adjusting students’ test scores using an approach called effort-moderated scoring (Wise & DeMars, 2006). However, that study only examined reading test scores for the summer after fourth grade, and one cannot therefore be sure results generalize to other grades and subjects.
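The core idea of effort-moderated scoring, as we understand it, is to estimate achievement from effortful responses only. The sketch below is a simplified stand-in rather than the procedure of Wise and DeMars (2006) or of this study: it computes a Rasch maximum-likelihood estimate by Newton-Raphson after dropping flagged items. The function name, item difficulties, and responses are all hypothetical.

```python
import math
from typing import Sequence


def rasch_mle(
    responses: Sequence[int],
    difficulties: Sequence[float],
    effortful: Sequence[bool],
    iterations: int = 50,
) -> float:
    """Rasch ability estimate (logits) using only effortful responses.

    Items flagged as rapid guesses are left out of the likelihood. A finite
    estimate needs at least one correct and one incorrect effortful response.
    """
    kept = [(x, b) for x, b, ok in zip(responses, difficulties, effortful) if ok]
    score = sum(x for x, _ in kept)
    if not 0 < score < len(kept):
        raise ValueError("need mixed effortful responses for a finite MLE")
    theta = 0.0
    for _ in range(iterations):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for _, b in kept]
        gradient = score - sum(probs)
        information = sum(p * (1.0 - p) for p in probs)
        # Newton-Raphson step, capped at one logit for stability.
        theta += max(-1.0, min(1.0, gradient / information))
    return theta


# Hypothetical six-item test; the last two responses are rapid guesses.
answers = [1, 1, 0, 1, 0, 0]
item_difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
flags = [True, True, True, True, False, False]

naive = rasch_mle(answers, item_difficulties, [True] * 6)
moderated = rasch_mle(answers, item_difficulties, flags)
```

In this made-up case the moderated estimate is higher than the naive one because the two incorrect rapid guesses no longer pull the score down.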
Methods
Sample
The data for this study are from NWEA’s anonymized longitudinal student achievement database. School districts use NWEA’s MAP Growth assessments to monitor elementary and secondary students’ reading and mathematics growth throughout the school year, with assessments typically administered in the fall, winter, and spring. MAP Growth is a CAT that precisely measures achievement even for students above or below grade level and is vertically scaled to allow for the estimation of gains across time. Test scores are reported on the RIT (Rasch unIT) scale, which is a linear transformation of the logit scale units from the Rasch IRT model. Reliability and validity evidence to support the use of MAP Growth to monitor student achievement and growth within and across grades is described in NWEA (2019). MAP Growth is used by districts for a range of purposes, including progress monitoring, universal screening and placement decisions, predicting student performance on state assessments, evaluating programs, and occasionally in school/teacher evaluation systems.
We use the test scores of approximately 2.7 million kindergarten to seventh-grade students in 12,957 public schools across the United States. In this study, we follow students across two school years (2017–2018 and 2018–2019) and one summer break (summer of 2018). We chose these school years because they are the most recent years of data collected that were not disrupted by the COVID-19 pandemic. The NWEA data also include demographic information, including student race/ethnicity and gender, although student-level SES is not available. Table 1 provides descriptive statistics for the sample by subject and grade. A comparison of the 12,957 schools in our sample relative to the U.S. population of public schools (78,153 schools serving Grades K–8) is provided in Appendix B of Supplemental Materials (in the online version of the journal). Overall, the sample closely aligns to the characteristics of U.S. public schools, with a slight overrepresentation of Black students, underrepresentation of Hispanic students, and slight overrepresentation of urban schools.
Table 1. Sample Characteristics of the Full Analytic Sample
Note. FRPL = free or reduced-price lunch.
Analytic Approach
We employ the four metrics described in the “Background” section to quantify students’ test effort (or score quality more generally): (a) RTE, (b) overall test duration, (c) percent correct, and (d) SEM. Based on each of these measures, we created filters to remove students who showed signs of low effort on a given test event. We describe the thresholds employed for filtering in Appendix A of Supplemental Materials (in the online version of the journal). Supplemental Table A1 (in the online version of the journal) describes the characteristics of each of the filtered samples. Supplemental Appendix C (in the online version of the journal) describes the methodology used to produce the SLL estimates within each subsample. As a further sensitivity test, we also examine the effects of low effort on summer loss using an item-level rescoring approach to remove noneffortful responses (prior research conducted by Rios et al., 2017, shows that removing examinees can be problematic if test effort is correlated with the student’s true achievement). SLL results were not sensitive to this alternative approach for accounting for disengagement (see Supplemental Appendix D in the online version of the journal).
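The filtering step can be sketched as follows. The cutoff values, field names, and example test events are placeholders invented for illustration; the thresholds actually used in the study are documented in its Supplemental Appendix A and are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ScoreEvent:
    student_id: str
    rte: float               # proportion of effortful responses, 0 to 1
    duration_minutes: float  # total time on the test
    percent_correct: float   # 0 to 100
    sem: float               # in RIT points


# Placeholder cutoffs chosen only for illustration.
FILTERS: Dict[str, Callable[[ScoreEvent], bool]] = {
    "rte": lambda e: e.rte >= 0.90,
    "duration": lambda e: e.duration_minutes >= 15.0,
    "percent_correct": lambda e: e.percent_correct >= 35.0,
    "sem": lambda e: e.sem <= 4.5,
}


def apply_filters(events: List[ScoreEvent], names: List[str]) -> List[ScoreEvent]:
    """Keep only the test events that pass every named filter."""
    return [e for e in events if all(FILTERS[n](e) for n in names)]


events = [
    ScoreEvent("a", rte=1.00, duration_minutes=52.0, percent_correct=51.0, sem=3.1),
    ScoreEvent("b", rte=0.70, duration_minutes=11.0, percent_correct=30.0, sem=5.2),
    ScoreEvent("c", rte=0.95, duration_minutes=12.0, percent_correct=48.0, sem=3.3),
]
kept_rte = apply_filters(events, ["rte"])        # RTE filter only
kept_all = apply_filters(events, list(FILTERS))  # all four filters combined
```

Keeping each filter as a separate named rule makes it straightforward to build one restricted sample per metric as well as a sample that applies all four at once.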
Results
Question 1. Descriptive Patterns by Season
We first examine descriptive patterns of student test scores and effort/quality metrics across term and grade cohort. Table 2 provides these descriptive statistics for math using the full analytic sample (see Supplemental Table A2 in the online version of the journal for the reading results). Overall, the average SEM, RTE, and percentage of correct responses are highly similar between the spring and subsequent fall term. For example, students got an average of 51% of items correct in the spring of fourth grade relative to an average of 50% in the fall of fifth grade. The one exception is overall test duration. In Grades K–3, students spent longer on average on their fall assessment than the spring assessment in the prior grade (a difference of 3–8 minutes, which works out to approximately 4–10 seconds longer per item), while in the later grades test duration was 1 to 2 minutes shorter in the fall than the prior spring (1–3 seconds per item).
Table 2. Averages by Grade/Term/Cohort of RIT Scores and Test Effort for Math
Note. Reading results are presented in Supplemental Appendix A (in the online version of the journal). C1 = Cohort 1 (students in Grades K–1); C8 = Cohort 8 (students in Grades 7–8); RIT = average test score; SEM = standard error of measurement; RTE = response time effort (% of responses that were effortful).
While we did not see strong seasonal patterns in test effort in our overall sample, it is possible that decreased test effort in the fall occurs in a subset of schools that may have attempted to manipulate testing conditions to artificially promote fall-spring growth (such as coaching students to give less effort in the fall as a means of boosting fall-spring gains). In Supplemental Appendix E (in the online version of the journal), we examined whether there are seasonal patterns in test effort and score quality in a subset of schools. School-level patterns were consistent with the overall sample results.
Question 2. The Relationship Between Test Effort and SLL Estimates
In addition, we estimate the sensitivity of SLL estimates to excluding students who showed patterns of lower test effort/quality based on RTE, test duration, SEM, and percent correct in a term. Figure 1 provides a comparison of the SLL estimates (reported as change in RIT score for each month of summer break) for the overall sample as well as our four restricted samples that removed disengaged test-takers (see Supplemental Table C2 in the online version of the journal for the full set of parameter estimates). Across all grades and subjects, we do not find that SLL is sensitive to various ways of removing students with low test effort or generally low-quality test events from our sample. Without filtering, we see an average SLL across grades of 1 to 3 RIT points per summer month. Regardless of whether students are filtered for RTE, test duration, percent correct, SEM, or all four criteria combined, the estimates of SLL remain between 1 and 3 RIT points per summer month.
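The logic of the sensitivity check can be illustrated with a much simpler calculation than the growth model the study uses (described in Supplemental Appendix C). The sketch below compares the mean fall-minus-spring change per summer month in a full sample and in a filtered sample. The scores, the effort flags, and the assumed 2.5-month summer break are all made up for illustration.

```python
from statistics import mean
from typing import List, Tuple

# Each record: (spring RIT, fall RIT, passed the effort filter in both terms).
Record = Tuple[float, float, bool]


def monthly_summer_change(records: List[Record], summer_months: float) -> float:
    """Mean fall-minus-spring RIT change per month of summer break."""
    return mean((fall - spring) / summer_months for spring, fall, _ in records)


# Made-up scores for illustration only.
records: List[Record] = [
    (205.0, 201.0, True),
    (198.0, 193.0, True),
    (210.0, 207.0, False),
    (190.0, 186.0, True),
]
SUMMER_MONTHS = 2.5  # assumed length of summer break

full = monthly_summer_change(records, SUMMER_MONTHS)
filtered = monthly_summer_change([r for r in records if r[2]], SUMMER_MONTHS)
sensitivity = filtered - full  # near zero means filtering changes little
```

A difference close to zero between the filtered and full-sample estimates is the pattern that would indicate SLL estimates are not sensitive to removing low-effort test events.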

Figure 1. Comparison of summer loss estimates from the growth model across various sample criteria.
Discussion
The idea that students lose ground academically over the summer has grown stronger among educators and policymakers. An increasing share of time and resources is spent on providing students with summer programming to mitigate SLL, ranging from relatively hands-off programs that provide students reading material over the summer (Kim, 2006) to robust, multi-year interventions (Augustine et al., 2016). However, researchers have recently called into question whether SLL is a real phenomenon, often positing that differential test effort between fall and spring (due to differences in student motivation and/or explicit teacher coaching) is to blame. If SLL is not real, they suggest, resources to prevent it would be better spent on different policies and programs.
In this brief, we tackle one of the main critiques against the SLL literature: that students put a different amount of effort into the fall test than into the spring test. Regardless of the metric of test effort examined, we did not find evidence that seasonal differences in test effort/quality are a main driver of the SLL phenomenon. An important limitation is that our effort metric (RTE) is conservative in the sense that it is designed to avoid overidentification of lower effort. Furthermore, it likely does not capture other manifestations of low effort, such as when students respond in a typical amount of time but are nonetheless providing suboptimal effort. We used RTE because it is supported by the most validity evidence for its intended use among effort metrics (Wise, 2015), and we try to address its potential limitations using a range of metrics such as test duration, which avoids setting (conservative) response time thresholds separating effortful from noneffortful responses. While our analysis does not rule out all possible construct-irrelevant factors related to SLL, it does suggest that differential test effort between fall and spring is not likely one of them, at least when using MAP Growth assessments and the particular measures of test effort available to us. Future research should examine whether findings vary in contexts in which assessments (including other interim assessments) are used for high-stakes decision-making and when other effort metrics are employed.
Supplemental Material
Supplemental material, sj-pdf-1-epa-10.3102_01623737231165027 for Testing an Explanation for Summer Learning Loss: Differential Examinee Effort Between Spring and Fall by Megan Kuhfeld, James Soland, Brennan Register and Andrew McEachin in Educational Evaluation and Policy Analysis
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Authors
MEGAN KUHFELD, PhD, is a senior research scientist at NWEA. Her research seeks to understand students’ academic and social-emotional learning trajectories and the school and neighborhood influences that promote optimal growth.
JAMES SOLAND, PhD, is an assistant professor of quantitative methods at Curry School of Education and Human Development at the University of Virginia. His research focuses on connections among measurement, estimating growth, and program evaluation, with applied interest in social-emotional learning.
BRENNAN REGISTER, MA, is a PhD student at the University of Maryland, College Park. Her research focuses on investigating the performance of multilevel and standard prediction algorithms on large-scale educational data sets.
ANDREW MCEACHIN, PhD in Education Policy, is the director of NWEA’s Collaborative for Student Growth. His research focuses on understanding the causes and consequences of educational inequities.
