Abstract
How accurate is memory? Although people implicitly assume that their memories faithfully represent past events, the prevailing view in research is that memories are error prone and constructive. Yet little is known about the frequency of errors, particularly in memories for naturalistic experiences. Here, younger and older adults underwent complex real-world experiences that were nonetheless controlled and verifiable, freely recalling these experiences after days to years. As expected, memory quantity and the richness of episodic detail declined with increasing age and retention interval. Details that participants did recall, however, were highly accurate (93%–95%) across age and time. This level of accuracy far exceeded comparatively low estimations among memory scientists and other academics in a survey. These findings suggest that details freely recalled from one-time real-world experiences can retain high correspondence to the ground truth despite significant forgetting, with higher accuracy than expected given the emphasis on fallibility in the field of memory research.
Episodic memories are not simply stories that we tell but rather descriptions of specific past experiences. We implicitly assume that our own memories are accurate (Brewer, 1988), and this assumption guides our decisions and establishes social correspondence (Mahr & Csibra, 2018). On the other hand, memories are not copies of the experiences they represent. First, they are transient—most encoded information is rapidly lost (Ebbinghaus, 1913; Richards & Frankland, 2017). Second, memory is reconstructive, sometimes distorting or augmenting the reality of the past (Bartlett, 1932; Loftus, 2005; Schacter & Addis, 2007).
Whereas the transience of memory has been appreciated for more than a century (Ebbinghaus, 1913), more recent experimental evidence for the fallibility of memory has captured the attention of researchers and the general public. Many factors can increase the probability of distortions or outright confabulations in recall, including suggestive cues (Lindsay, Hagen, Read, Wade, & Garry, 2004; for a review, see Loftus, 2005), interviewing style (Fisher, McCauley, & Geiselman, 1994), and schema-driven expectations (Brewer & Treyens, 1981). There is now no doubt that recall is vulnerable to contamination by these and other factors, and this vulnerability lies at the heart of influential theories emphasizing memory’s constructive and error-prone nature (e.g., Barry & Maguire, 2018; Loftus, 2005; Schacter & Addis, 2007). Yet relatively little is known about accuracy in recall of everyday real-world experiences, in which individuals are actively involved in the encoding experience, and retrieval is not deliberately influenced by error-producing manipulations.
Because the ground truth of memories for personal real-world events is usually unknown, experimenters devised alternative methodologies for measuring memory accuracy. Diary studies, for instance, sometimes revealed striking examples of distortion (Brewer, 1988; Linton, 1975). Similarly, flashbulb memories for public and affectively charged events are often vivid and confidently held even when incorrect (Talarico & Rubin, 2003). These findings led some researchers to conclude that autobiographical memories, particularly remote ones, are essentially narratives constructed out of schematic knowledge and inference (Barclay & Wellman, 1986; Barry & Maguire, 2018). Yet diary-recorded events are self-selected for their emotional significance (Brewer, 1988), and flashbulb events are by definition emotionally arousing and prone to distortion. Critically, neither kind of event is experimentally controlled at the time of encoding, limiting accuracy measurement.
The experimenters’ knowledge (or lack of knowledge) about the objective event details limits the resolution at which memory accuracy can be measured. In lieu of directly measuring the correspondence between recalled and encoded details, researchers often rely on consistency with diary records or initial recall attempts, both of which are likely to be affected by rapid forgetting and to alter subsequent recall attempts. Alternatively, researchers have relied on consistency across individuals’ memories of a shared experience (McKinnon et al., 2015) or correspondence with public records (Neisser, 1981). Measures of memory consistency, therefore, are often restricted to relatively few and coarse details that are repeated across individuals or across time points. Freely recalled autobiographical memories, on the other hand, routinely contain scores of event-specific, idiosyncratic details from single experiences (Levine, Svoboda, Hay, Winocur, & Moscovitch, 2002). The degree to which these recalled details accord with encoded details remains unknown.
To the extent that memory errors occur, they are thought to increase in older age (Devitt & Schacter, 2016; Diamond, Abdi, & Levine, 2020) and with the passage of time (Armson, Abdi, & Levine, 2017; Barclay & Wellman, 1986). This is consistent with findings that memories tend to transform from more detailed to more schematic with increasing age and retention interval (Levine et al., 2002; Sekeres et al., 2016). Most of these findings, however, come from forced-response paradigms that confound accuracy with quantity. Accuracy measurement requires participant control over responding and output-bound (what proportion of retrieved items is accurate?) rather than input-bound (what proportion of encoded items is retrieved?) analyses. These two approaches reflect theoretically orthogonal “correspondence” versus “storehouse” conceptions of memory, respectively (Koriat & Goldsmith, 1996).
Statement of Relevance
How accurate is memory for real-world experiences? In a survey, memory scientists and academics estimated that the answer to this question is around 40%. If this were true, it would mean that we often fool ourselves about what happened in the past, which would have devastating consequences in legal contexts, for example. To put these survey responses to the test, we asked younger and older adults to recall controlled or verifiable real-world experiences. We found that roughly 94% of the details they recalled were accurate. Forgetting (detail loss) increased in older adults and for older memories, but the accuracy of what was recalled did not change. These findings mean that experts’ intuitions about memory accuracy are overly pessimistic. More importantly, they mean that even though memory detail and quantity fade with time and age, when healthy adults freely recall memories of complex, one-time real-world events, they are overwhelmingly accurate.
Note: A companion article by Diamond and Levine (2020) appears online at https://doi.org/10.1177/0956797620958651 and on pages 1557–1572 of this issue. The two articles overlap in several ways, including research participants, but the theoretical issues explored in the two articles are sufficiently distinct to warrant their publication as separate works.
In the present study, we aimed to quantify accuracy in naturalistic recall using complex and immersive yet verifiable real-world experiences, pairing the experimental control of laboratory encoding paradigms with the ecological validity of autobiographical memory. Younger and older participants freely recalled their experiences after delays of 2 days to several years. We augmented the widely used Autobiographical Interview (AI) scoring system (Levine et al., 2002) to measure, at a fine grain, the output-bound proportion of spontaneous memory errors with reference to the objective event details. We then contrasted memory accuracy with detail (i.e., episodic richness) and quantity (i.e., forgetting).
We predicted that output-bound recall accuracy would be high notwithstanding forgetting (loss of quantity), which we expected to increase with aging and the passage of time. To relate our results to perceptions of memory accuracy, we conducted a survey of both memory researchers and academics from unrelated fields in which we asked them to predict the results of our study. Given the strong emphasis on fallibility in memory research, we predicted that these participants would underestimate the accuracy of memory as measured in this study.
Method
Participants
Each of 74 participants (see Table 1) took part in one of two controlled real-world events that were assessed as part of a series of studies designed to investigate aspects of memory retrieval not available in standard autobiographical memory paradigms. The first study investigated the effect of retention interval on recognition memory for a respiratory-mask-fitting procedure (the Mask Fit Test) in 135 hospital employees (Armson et al., 2017). A subset of 33 of these younger participants (referred to hereafter as the “MaskFit-Y” group) freely recalled the Mask Fit Test prior to recognition testing using the AI (Levine et al., 2002; not included in the study by Armson et al., 2017). The second study was designed to compare recognition memory for naturalistic versus laboratory versions of the same event, an audio-guided art tour of Baycrest Hospital, in 84 younger and older adults with little or no previous exposure to Baycrest Hospital (Diamond et al., 2020; we designated this event “Baycrest Tour 1.0” to distinguish it from different tours used in subsequent studies). Free recall of the tour was collected for participants in the naturalistic arm of this study (n = 41; referred to hereafter as the “Tour-Y” and “Tour-O” groups of younger and older participants, respectively). These free-recall protocols were also analyzed in a separate study in which we investigated the dynamics of memory search (Diamond & Levine, 2020). All participants were screened for history of neurological disease, active significant medical or psychiatric illness or consumption of medication that may affect memory, and active or recent substance abuse. These studies were approved by Baycrest’s research ethics board. For additional details, see the Supplemental Material available online.
Table 1. Demographic Data and Remoteness for the Three Encoding-Condition Groups
Note: All Mask Fit Test (MaskFit-Y) participants were younger; the Y designation is for consistency with the Baycrest Tour 1.0 group labels (Tour-Y = younger; Tour-O = older).
Because these data were not collected specifically for the questions in this study, we did not conduct an a priori power analysis. Our first main question was descriptive: How accurate is recall of real-world experiences, given the encoding and recall conditions used in this study? Although the necessary sample size for this is unclear, the stability of our finding was supported by attaining similar estimates across two different events with independent samples (n = 33 and n = 41). Our second main question was about the relation between our empirical accuracy measurements and estimations of memory accuracy among memory scientists (n = 54) and education-matched academics in unrelated fields (n = 326).
The event-encoding conditions
The Mask Fit event
The Mask Fit Test is a unique, complex, and highly stereotyped procedure mandated for all employees at Baycrest Health Sciences (see Fig. 1a). The test is a provincially mandated standardized procedure instituted after the 2003 SARS outbreak. The purpose of the test is to identify properly fitted respiratory masks for all hospital employees. It is temporally extended and rich in multisensory detail while also being highly scripted—everyone is exposed to the same test elements, in the same order, in the same spatial context, and by the same technicians—making it useful as a naturalistic staged event for memory assessment. The procedure consists of the experimenter spraying a bitter solution on each participant’s tongue to ensure taste sensitivity, the participant donning a respiratory mask and a large white fume hood, the experimenter spraying the bitter solution into the fume hood, and the participant reading a passage aloud and making a series of head and body motions (Armson et al., 2017; for a list of all recalled details from the Mask Fit event, see Table S1 in the Supplemental Material). Because MaskFit-Y participants were recruited to the study weeks to years after the event, encoding was fully incidental. We note that past memory studies have used medical procedures as staged events with testing at long delays, particularly in children (e.g., Ornstein, Baker-Ward, Gordon, & Merritt, 1997).

Fig. 1. Schematics of the two event-encoding conditions and example recall excerpts with detail and accuracy coding. In (a), the photos depict three of the scripted components in the Mask Fit Test. The text below the photos shows a recall excerpt for the Mask Fit event. The recall text is shown in black, and detail and accuracy scoring codes (in color) are shown above the recall text. The first field in each code identifies the detail as internal (Int) or external (Ext). The second field is the detail type (“event,” “perceptual,” “place,” “time,” or “emotion/thought”). The third field is the accuracy score (accurate, inaccurate, or unverifiable). The superscripted “1” after the accuracy score indicates that the Mask Fit Test did not take place on the second floor (“other” error), and the superscripted “2” indicates that there was no window in the room (perceptual error). In (b), a map of Baycrest Hospital’s ground floor showing the tour route is shown along with three example photographs of the 27 target items. Below the map, a recall excerpt from a younger (Tour-Y) participant is shown. The superscripted “3” after the accuracy score indicates that the market was on the participant’s right (perceptual error). The superscripted “4” indicates that the ceramic chef was not holding a cake lift (semantic error).
The Baycrest Tour event
The Baycrest Tour event involved a prospectively designed real-world walking tour of the first floor of Baycrest Hospital, guided by a museum-style audio guide (see Fig. 1b). The audio guide controlled the target content, item order (which was also dictated by the tracklike physical layout), and encoding duration for each item. It is publicly available on OSF (https://osf.io/pmt7d/). The first floor of Baycrest is visually rich and contains many art pieces, exhibits, and architecturally distinct areas. The tour route formed a loop through several different sections of the building. Although participants were instructed to approach different target items, the tour route was generally unidirectional. Participants were instructed to examine different target items (e.g., paintings, portraits, and exhibits) and complete different tasks (e.g., locate a particular individual in a frame of portraits, locate a particular item in the gift shop; see Fig. 1b; for a list of all recalled details from the Tour event, see Table S2 in the Supplemental Material). Partway through the tour, participants had an interaction with a research confederate, during which the confederate asked a series of scripted questions. This served to create more recallable content and to increase the self-referential and engaging nature of the tour. Participants were unobtrusively monitored by the experimenter during the tour to confirm consistency of the protocol across participants. For full details of the events, see the Supplemental Material.
Procedure
AI administration and scoring
The AI (Levine et al., 2002) is a standardized semistructured interview and accompanying scoring method for quantifying the detail composition of memory narratives. During the AI, participants freely recalled the event (Mask Fit Test or Baycrest Tour; “Tell me everything you can remember about ___”), followed by general probes as needed to clarify instructions (e.g., “Is there anything else you can tell me about this event?”), followed by a structured specific probe procedure. This specific probe was originally intended to elicit details in people with memory impairment and therefore gives latitude to the examiner to probe for additional content in a manner that varies across participants according to details produced during free recall. We therefore restricted our analyses to free recall and general probe, which were administered in a consistent manner across participants (for full details of the AI administration and scoring, see the Supplemental Material and previous studies; e.g., Levine et al., 2002).
Memories were audio-recorded and then transcribed. Trained scorers segmented memories into component clauses expressing discrete details, each of which was classified as internal (specific in place and time to the target event; “episodic”) or external (not spatiotemporally specific to the target event; “semantic/factual,” “repetition,” “metacognitive,” or “other”). Internal details and nonspecific external details were subclassified as “event,” “place,” “time,” “perceptual,” or “thought/feeling.” Concerning memory detail, our measure of interest was the proportion of total details that were internal (internal-detail proportion), which reflects the richness or density of event-specific details in a memory while adjusting for individual differences in verbosity and event differences in content and duration. Memory scoring was performed by a team of research assistants following a manual written by N. B. Diamond. See the Supplemental Material for training procedures.
Quantity scoring
Internal-detail counts can provide a rough proxy of memory quantity, but they do not speak to how much was not recalled (i.e., forgotten). Quantity of recall was characterized as the proportion of total verifiable details recalled by each participant (see below for explanation of verifiable details). Such a total is required to serve as a denominator indicating what was encoded, analogous to the total number of stimuli in a list in a classic laboratory recall paradigm.
Given the complexity of real-world episodes and individual differences in narrative-style free recall, it is impossible to objectively establish the “true” number of discrete bits of information encoded and recalled across a sample of individuals. To devise a proxy of this measure, for each event, we created corpuses of all unique verifiable details recalled across all participants (assimilating details that were sufficiently similar notwithstanding individual differences in word use), along with the identity of participants who recalled them (for corpuses and recall counts, see Tables S1 and S2; for a similar approach, see McKinnon et al., 2015). We considered any detail that referred to a verifiably universal event feature and that was recalled by more than one person to be one that could have reasonably been encoded by every participant. The resulting corpuses contained 61 details for the Mask Fit event and 209 details for the Tour event. We then returned to the individual protocols and counted the number of corpus details recalled and then divided this sum by the total number of details in the corpus for the identified event. This procedure enabled comparison of memory quantity across participants who experienced the same event.
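The corpus-based quantity measure described above can be illustrated with a minimal sketch. It assumes that each participant's verifiable details have already been assimilated to shared labels by the scorers; the function names, participant identifiers, and toy details are hypothetical and are not taken from the study materials.

```python
from collections import Counter


def build_corpus(recalled_by_participant):
    """Return the set of details recalled by more than one participant."""
    counts = Counter()
    for details in recalled_by_participant.values():
        counts.update(set(details))  # count each detail once per participant
    return {detail for detail, n in counts.items() if n > 1}


def quantity_scores(recalled_by_participant):
    """Proportion of corpus details recalled by each participant."""
    corpus = build_corpus(recalled_by_participant)
    if not corpus:
        return {pid: None for pid in recalled_by_participant}
    return {
        pid: len(set(details) & corpus) / len(corpus)
        for pid, details in recalled_by_participant.items()
    }


# Hypothetical toy data (labels invented for illustration only).
toy_recall = {
    "p01": ["bitter spray", "white hood", "read passage"],
    "p02": ["bitter spray", "white hood"],
    "p03": ["bitter spray", "head motions"],
}
toy_corpus = build_corpus(toy_recall)
toy_scores = quantity_scores(toy_recall)
```

In this toy example, details mentioned by only one participant fall outside the corpus, so they count toward neither the numerator nor the denominator of the quantity score.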
Accuracy scoring
Figure 2 presents a schematic of the accuracy-scoring procedure. Following standard AI scoring, we assessed the accuracy of transcribed memories leveraging the experimental control provided by staged-event encoding conditions (for examples of similar procedures, see Dede, Frascino, Wixted, & Squire, 2016; Evans & Fisher, 2011; McKinnon et al., 2015). By definition, only internal details were amenable to accuracy scoring. Internal details were first deemed verifiable or unverifiable, depending on whether they pertained to any objectively verifiable elements of the events. For example, details about stable features of the environment (e.g., colors of walls or paintings, sizes of rooms or objects, locations) and actions or event features (e.g., looking at a given painting, relative timing) that were controlled by the event administrator (i.e., standardized protocol for Mask Fit Test and the audio guide for Baycrest Tour) were universal for all participants and therefore verifiable. Some nonuniversal or time-varying details (e.g., the time on the clock during a given participant’s tour) could be verified by reference to testing notes recorded by the administrator at the time of encoding. Any internal details that were unobservable or idiosyncratic (not universal) and undocumented were unverifiable. Thoughts are, by definition, unverifiable.

Fig. 2. Schematic of the accuracy-scoring procedure, ordered from top to bottom. Transcribed recall protocols were decomposed into internal and external details using the Autobiographical Interview (Levine, Svoboda, Hay, Winocur, & Moscovitch, 2002). Internal details were classified as verifiable or unverifiable, and verifiable details were classified as accurate or inaccurate. The ultimate measure of memory accuracy was the proportion of verifiable details that were accurate. External details and unverifiable internal details (in light gray) were not included in the accuracy measure. For the error types, see the Supplemental Material available online.
On average, MaskFit-Y participants produced 23.93 (SD = 12.74) verifiable details (mean internal-detail count = 33.64, mean proportion verifiable = .70), Tour-Y participants produced 79.68 (SD = 38.04) verifiable details (mean internal-detail count = 92.45, mean proportion verifiable = .88), and Tour-O participants produced 52.05 (SD = 46.18) verifiable details (mean internal-detail count = 69.58, mean proportion verifiable = .63). As expected, age was associated with decreases in the extent to which episodic details reflected objective and verifiable elements of the events (for analysis of verifiable details, see the Supplemental Material). Overall, all three groups produced many internal details, the majority of which were verifiable and thus amenable to accuracy scoring.
Accurate and inaccurate details included any verifiable details that are true or false, respectively. Thus, every internal detail was scored as accurate, inaccurate, or unverifiable. Inaccurate details were classified according to error type (“sequence,” “labeling,” “perceptual,” “semantic,” “other,” or “recognition cued” [for Tour groups only]; for descriptions and examples of error types, see Fig. 1 and the Supplemental Material).
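As an illustration of how the detail and accuracy measures relate to one another, the following sketch computes the internal-detail proportion, the proportion of internal details that are verifiable, and the output-bound accuracy proportion for a single scored protocol. The data layout (a list of class and code pairs), the function name, and the toy counts are assumptions made for this example; they do not reproduce the actual scoring files.

```python
from collections import Counter


def score_protocol(details):
    """Summarize one scored recall protocol.

    details: list of (detail_class, accuracy_code) tuples, where
      detail_class is "internal" or "external", and
      accuracy_code is "accurate", "inaccurate", or "unverifiable"
      for internal details (None for external details).
    """
    n_total = len(details)
    internal_codes = [code for cls, code in details if cls == "internal"]
    codes = Counter(internal_codes)
    n_verifiable = codes["accurate"] + codes["inaccurate"]
    return {
        # richness of event-specific detail, adjusted for verbosity
        "internal_proportion": len(internal_codes) / n_total if n_total else None,
        # share of internal details that could be checked against the event
        "proportion_verifiable": n_verifiable / len(internal_codes) if internal_codes else None,
        # output-bound accuracy: accurate / (accurate + inaccurate)
        "accuracy": codes["accurate"] / n_verifiable if n_verifiable else None,
    }


# Hypothetical protocol: 10 internal details (7 accurate, 1 inaccurate,
# 2 unverifiable) and 10 external details.
toy_protocol = (
    [("internal", "accurate")] * 7
    + [("internal", "inaccurate")] * 1
    + [("internal", "unverifiable")] * 2
    + [("external", None)] * 10
)
toy_result = score_protocol(toy_protocol)
```

Unverifiable and external details are excluded from the accuracy denominator, so a participant who recalls little can still obtain a high accuracy score.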
Data analysis
Data were analyzed using the R programming language (Version 3.6.3; R Core Team, 2020). We tested group differences using analyses of variance (ANOVAs), when appropriate. To test pairwise group differences, we used Welch’s t tests to account for unequal variances across groups, rounding the degrees of freedom to the nearest whole number, with Bonferroni-corrected p values unless otherwise noted. We report Cohen’s d effect sizes and Pearson correlations with 95% confidence intervals (CIs). All statistical tests were two-sided.
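For readers unfamiliar with these statistics, the sketch below shows one way to compute Welch's t statistic with rounded Welch-Satterthwaite degrees of freedom, a pooled-standard-deviation Cohen's d, and a Bonferroni adjustment. It is written in Python rather than R and uses only the standard library. The pooled-SD definition of d is an assumption, because the exact formula used in the analyses is not stated, and all function names are illustrative.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite df, rounded to an integer."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x), variance(y)  # sample variances (n - 1 denominator)
    se2 = vx / nx + vy / ny            # squared standard error of the difference
    t = (mean(x) - mean(y)) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, round(df)


def cohens_d(x, y):
    """Cohen's d using a pooled standard deviation (assumed definition)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var)


def bonferroni(p_values):
    """Multiply each p value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

The p values themselves require the cumulative t distribution, which the Python standard library does not provide; a statistics package would supply that step.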
Results
Internal versus external details
We first examined the effects of group (MaskFit-Y, Tour-Y, Tour-O) on detail type (internal, external) to determine whether established findings in autobiographical memory for self-selected events generalized to the controlled encoding paradigms used here. We found a significant interaction between group and detail type, F(2, 71) = 22.77, p < .001 (see Fig. 3). Consistent with prior aging research, results showed that the Tour-Y group had significantly higher internal-detail proportions than the Tour-O group, t(24) = 3.92, p = .002, d = 1.29, 95% CI = [0.60, 1.99]. Higher internal-detail proportions for the Tour-Y group versus the MaskFit-Y group, t(52) = 2.98, p = .013, d = 0.74, 95% CI = [0.17, 1.31], were consistent with a recency effect; more recent events are recollected in greater detail than more remote events, although we cannot rule out the possibility that this finding was driven by other differences between the Tour and Mask Fit events. Similarly, we observed a significant negative correlation between retention interval and internal-detail proportion within the MaskFit-Y sample, r(31) = −.42, 95% CI = [−.66, −.08], p = .016 (see Fig. S2B in the Supplemental Material), although caution is warranted in interpreting this effect because of the small sample size. For full details and analysis of internal detail types, see the Supplemental Material.

Fig. 3. Raw internal and external detail counts for each of the three groups. Black circles with white fill indicate group means. Error bars on the black circles indicate standard errors of the mean. Smaller colored dots depict individual participants and are slightly horizontally jittered to reduce overlap. In each box, the central horizontal line indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. The whiskers extend 1.5 times the interquartile range. MaskFit-Y = Mask Fit Test, younger participants; Tour-Y = Baycrest Tour 1.0, younger participants; Tour-O = Baycrest Tour 1.0, older participants.
Memory quantity
For each event, we measured recall quantity by creating a corpus of every unique, verifiable, and accurate detail that was recalled by at least two participants and then counting how many of these details were recalled by each participant (see the Method section and the Supplemental Material). There were 209 details in the corpus from the Tour event and 61 from the Mask Fit event (for full lists of all details in each corpus and the number of participants who recalled them, see Tables S1 and S2). This disparity between events is consistent with their different recency and complexity as well as the greater experimental control (and thus a higher proportion of verifiable details) in the Tour event. Given these differences, quantity scores should not be directly compared across the Mask Fit and Tour events. On average, MaskFit-Y participants recalled 23.75% (SD = 10.5%) of corpus details, Tour-Y participants recalled 21.18% (SD = 8.8%), and Tour-O participants recalled 14.63% (SD = 10.6%; see Fig. S4 in the Supplemental Material). Among Tour participants, there was a significant negative effect of age, t(35) = 2.13, p = .040, d = 0.68, 95% CI = [0.03, 1.34]. Among MaskFit-Y participants, recall quantity was negatively correlated with retention interval, r(31) = −.43, 95% CI = [−.67, −.10], p = .014 (see Fig. S2C in the Supplemental Material), declining in curvilinear fashion as a function of time, as expected (Ebbinghaus, 1913; Rubin & Schulkind, 1997).
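The confidence interval reported for this correlation can be approximated with Fisher's z transformation, as sketched below. Whether this was the method used in the original analysis is an assumption; the function name is hypothetical, and the sample size of 33 is inferred from the reported degrees of freedom (df = n − 2).

```python
import math


def pearson_ci(r, n, z_crit=1.959964):
    """Approximate two-sided 95% CI for Pearson's r via Fisher's z.

    r: sample correlation; n: sample size (must exceed 3).
    z_crit: standard normal critical value for a 95% interval.
    """
    z = math.atanh(r)                 # Fisher transformation of r
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper


# Correlation between retention interval and recall quantity (MaskFit-Y).
lower, upper = pearson_ci(-0.43, 33)
```

Under these assumptions, the resulting bounds are close to the reported values of −.67 and −.10. Because the interval is built on the z scale, it is asymmetric around r.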
Younger participants, therefore, recalled roughly one fifth to one quarter of all unique verifiable details, and this number declined with both age and retention interval. Our proxy measure for quantifying the number of encoded details almost certainly underestimated the true number of encoded details, so these results likely reflect an upper bound on memory quantity.
Memory accuracy
We calculated accuracy as the output-bound proportion of verifiable internal details that were accurate with respect to the objective features of the events and the environments in which they were embedded. Mean accuracy proportions were uniformly high, ranging from .93 (SD = .06) in the Tour-O group to .95 (SD = .06) in the MaskFit-Y group (Tour-Y: M = .94, SD = .04). One-sample t tests confirmed that accurate-detail proportion was significantly off ceiling in each group (ps < .001, ds > .85). On the other hand, we detected at least one error in 56 out of 74 participants (75.68%) and two errors or more in 34 participants (45.95%), suggesting that our accuracy analysis was sensitive to errors. Types of errors did not significantly vary by group (see Fig. S1 in the Supplemental Material).
We ran a one-way ANOVA modeling the effect of group on accuracy. In contrast to our observations of group differences in internal versus external details and verifiable details, we found no significant difference in accuracy, F(2, 71) = 0.49, p = .615 (see Fig. 4). Post hoc nonparametric pairwise group comparisons (Mann-Whitney U tests), conducted because of nonnormal distributions within each group, also did not reveal group differences (ps > .30, uncorrected). Indeed, Bayesian two-sample t tests suggested that the groups did not differ in recall accuracy (Bayes factors favoring the null over the alternative hypothesis [BF01s] = 2.49–3.31). Similarly, accuracy did not decline with retention interval in the MaskFit-Y sample, r(30) = −.19, 95% CI = [−.50, .17], p = .303, BF01 = 1.63 (see Fig. S2D in the Supplemental Material). Although Bayes factors lower than 3 indicate, at best, anecdotal evidence for the null hypothesis, these findings together provide no evidence for group- or time-related effects on recall accuracy.

Fig. 4. Accuracy as a proportion of verifiable internal details in each group. Black circles with white fill indicate group means. Error bars on the black circles indicate standard errors of the mean. Smaller colored dots depict individual participants and are slightly horizontally jittered to reduce overlap. In each box, the central horizontal line indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. The whiskers extend 1.5 times the interquartile range. MaskFit-Y = Mask Fit Test, younger participants; Tour-Y = Baycrest Tour 1.0, younger participants; Tour-O = Baycrest Tour 1.0, older participants.
Therefore, despite observing changes in the quantity and detail of memory with increasing retention interval and age, we did not observe a concomitant decline in the accuracy of retrieved details. The absence of group differences in accuracy may have been affected by restriction of range and therefore should be interpreted with caution. Nonetheless, it is clear from these data that participants made few errors (as a proportion of output) in their recall of these events.
Survey of Beliefs About Memory Accuracy
To empirically gauge general impressions about memory accuracy in the memory field and beyond, we polled memory scientists’ and other academics’ estimations of free-recall accuracy, describing encoding and retrieval conditions mirroring those in our current study. We solicited separate estimates of memory quantity (how much of the encoded content would be recalled) and memory accuracy (what proportion of recalled information would be accurate) to ensure that these were not confused (see below for full instructions). Accuracy was framed as the proportion of all freely recalled details that are accurate, and these estimates can therefore be compared with our empirical accuracy data. Beliefs about memory quantity, on the other hand, are harder to relate to our empirical data because the notion of memory quantity is less clear and because we empirically measured quantity as a proportion of all details recalled by two or more participants.
Survey methods
Memory scientists (n = 68; master’s degree = 6, PhD = 28, postdoctorate = 17, principal investigator or professor = 14, other = 3) were recruited at academic memory talks and conferences in 2018, in person and via social media. They completed the survey online via the Qualtrics platform (www.qualtrics.com). They were screened for familiarity with the authors’ work on this topic. Academics in fields unrelated to memory (n = 350; master’s degree = 54, PhD = 148, postdoctorate = 13, professor or faculty member = 107, other = 28) were recruited via e-mail from graduate departments (excluding those related to psychology or cognitive neuroscience) at the University of Toronto. We excluded 22 total participants for estimating “0–10%” for all four survey questions, for either the quantity or accuracy estimates (see below). Participants who failed to respond to all questions (n = 16) were also excluded. After exclusions, the survey sample included 54 memory scientists and 326 other academics.
The instructions were as follows: Imagine the following scenario: A healthy 30-year-old adult attends an audio-guided museum tour as part of a memory experiment. Memory for the tour is tested using free recall (i.e., the person says everything they can remember about the event) 48 hours later. For the following questions, an “encoded detail” is a discrete bit of information that the participant heard and/or saw (e.g., a painting of a yellow sailboat). It does not refer to incidental or irrelevant information that was not attended (e.g., the floor tile was black). “Accurate” refers to the factual correctness of recalled details (e.g., “a painting of an orange sailboat” would be incorrect, if the sailboat was in fact yellow).
What
What proportion of these freely recalled details would be
Participants indicated their responses by clicking one of 10 options: “0–10%,” “10–20%,” “20–30%,” “30–40%,” “40–50%,” “50–60%,” “60–70%,” “70–80%,” “80–90%,” “90–100%.” They were then asked the same questions but with an imagined healthy 70-year-old participant. These questions were then repeated, with the following instructions: “Now, imagine the same scenario, but memory for the tour is tested (again using free recall)
Survey results
Accuracy estimates were low. Median estimated accuracy under the “best” of the theoretical conditions (a younger participant at a 48-hr delay) was 40% (M = 41.3%) among memory scientists and 30% (M = 36.7%) among other academics, indicating that respondents expected more than half of freely recalled details to be inaccurate. There was no significant difference between overall accuracy estimates made by memory experts and other respondents, F(1, 378) = 2.73, p = .099. Compliance with survey instructions was supported by expected effects of the imagined participant’s age, F(1, 378) = 172.49, p < .001, and retention interval, F(1, 378) = 315.59, p < .001, on recall accuracy. Estimates were higher for an imagined 30-year-old versus a 70-year-old (paired Wilcoxon signed-rank test, V = 2,642, p < .001) and for 48 hr versus 2 years (V = 1,790, p < .001; see Fig. 5). These survey results, in conjunction with our findings, support Koriat, Goldsmith, and Pansky’s (2000) assumption that views of human memory accuracy are overly pessimistic. This is true of both memory scientists and other education-matched academics. The survey data about memory quantity are included for completeness in Fig. S5 in the Supplemental Material.
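Because responses were collected as one of 10 percentage ranges, any median or mean depends on how each range is converted to a number. The source does not state the coding rule, so the Python sketch below treats it as an explicit, assumed parameter. The function names and the example responses are hypothetical and are not the survey data.

```python
from statistics import mean, median

# The ten response options as (lower, upper) percentage bounds.
BINS = [(lo, lo + 10) for lo in range(0, 100, 10)]

def bin_value(index, rule="midpoint"):
    """Convert a bin index (0-9) to a percentage under an assumed rule."""
    lo, hi = BINS[index]
    if rule == "lower":
        return lo
    if rule == "upper":
        return hi
    if rule == "midpoint":
        return (lo + hi) / 2
    raise ValueError(f"unknown coding rule: {rule}")

def summarize(responses, rule="midpoint"):
    """Return (median, mean) of the coded responses."""
    values = [bin_value(i, rule) for i in responses]
    return median(values), mean(values)

# Hypothetical responses from seven respondents, as bin indices.
responses = [2, 3, 3, 4, 5, 6, 8]

med_mid, mean_mid = summarize(responses, rule="midpoint")
med_low, _ = summarize(responses, rule="lower")
med_up, _ = summarize(responses, rule="upper")

# An accuracy estimate below 50% implies an expected error rate above 50%.
implied_error_rate = 100 - med_mid
```

The same set of responses yields medians that differ by up to 10 percentage points across the three coding rules, so summary values from binned scales are best read with the bin width in mind.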

Fig. 5. Estimates of memory accuracy made by memory scientists (n = 54; master’s degree = 5, PhD = 24, postdoctorate = 11, principal investigator or professor = 11, other = 3) and academics in fields unrelated to memory (n = 326; master’s degree = 50, PhD = 138, postdoctorate = 10, professor or faculty member = 102, other = 26). The graphs show the proportion of respondents choosing each accuracy estimate for encoded, freely recalled details provided by a healthy 30-year-old (left column) and a healthy 70-year-old (right column) after both a 48-hr delay (top row) and a 2-year delay (bottom row).
Discussion
We combined controlled real-world encoding paradigms with a novel recall-coding system to measure accuracy in naturalistic event recall at the level of single details. We contrasted accuracy with measures of memory quantity and richness of episodic detail. Participants spontaneously and reliably recalled dozens of unique event-specific details (with many recalling 100 or more), most of which were verifiable. On the other hand, these details comprised less than one quarter of encoded details, highlighting that the majority of information making up our experiences is forgotten or omitted from subsequent free recall. Despite age- and time-related declines in memory quantity and episodic detail richness, accuracy was high (93%–95% of verifiable details) in all groups. This suggests that memory for remote (days to years old) real-world episodes is more accurate than expected, as confirmed by our survey of intuitions about recall accuracy, in which the modal predicted error rate was greater than 50%.
Similarly high output-bound accuracy rates in free recall can be seen (although often not discussed) in standard laboratory recall studies, even those highlighting false memory (e.g., see discussion of Deese-Roediger-McDermott paradigm by Koriat et al., 2000). The present results extend these findings to copiously detailed descriptions of real-world experiences recalled up to years later. It is noteworthy that we detected at least one error in the majority of participants’ data. Thus, although infrequent, recall errors do occur under naturalistic encoding and free-recall conditions. It is difficult, however, to identify when errors stem from erroneous mnemonic representations and when they stem from spontaneous language use. Our finding that accuracy was robust to increasing retention interval and age stands in contrast to forced-choice recognition-memory studies of both laboratory and real-world events, in which both factors are associated with increases in susceptibility to false alarms (Armson et al., 2017; Barclay & Wellman, 1986; Devitt & Schacter, 2016; Diamond et al., 2020).
With respect to retention interval, the present results are consistent with those of previous studies showing high output-bound accuracy for naturalistic content after delays of weeks to years (Ebbesen & Rienick, 1998; Poole & White, 1993). But whereas initial event-proximal recall attempts likely enhanced or otherwise altered subsequent memory in these studies, our participants performed free recall for the first time days to years after their experiences, and among Mask Fit Test participants, encoding was fully incidental. Our findings parallel eyewitness-memory studies showing that forced-response accuracy declines with time, as expected, but free recall and high confidence recognition remain overwhelmingly accurate (Evans & Fisher, 2011; Wixted, Read, & Lindsay, 2016). In free recall, participants spontaneously filter out low-confidence information and adjust the “grain size” of memory reports over time, prioritizing accuracy at the expense of detail (Goldsmith, Koriat, & Pansky, 2005). Accordingly, accuracy was high in spite of a low proportion of recalled information (relative to the estimated possible encoded information), and it remained stable in spite of the further reductions in quantity and detail with age and time.
The effects of age on richness of episodic detail parallel those observed with self-selected autobiographical memories (Levine et al., 2002), bolstering interpretations of age-group differences in AI internal details as pertaining to memory processes or representations as opposed to extraneous factors such as event selection (Aizpurua & Koutstaal, 2015). Although sensitivity to age differences in accuracy may have been reduced by ceiling effects, we note that accuracy scores were significantly off ceiling in all three groups. Given the number of verifiable details recalled and the detection of at least one error in most participants, there was ample opportunity for more errors to occur. Our accuracy analysis was necessarily limited to event details that could be verified, which made up a smaller proportion of older adults’ recall narratives than younger adults’ recall narratives. The remaining details pertained to internal states (thoughts and feelings) or other unverifiable parts of the tour, such as transient environmental features. Although the accuracy of these details is unknown, we see no reason to expect that it would be lower than verifiable detail accuracy. Because our sample was composed of healthy older adult volunteers up to age 75 years, we acknowledge that these results may not generalize to individuals who are older or those with age-related disorders affecting memory.
Both events in the present study were more novel and distinctive than most everyday experiences, likely reducing the potential for memory conjunction errors (Devitt, Monk-Fromont, Schacter, & Addis, 2016), “wrong event” errors (Brewer, 1988), or other sources of false memory. Relatedly, the memories under examination here were likely less vulnerable to contamination by group discussion or multiple retellings than more personally significant or social experiences (Marsh, 2007). The events used in this study were not designed to generalize to all everyday events but rather to capture low-level features characteristic of everyday events. Many previous findings of retrieval distortions have used pseudonaturalistic narrative material such as stories or movie clips. It is possible that such materials, being more passively encoded and thematic than everyday experiences, may in fact exaggerate the degree to which memory is warped by schemata and expectation. Moreover, passive laboratory encoding conditions produce more false alarms than real-world experiences even with identical delays and testing procedures (Diamond et al., 2020).
These findings bear on general theories about the nature of episodic memory. Recent literature has emphasized that idiosyncratic details of one-shot experiences are rapidly lost (Misra, Marconi, Peterson, & Kreiman, 2018; Richards & Frankland, 2017), rendering remote episodic autobiographical memories highly or entirely reconstructed (Barry & Maguire, 2018). Although malleability and transience are likely costs of a flexible and future-oriented memory system (Richards & Frankland, 2017; Schacter & Addis, 2007), the converse selective pressures on memory storage and accuracy often go unacknowledged. Our findings, although not suggesting that episodic memories are high-fidelity copies of the experiences they represent, are consistent with findings that detail and accuracy in episodic memory are often strikingly high under unexceptional encoding conditions (Alba & Hasher, 1983; Bainbridge, Hall, & Baker, 2019; Brady, Konkle, Alvarez, & Oliva, 2008). The degree of correspondence between memory and reality is a function of encoding and retrieval conditions as well as the goals of the rememberer (Koriat et al., 2000). Anecdotally, we observed errors in relation to event elements that were confusable or recalled with many details, although we can only speculate about the causes of such errors because event features were not experimentally manipulated. The present study focused on the content of recall; we did not assess the degree to which the sequence of freely recalled details corresponded to the encoded sequence of events. In a separate study, we found that such temporally organized recall dynamics (previously assessed with word-list recall) can be reliably measured in real-world event recall and that these are sensitive to aging (Diamond & Levine, 2020).
Future studies could test theories of false remembering derived from laboratory studies in naturalistic contexts by combining the staged-event method with manipulations at encoding (e.g., by altering the confusability of events) or retrieval (e.g., induction or probing to elicit more details; see St. Jacques & Schacter, 2013) or by comparing patient groups with detail generation versus output-monitoring deficits.
The fact that memory is subject to contamination under certain circumstances does not make memory inherently unreliable (Brewin, Andrews, & Mickes, 2020; Wixted, Mickes, & Fisher, 2018). The conflation of these two issues is understandable given the potentially catastrophic consequences of memory errors in some scenarios and the utility of errors for revealing underlying memory mechanisms. Responses to our survey suggest that memory researchers and other academics indeed estimate recall to be highly unreliable under encoding and free-recall conditions matching those used here (for related survey data, see Brewin, Li, Ntarantana, Unsworth, & McNeilis, 2019). This perception that memory is generally inaccurate has important consequences, particularly in relation to court testimony. The present data are consistent with the view that laboratory evidence of memory errors may have led to an “overcorrection” in views of memory accuracy (see Wixted et al., 2018, and associated commentaries; see also Brewin et al., 2020; Koriat et al., 2000). Overall, our findings suggest that the details produced in freely recalled narrative descriptions of past experiences are highly accurate, and they remain accurate as memory quantity and specificity change.
Supplemental Material
Supplemental material, sj-pdf-1-pss-10.1177_0956797620954812 for The Truth Is Out There: Accuracy in Recall of Verifiable Real-World Events by Nicholas B. Diamond, Michael J. Armson and Brian Levine in Psychological Science
Footnotes
Transparency
Action Editor: D. Stephen Lindsay
Editor: D. Stephen Lindsay
Author Contributions
N. B. Diamond and B. Levine developed the study concept. N. B. Diamond analyzed the data and drafted the manuscript. N. B. Diamond and M. J. Armson collected the data and designed the encoding paradigms. B. Levine provided critical revisions. All the authors approved the final manuscript for submission.
