Abstract
Research in metacognition suggests that the information people use to predict their memory performance can vary depending on the contexts in which they make their predictions. For example, if people judge their memories after a delay from initial encoding, they may be more likely to use retrieved information about the past encoding experience than if they judged memories immediately after encoding. Although this seems intuitive, past behavioral and neuroimaging work has not tested whether delayed memory judgments are more strongly coupled with information about past experiences than immediate memory judgments. We scanned participants using functional MRI while they encoded paired associates and made predictions about their future memory performance either immediately after encoding or after a delay. Consistent with the hypothesis that people use retrieved information about past experiences to inform delayed memory judgments, our results showed that activation patterns associated with past experience were more strongly coupled with delayed memory judgments than with immediate ones.
Did you park near the lake or the water tower when you left to go camping? Monitoring the contents of memory is a critical cognitive function that affects how people prepare for future situations when memories are needed. Lack of confidence may lead to more rehearsal (e.g., that the car is near the lake); greater confidence may lead to no rehearsal at all. Importantly, assessing memory for your car’s location immediately after parking may encourage you to rely on different types of information (cues) than assessing it later in the day. Metacognition research examines how different factors influence memory monitoring by having people make judgments of learning (JOLs) prior to a memory test (e.g., judging memory for paired associates, such as lake–car). In the present study, we examined the neurobiological mechanisms that support memory monitoring and how they utilize different information depending on when memory monitoring occurs relative to initial encoding.
A critical question in neurobiological-metacognition research has been whether brain regions involved in metacognition differ from those involved in primary cognitive processes, such as long-term memory (e.g., Janowsky, Shimamura, & Squire, 1989; Shimamura & Squire, 1986). Early research on this question has suggested that memory monitoring recruits regions that are partly distinct from, and partly overlapping with, those underlying long-term memory. The medial temporal lobes (MTLs) are believed to serve primary memory processes such as encoding and retrieval (Dickerson & Eichenbaum, 2010; Haist, Gore, & Mao, 2001) and also track objective memory performance in metacognitive tasks. In contrast, ventromedial prefrontal cortex tracks subjective memory judgments, and lateral and dorsal medial prefrontal cortex tracks both judgments and memory (Kao, Davis, & Gabrieli, 2005). Such results suggest that primary memory regions such as MTL may not contribute directly to metacognitive judgments.
More recently, however, research has suggested that primary memory regions such as MTL are more likely to be activated for memory monitoring when judgments are made some time after the initial study (Do Lam et al., 2012). For example, when judgment is separated from initial encoding (instead of being measured simultaneously, as in Kao et al., 2005), MTL activation during JOLs tends to be higher for subsequently remembered items, suggesting a potential role for this region as a source of information for monitoring, not a locus of metacognition. Using different methodology, Stiers, Falbo, Goulas, van Gog, and de Bruin (2016) also suggested that regions involved in long-term memory retrieval may track JOLs when separated from initial encoding by a delay. They trained a machine-learning classifier to distinguish between activation patterns associated with long- or short-term memory tasks. When they applied this classifier to brain activation elicited by JOLs, the algorithm classified activation underlying JOLs made at a delay, but not activation for JOLs made immediately after encoding, as examples of long-term memory tasks. However, given the “black-box” nature of machine-learning classifiers trained on large numbers of voxels, what information or aspects of the brain’s response to long- and short-term memory tasks are driving the differences between immediate and delayed JOLs remains an open question.
The finding that JOLs measured after a delay from encoding may be more strongly associated with primary memory regions (e.g., MTL) than are JOLs measured immediately after encoding is anticipated by behavioral theories of JOLs. A well-known finding in the behavioral-metacognition literature is that delayed JOLs tend to be more accurate than immediate JOLs (Nelson & Dunlosky, 1991; Rhodes & Tauber, 2011), presumably because participants rely on retrieval attempts to inform delayed JOLs (Nelson, Narens, & Dunlosky, 2004; Serra & Dunlosky, 2005). In contrast, for immediate JOLs, participants tend to use information from various cues, often unrelated to retrieval, to inform their judgments (e.g., the font size of stimuli; see Magreehan, Serra, Schwartz, & Narciss, 2016), which can reduce the accuracy of those judgments.
When framed as a question of cue utilization, it becomes apparent that a key limitation of previous neurobiological studies on how delay impacts JOLs is that they have focused more on how the delay affects which brain regions are involved rather than on how the delay affects the information that people use to make their judgments. With respect to the camping example above, it is specifically the retrieval of information about where the car is parked that is key to assessing memory at a delay, rather than which brain region is activated. Activation in MTL (or any region) does not answer what information people are accessing or whether they are retrieving information at all. Likewise, accuracy of JOLs in predicting subsequent performance is an indirect measure, at best, of retrieval from long-term memory; higher accuracy for delayed JOLs suggests only that some cue that participants are using during delayed JOLs is more predictive of future performance than cues used immediately after encoding.
Given that regional activation and accuracy are neither necessary nor sufficient to establish that people use information from past experiences to inform delayed JOLs, it is critical to develop methods that specifically measure how the content of memory retrieval is differentially associated with delayed relative to immediate JOLs. Content-based multivariate pattern analysis (MVPA) techniques offer a potential way of testing such hypotheses (Rissman & Wagner, 2012). In content-based MVPA, participants are shown known examples of various image categories or features (e.g., faces, objects, scenes). Later, during retrieval, researchers can estimate the extent to which participants are thinking about, or reactivating, representations of those categories from activation patterns (Polyn, Natu, Cohen, & Norman, 2005; Zeithamova, Dominick, & Preston, 2012). For example, one can predict whether people are remembering a face, object, or scene on the basis of how much their brain-activation patterns overlap with known activation patterns elicited by viewing faces, objects, and scenes. This approach allows us to test, like Stiers et al. (2016), whether delayed JOLs are more strongly coupled with retrieval of information from memory. However, instead of basing retrieval estimates on how strongly an activation pattern resembles processes associated with long-term memory, this method uses information about the similarity of the activation pattern to the content of past study trials.
To test whether delayed JOLs are more strongly coupled with activation of information about previous experiences than are immediate JOLs, we had participants complete a standard paired-associates learning task using pairs of faces, objects, and scenes. On each study trial, participants studied pairs of images, each composed of images from two different categories. On each judgment trial, participants saw one image from each pair (the cue) and judged how likely they would be to recognize its associate later (the target). They made half of the judgments immediately after studying an item (immediate JOLs) and half at a random delay of at least one intervening item (delayed JOLs). If retrieval of information from memory is used more often to inform delayed JOLs, we expected that the similarity of activation patterns to known examples of the target category (faces, objects, or scenes) would be more strongly associated with confidence for delayed JOLs than for immediate JOLs.
We also tested how activation during JOLs tracked confidence and activation of target-relevant information. We selected two a priori regions of interest (ROIs): MTL, thought to be the key region involved in retrieval of information from long-term memory during delayed JOLs (Do Lam et al., 2012), and rostrolateral PFC, known to track confidence and accuracy in metacognitive judgments (Chua, Pergolizzi, & Weintraub, 2014; De Martino, Fleming, Garrett, & Dolan, 2013; Fleming, Huijgen, & Dolan, 2012; Heereman, Walter, & Heekeren, 2015; McCurdy et al., 2013; Morales, Lau, & Fleming, 2018).
Method
Participants
Twenty participants (mean age = 27.1 years, range 18–65; nine women; all right-handed native English speakers) from Texas Tech University and the surrounding community participated for $35 each. We excluded one participant’s data because of motion artifacts, leading to a final sample of 19 participants. Participants gave written informed consent and were screened for MRI safety before beginning the experiment. The Human Research Protection Program of Texas Tech University approved the study protocol. The sample size was determined on the basis of an informal survey of studies on multivoxel reactivation in memory retrieval cited by Rissman and Wagner (2012). No data were analyzed prior to collection of the full sample.
Task
The task consisted of three phases: a localizer phase, a study-and-judgment phase, and a recognition-memory phase. In the localizer phase, participants saw examples of faces, scenes, and objects and indicated which images were of each type by pressing 1, 2, and 3, respectively, on a keypad. Participants had 3 s to respond on each trial. Trials were separated by random fixation (M = 3 s) drawn from a truncated exponential distribution. The localizer phase consisted of two runs of 57 trials each. The goal of this phase was to identify patterns associated with participants’ representations of faces, scenes, and objects and the regions of the brain associated with representing this information.
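The jittered fixation described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name, the scale, and the lower and upper bounds are hypothetical, because the text reports only the mean fixation duration (3 s).

```python
import math
import random

def truncated_exponential(rng, scale, low, high):
    """Draw one duration from an exponential distribution that starts at
    `low` and is truncated at `high`, using inverse-CDF sampling."""
    u = rng.random()
    span = high - low
    # Probability mass of the untruncated exponential that lies in [0, span]
    mass = 1.0 - math.exp(-span / scale)
    return low - scale * math.log(1.0 - u * mass)

rng = random.Random(0)
# Hypothetical parameters (not reported in the text): scale, low, and high.
fixations = [truncated_exponential(rng, scale=2.0, low=1.0, high=8.0)
             for _ in range(57)]  # one fixation per trial in a 57-trial run
```

Truncation keeps every fixation within fixed bounds while preserving the skew toward short intervals that makes exponential jitter useful for separating trial-wise responses.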
In the study-and-judgment phase, participants saw pairs of images (i.e., an object and a scene, an object and a face, or a face and a scene) and then judged how likely they would be to remember a missing item from the pair (the target). No pairs were composed of two images from the same category. We did not include stimuli from the localizer phase in the study-and-judgment phase. On a study trial, participants saw an image pair and studied it for 3 s. On a judgment trial, participants saw one image from the pair (the cue) and were asked to report how confident they were in their memory for the unseen target, using a scale ranging from 1 to 7. The sliding scale started at 4 for each trial (Fig. 1a). Participants had 4 s to move the sliding scale to the desired confidence level, using one button to increase and another to decrease their response. Answers of 4 without the participants’ moving the scale were excluded from further analysis. However, their inclusion or exclusion did not change the nature of any of the outcomes or conclusions. We separated trials by random fixation using the same procedures described above for the localizer phase. All participants received instructions on this study-and-judgment phase before entering the MRI scanner. They completed several practice trials on a laptop with the instructions that they would be studying pairs of images and then judging whether they could recognize the second image in that pair later (none of the practice items were used again during the scanned task). We also told participants that they would judge some pairs immediately after study and some pairs at a delay after study.

Fig. 1. Example trials from (a) the study-and-judgment phase and (b) the recognition-memory phase. In the study-and-judgment phase, participants first studied a pair of items. After a delay, one item was then presented; participants were asked to indicate whether they recognized the item as one they had seen in the study pair and to rate their confidence in that judgment. In the recognition-memory phase, a previously presented cue appeared at the top of the screen, and a target and a distractor image appeared on the bottom. Participants were asked to indicate which of the two images on the bottom had been originally paired with the cue at the top.
The study-and-judgment phase consisted of five runs of 24 study trials and 24 judgment trials presented in a pseudorandom order, with unique pairs on each run (no face, object, or scene ever appeared in more than one pair for each participant). Half of the trials were immediate-JOL trials in which the judgment trial came immediately after the fixation for the study trial for that study pair. The other half of the trials were delayed-JOL trials in which the judgment trial was separated from its study trial by at least one study trial for a different image pair. Aside from this constraint, all other aspects of the sequence were random. Although a delayed JOL could be separated by anywhere from 1 to 47 intervening trials from the initial study trial of that pair, the mean number of intervening trials was 8.89 (SD = 11.16). Our randomization procedure departs from blocking procedures used in the behavioral literature, which typically keep delayed JOLs at a fixed interval from their associated study trials. Nonetheless, it is necessary to avoid blocking immediate-JOL trials and delayed-JOL trials for functional MRI (fMRI) applications so that the rating type does not become associated with low-frequency scanner drift. To avoid frequency effects, we ensured that each run had an equal number of immediate-JOL trials and delayed-JOL trials in which each image category (face, object, and scene) was the cue and the target.
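One way to generate a run that satisfies the ordering constraints described above is sketched below. It is an illustrative assumption, not the authors' randomization code: the function name and the insertion strategy are hypothetical, and the balancing of image categories across cue and target is omitted.

```python
import random

def make_run(n_pairs=24, seed=0):
    """Return (sequence, immediate_pairs) for one run of study/judgment trials."""
    rng = random.Random(seed)
    pairs = list(range(n_pairs))
    rng.shuffle(pairs)
    immediate = set(pairs[: n_pairs // 2])   # judged right after study
    delayed = pairs[n_pairs // 2:]           # judged after >= 1 other study trial

    # Immediate pairs form indivisible [study, judge] blocks;
    # delayed pairs start as a lone study trial.
    units = [[("study", p), ("judge", p)] for p in immediate]
    units += [[("study", p)] for p in delayed]
    while True:
        rng.shuffle(units)
        if len(units[-1]) == 2:  # ending on an immediate block guarantees a valid slot
            break
    seq = [trial for unit in units for trial in unit]

    for p in delayed:
        s = seq.index(("study", p))
        valid = []
        for idx in range(s + 2, len(seq) + 1):
            prev_kind, prev_pair = seq[idx - 1]
            splits_block = prev_kind == "study" and prev_pair in immediate
            study_between = any(kind == "study" for kind, _ in seq[s + 1:idx])
            if not splits_block and study_between:
                valid.append(idx)
        seq.insert(rng.choice(valid), ("judge", p))
    return seq, immediate

sequence, immediate_pairs = make_run(seed=1)
```

Each delayed judgment is inserted at a random position that neither splits an immediate study-judgment block nor precedes the first intervening study trial, so later insertions can only lengthen, never violate, earlier delays.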
After completing each study-and-judgment phase, participants completed a recognition-memory test for those items while still in the scanner, although no neuroimaging data were collected during the recognition-memory phase. The design of the recognition-memory task followed Zeithamova et al.’s (2012) approach and was chosen on the basis of their observed high memory performance (91.8%). High memory performance was desired because our a priori data-analysis plan restricted MVPA to trials with correct responses (see below) to equate subsequent memory effects across levels of confidence. During each recognition phase, participants saw a triad of images (Fig. 1b). A previous cue was at the top of the screen, and the target and a distractor image appeared in random locations (left or right) on the bottom. Participants were asked to indicate which of the two images on the bottom had been originally paired with the cue at the top. The task was self-paced, and each trial was associated with a brief fixed-interval fixation of 3 s. As with the study-and-judgment phase, participants received instructions and completed several practice trials of the recognition phase before entering the MRI scanner.
Imaging acquisition
Neuroimaging data were acquired on a 3T Siemens SKYRA MRI scanner at the Texas Tech Neuroimaging Institute using a 20-channel head coil. Functional runs used a Siemens echo-planar imaging sequence with the following parameter settings: repetition time (TR) = 2,040 ms, echo time (TE) = 25 ms, flip angle (FA) = 70°, field of view (FoV) = 192 mm, matrix = 64 × 64, number of slices = 41, slice thickness = 2.5 mm (0.5-mm gap). Slices were acquired in ascending order in the axial plane and oriented approximately 30° off the anterior commissure–posterior commissure line to reduce orbital dropout (Deichmann, Gottfried, Hutton, & Turner, 2003). A high-resolution magnetization-prepared rapid-acquisition gradient-echo anatomical scan was also acquired for each participant with the following parameter settings: TR = 1,900 ms, TE = 2.49 ms, FA = 9°, FoV = 256 mm, matrix = 256 × 256, slice thickness = 1 mm, slices = 192.
Preprocessing
The FMRIB Software Library (FSL; Version 5.0; Smith et al., 2004) was used for preprocessing of functional images. Preprocessing included conversion from Digital Imaging and Communications in Medicine (DICOM) format to Neuroimaging Informatics Technology Initiative (NIFTI) format, motion correction using a 6-degrees-of-freedom rigid body alignment to the central volume of each run, skull stripping, smoothing using a 6-mm Gaussian kernel (univariate analysis only), and high-pass temporal filtering (100-s cutoff). The autorecon 1 command in FreeSurfer (Version 8.0; Reuter, Rosas, & Fischl, 2010) was used to preprocess anatomical images.
Multivoxel analysis
Trial-by-trial estimates (β maps) of the blood-oxygen-level-dependent (BOLD) response to each localizer trial (during the localizer phase) and each judgment trial (during the study-and-judgment phase) were extracted from the respective functional images using a least squares-all (LS-A) procedure (Mumford, Turner, Ashby, & Poldrack, 2012). The β maps were registered to standard space and z-scored within runs to remove runwise differences in mean and variance (Lee & Kable, 2018). Voxels were selected from the β maps for further similarity analysis by taking 6-mm spheres around each participant’s univariate peak of localizer activation for face, scene, and object trials within predetermined anatomical ROIs for each localizer stimulus type: right inferotemporal cortex (objects), left fusiform gyrus (faces), and parahippocampal cortex (scenes).
Our primary similarity analysis used a similarity-to-image-category approach akin to neural-typicality measures developed in previous studies (Davis & Poldrack, 2014). The neural-typicality approach is based on exemplar models of categorization, which are akin to kernel-density estimators (Ashby & Alfonso-Reese, 1995). Thus, similarity to image category gives a nonparametric estimate of how relatively likely an activation pattern is, given known examples of activation patterns associated with an image category (faces, objects, scenes) in the localizer phase.
To calculate similarity to image category (Estes, 1994), we first calculated the correlation distances (1 – Pearson’s r/2) between an activation pattern for each JOL trial i and each localizer trial j. The distances were converted to similarities using a standard exponential transform:
$$\mathrm{simcat}(\mathrm{act}_i, \mathrm{act}_j) = e^{-\mathrm{dist}(\mathrm{act}_i,\, \mathrm{act}_j)}, \tag{1}$$
where “simcat” refers to similarity to image category, “dist” refers to correlation distance, and “act” refers to activation pattern. Similarity of an activation pattern for JOL trial i to each j trial of a given localizer category k (faces, scenes, or objects) was calculated as
$$\mathrm{simcat}(\mathrm{act}_i, k) = \sum_{j \in k} \mathrm{simcat}(\mathrm{act}_i, \mathrm{act}_j). \tag{2}$$
This measure of similarity to image category was used as our primary measure of whether a trial contains information about an image category. If people are thinking about a target item during a JOL, their activation patterns should have higher similarity to known instances of that target item’s category. To test this, for each trial, we collapsed similarity to image category for each k category into measures of similarity to target category, similarity to cue category, and similarity to the category that was neither cue nor target (the category that was not part of the pair on a given trial). This approach is similar to recent related efforts that computed similarity to an average activation pattern of each category or template to decode selective attention in memory encoding (e.g., Aly & Turk-Browne, 2016). The primary difference is that the similarity-to-image-category approach is based on the similarity to all examples in a category instead of a single average image and thus would allow for likelihood and density estimates in multidimensional spaces not well described by a mean.
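The similarity-to-image-category computation can be illustrated with a short sketch. It rests on stated assumptions rather than the authors' implementation: correlation distance is taken to be 1 − r, the exponential transform has no scaling parameter, similarity is summed (not averaged) over localizer exemplars, and the four-voxel patterns are made-up numbers.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length activation patterns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def similarity(pattern_i, pattern_j):
    """Exponential transform of correlation distance (assumed to be 1 - r)."""
    return math.exp(-(1.0 - pearson_r(pattern_i, pattern_j)))

def simcat(jol_pattern, localizer_patterns):
    """Summed similarity of one JOL pattern to every exemplar of one category."""
    return sum(similarity(jol_pattern, p) for p in localizer_patterns)

# Toy localizer exemplars for two categories (illustrative values only).
localizer = {
    "face": [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0]],
    "scene": [[4.0, 3.0, 2.0, 1.0], [5.0, 3.0, 2.0, 1.0]],
}
jol_trial = [1.0, 2.0, 3.0, 4.5]  # pattern elicited on one judgment trial
scores = {cat: simcat(jol_trial, pats) for cat, pats in localizer.items()}
```

In this toy case the judgment-trial pattern correlates positively with the face exemplars and negatively with the scene exemplars, so `scores["face"]` exceeds `scores["scene"]`. Because every exemplar contributes, the measure behaves like a kernel-density estimate rather than a distance to a single category mean.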
Similarity-to-image-category-based (and similarity-to-template-based) measures are similar to, but also critically different from, measures used in previous studies on neural reactivation, which tend to use classifiers (Rissman & Wagner, 2012). On its own, similarity to image category (or similarity to template) tells only how similar an activation pattern is to a specific category (faces, scenes, or objects); it does not say whether one of those categories is overall more likely, as a classifier would. Classifiers address whether there is relatively more similarity or likelihood for one class versus another. In paired-associates tasks (or tasks in which more than one category is on the screen), classifiers can be less appropriate because the same classifier output can reflect different cognitive states, given that classifiers are concerned only with the relative likelihood of one image category versus others. Thus, a trial in which a participant is strongly activating patterns associated with both cue and target categories (e.g., using the cue to prime retrieval of a target) can have the same classifier likelihood (approximately equal target and cue activation) as a trial in which they are not paying attention or thinking about either. Similarity to image category, on the other hand, should be monotonically related to the strength of activation patterns for a given image class. Similarity to image category for the target category will be higher if people are thinking about the target category and lower if they are not, regardless of whether they are also thinking about the cue category. Similarity to other categories does not affect similarity to image category in the same way it affects classifier output.
Nonetheless, a measure of relative category evidence can be constructed using similarity to image category to compare our results with those of other studies using classifiers. To do this, we created a classification probability for each image class k given an activation pattern, act_i:
$$P(k \mid \mathrm{act}_i) = \frac{\mathrm{simcat}(\mathrm{act}_i, k)}{\sum_{k'=1}^{K} \mathrm{simcat}(\mathrm{act}_i, k')}, \tag{3}$$
where the summation in the denominator is taken over all K categories. This equation is a standard way of turning exemplar-based similarity measures, like similarity to image category, into a choice probability (e.g., Ashby & Alfonso-Reese, 1995). It also highlights our earlier point about how such classification probabilities reflect different cognitive states; the same amount of information about category k could be retrieved, but different probabilities of k could be estimated simply by changing the denominator (similarity to other categories).
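The point about the denominator can be made concrete with a small sketch. The function name and all similarity values below are hypothetical and chosen only for illustration; the normalization is the ratio rule described in the text.

```python
def choice_probabilities(simcats):
    """Normalize category similarities so that they sum to 1 across categories."""
    total = sum(simcats.values())
    return {category: value / total for category, value in simcats.items()}

# Hypothetical trial A: cue and target categories are both strongly activated.
engaged = {"target": 1.8, "cue": 1.8, "other": 0.4}
# Hypothetical trial B: neither category is strongly activated.
disengaged = {"target": 0.45, "cue": 0.45, "other": 0.1}

p_engaged = choice_probabilities(engaged)
p_disengaged = choice_probabilities(disengaged)
```

Both trials yield a target probability of about .45, although target similarity is four times larger on the first trial. This is the sense in which relative category evidence can mask differences that the unnormalized similarity measure preserves.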
We tested how our similarity-to-image-category measures related to JOL type (delayed vs. immediate) and participants’ confidence using linear mixed-effects models implemented in the lme4 package in R (Version 4.0.3; Bates, Mächler, Bolker, & Walker, 2015). Degrees of freedom for statistical tests were derived using Satterthwaite’s method. Our aim in all tests was to model all possible random-effect terms (random slopes for JOL type, confidence, and their interaction), or “keep it maximal” (Barr, Levy, Scheepers, & Tily, 2013). Models with any random-effects parameters approaching zero (boundary fits) were retested with simplified random-effects structures to verify results, but none of our results changed, so only results from maximal models are reported. Bootstrapped 95% confidence intervals (CIs) are reported for parameters of interest using the lmerTest package (Kuznetsova, Brockhoff, & Christensen, 2017).
As discussed above, we planned a priori to follow convention in MVPA studies of long-term memory (Rissman & Wagner, 2012) by focusing on trials with correct responses (often referred to as “subsequently remembered”) to attempt to equate memory across confidence levels (1–7) and across immediate and delayed JOLs. However, our results and conclusions are not dependent on this convention.
Univariate analysis
Univariate analysis employed a standard three-level mixed-effects model implemented in FSL’s FMRI Expert Analysis Tool (FEAT; Woolrich, Ripley, Brady, & Smith, 2001). The first level modeled the effects of task-related regressors within individual scanning runs. Task-based regressors were convolved with a canonical double-γ hemodynamic response function. Nuisance regressors consisted of temporal derivatives for the task variables, trials with no response, realignment parameters obtained from motion correction, their temporal derivatives, and volume-wise indicator variables for scrubbed volumes that exceeded a framewise displacement of 0.9 mm (Siegel et al., 2014); we applied prewhitening to account for temporal autocorrelation. Parameter estimates from the first-level models were combined into subject-level estimates using a fixed-effects model at the second level. At the third level, a permutation test treating subjects as a random effect was used to create the final statistical maps. The permuted maps were corrected for multiple comparisons using a cluster-mass thresholding with a primary threshold (t) of 2.878 (critical one-tailed t for 18 df at α = .005) and 8-mm variance smoothing. Final thresholding included a whole-brain correction for multiple comparisons as well as small-volume corrections for a priori ROIs in the bilateral hippocampus and the frontal polar cortex (rostrolateral PFC is a subregion of frontopolar cortex). All anatomical masks came from the Harvard-Oxford Atlas (Desikan et al., 2006).
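The volume-wise scrubbing regressors mentioned above can be sketched as follows. The 0.9-mm threshold comes from the text; everything else is an assumption made for illustration: framewise displacement is computed as the sum of absolute volume-to-volume changes in the six realignment parameters, rotations are converted to millimeters on a 50-mm sphere, and the column order (three translations in mm, then three rotations in radians) is hypothetical.

```python
def framewise_displacement(params, radius_mm=50.0):
    """Framewise displacement per volume; the first volume is assigned 0."""
    fd = [0.0]
    for prev, cur in zip(params, params[1:]):
        translation = sum(abs(c - p) for c, p in zip(cur[:3], prev[:3]))
        # Convert rotation changes (radians) to arc length on the sphere (mm)
        rotation = sum(abs(c - p) * radius_mm for c, p in zip(cur[3:], prev[3:]))
        fd.append(translation + rotation)
    return fd

def scrub_regressors(fd, threshold=0.9):
    """One indicator column per volume whose displacement exceeds the threshold."""
    flagged = [i for i, value in enumerate(fd) if value > threshold]
    return [[1 if t == i else 0 for t in range(len(fd))] for i in flagged]

# Toy realignment parameters for four volumes (made-up values).
motion = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [1.2, 0.0, 0.0, 0.0, 0.0, 0.002],
    [1.2, 0.0, 0.0, 0.0, 0.0, 0.002],
]
fd = framewise_displacement(motion)
regressors = scrub_regressors(fd)
```

Only the third volume exceeds the threshold in this toy example, so a single indicator column is produced. Entering one indicator per flagged volume removes that volume's influence on the task estimates without deleting it from the time series.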
The study-and-judgment phase model included (a) task-based regressors for all study trials, (b) separate regressors for trials with correct responses (subsequently remembered; Rissman & Wagner, 2012) in the recognition-memory phase and trials that were incorrect (subsequently forgotten) for immediate-JOL trials and delayed-JOL trials, (c) separate regressors for immediate-JOL and delayed-JOL trials that included a participant’s confidence rating as a modulator, and (d) separate regressors (parametric modulators) for immediate-JOL and delayed-JOL trials modulated by the neural similarity to the target category for each trial. The contrasts of interest were whether immediate-JOL and delayed-JOL trials differed in terms of how strongly BOLD activation tracked confidence (delayed-JOL confidence – immediate-JOL confidence) and whether immediate-JOL and delayed-JOL trials differed in terms of how strongly BOLD activations tracked our multivoxel measures of similarity to the target category (delayed-JOL target similarity – immediate-JOL target similarity). Because these regressors were modeled simultaneously at Level 1, each result reflects the unique partial effect of confidence or target similarity on activation, adjusting for the other.
Results
Behavioral results
Recognition performance
A mixed-effects logistic regression revealed that recognition performance was equivalent for immediate-JOL items (M = 87.8% correct, SD = 6.98) and delayed-JOL items (M = 86.9% correct, SD = 7.72), z = 0.65, b = 0.08, SE = 0.13, p = .519, 95% CI = [−0.18, 0.36]. These results suggest that delaying judgment of items did not alter the strength of memory itself (see also the meta-analysis by Rhodes & Tauber, 2011). We further tested whether delay affected memory by examining whether accuracy within delayed trials differed as a function of the number of intervening items and found that number of intervening items did not affect accuracy, z = −1.43, b = 0.00, SE = 0.01, p = .154, 95% CI = [−0.03, 0.01].
JOL magnitude
Table 1 presents participants’ mean JOL magnitude for immediate-JOL and delayed-JOL items. JOL magnitude was higher for immediate-JOL items (M = 6.16, SD = 0.82) than for delayed-JOL items (M = 5.18, SD = 0.73), t(18) = 4.49, p < .001, d = 1.03. We also conducted an analysis to determine whether the length of the delay between studying and judging a pair affected participants’ confidence levels. A linear mixed-effects model revealed a significant effect of delay length on confidence: Confidence decreased as delay length increased, b = −0.04, SE = 0.01, t(17) = −4.70, p < .001, 95% CI = [−0.05, −0.02]. The frequencies of JOL values for both immediate and delayed JOLs are depicted in Table S1 in the Supplemental Material available online.
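The reported effect size is consistent with one common convention for paired designs, in which Cohen's d is the t statistic divided by the square root of the sample size. That the authors used this convention is an assumption; the check below only shows that the reported values agree under it.

```python
import math

t_value = 4.49        # paired t statistic reported in the text
n_participants = 19   # final sample size (df = 18)

# Paired-samples effect size: mean difference / SD of differences = t / sqrt(n)
cohens_d = t_value / math.sqrt(n_participants)
print(round(cohens_d, 2))  # 1.03
```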
Table 1. Mean Recognition Performance and Judgments-of-Learning (JOL) Magnitude
Associations between JOLs and performance
As discussed above, the delayed-JOL effect is a well-known behavioral effect whereby delayed JOLs tend to be more strongly associated with actual memory performance than immediate JOLs. Previous studies demonstrating this effect have tended to use cued-recall tests (e.g., Nelson et al., 2004; Serra & Dunlosky, 2005), and thus it is an open question whether we would see such strong coupling for delayed JOLs in the present case, in which JOLs are made in cued-recall-style trials but tested in a two-alternative forced-choice recognition test (Fig. 1b). On the one hand, retrieval of information about past study, as encouraged by target-absent delayed JOLs, should be informative for recognition tests. On the other hand, recognition-memory tests themselves may not require such retrieval, and thus JOLs may be less predictive than they would be if there were greater alignment between JOLs and the test. That is, cued-recall-style JOLs could encourage transfer-inappropriate monitoring when tests are recognition based rather than retrieval based (cf. Dunlosky & Nelson, 1997).
Although a number of ways of testing the delayed-JOL effect have been proposed, here we used a Bayesian signal-detection-theory approach that allowed us to estimate correspondence between memory performance and judgment (hierarchical meta-d′ [HMeta-d′]; Fleming, 2017). Correspondence between memory performance and judgment, or metacognitive efficiency, is given in HMeta-d′ as the M ratio, that is, the ratio of participants’ metacognitive sensitivity (meta-d′) to their actual memory sensitivity (d′). Higher M ratios for delayed JOLs compared with immediate JOLs would be consistent with a behavioral delayed-JOL effect.
To test whether immediate JOLs and delayed JOLs differed in metacognitive efficiency, we employed the standard HMeta-d′ formulation for repeated measures data, which assumes participants’ log-M ratios for immediate and delayed JOLs are drawn from a joint multivariate normal distribution (Fleming, 2017). To aid convergence, we did not include a covariance between delayed and immediate log-M ratios in the model specification. This did not impact the overall results, and when a covariance was estimated, its 95% highest posterior density interval was centered around zero. We fitted HMeta-d′ to participants’ accuracy and JOL data using three Markov chains, 200,000 sample adaptations, a 1,000,000,000 sample burn-in, 1,000,000 iterations, and 100-sample thinning for a total of 10,000 posterior distribution estimates for each chain. Gelman-Rubin convergence criteria were acceptable for all model parameters (maximum Rˆ
Multivoxel analysis
Discriminability of multivoxel patterns
The primary goal of the multivoxel analysis was to test whether there were stronger associations between confidence and activation of patterns consistent with the target image category (faces, objects, or scenes) during delayed than immediate JOLs. As a preliminary test of whether activation patterns contain information about the image categories, we tested whether our classifier (Equation 3) could successfully classify the category of the cue stimulus (the item from the pair that was on the screen during JOLs; Fig. 1a). We found that overall accuracy in classifying the cue stimuli was significantly above chance (.33) across participants in each image category (faces: mean accuracy = .66, difference from chance = .31, 95% CI = [.26, .38], t(18) = 10.9, p < .001; scenes: mean accuracy = .46, difference from chance = .13, 95% CI = [.07, .19], t(18) = 4.83, p < .001; objects: mean accuracy = .42, difference from chance = .09, 95% CI = [.04, .14], t(18) = 3.72, p = .002), suggesting that activation patterns elicited during JOLs contain information about stimulus category for items on the screen.
Associations between multivoxel patterns and JOLs
Our overall hypothesis was that activation of information associated with the target category should increase as a function of confidence for delayed JOLs, but not immediate JOLs, if people are using information about the previous study trial to inform their JOLs when those judgments are made at a delay. To test this, we collapsed similarity to image category (Equation 2) for the different image categories into one measure of similarity to target category and regressed this measure on the interaction between JOL confidence and JOL type using a linear mixed-effects model. We found a significant interaction: There was a stronger association (more positive slope) between confidence and the similarity to target category during delayed-JOL trials than during immediate-JOL trials, b = 0.05, t(61.72) = 3.01, p = .004, 95% CI = [0.02, 0.08] (see Fig. 2 and Figs. S3 and S4 in the Supplemental Material). A follow-up simple-slopes analysis revealed a significant positive association between confidence and similarity to target for delayed JOLs, b = 0.031, t(15.42) = 2.70, p = .02, 95% CI = [0.01, 0.05], and a significant negative slope between confidence and similarity to target for immediate JOLs, b = −0.03, t(149) = −2.09, p = .039, 95% CI = [−0.05, 0.00]. The negative result was not predicted in this case but indicates that, for immediate JOLs, there was more similarity to target for less confidently rated items than for more confidently rated items. Together, the results suggest that participants were activating information about the target category more during delayed JOLs than immediate JOLs, so that greater confidence is associated with greater similarity of elicited activation patterns with those of the target category.
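To make the logic of the interaction and simple-slopes tests concrete, here is a minimal sketch using ordinary least squares on made-up data. It ignores the random effects of the mixed-effects model reported above, and all values and names are hypothetical. In a fixed-effects model with dummy-coded JOL type (immediate = 0, delayed = 1), the interaction coefficient equals the delayed slope minus the immediate slope.

```python
def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Hypothetical confidence ratings and similarity-to-target values
# (illustrative only; not data from the study).
confidence = [1, 2, 3, 5, 6, 7]
sim_immediate = [0.12, 0.10, 0.09, 0.05, 0.03, 0.00]
sim_delayed = [0.01, 0.03, 0.08, 0.12, 0.16, 0.20]

# Simple slopes: one slope per JOL type.
slope_immediate = ols_slope(confidence, sim_immediate)
slope_delayed = ols_slope(confidence, sim_delayed)

# Interaction: how much more positive the delayed slope is.
interaction = slope_delayed - slope_immediate
print(round(slope_immediate, 3), round(slope_delayed, 3), round(interaction, 3))
```

A positive interaction here means that similarity to target rises with confidence more steeply for delayed than for immediate judgments, which is the pattern the reported model tests for.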

Fig. 2. Neural similarity to the target item observed during immediate and delayed judgment-of-learning (JOL) trials as a function of participants’ confidence level. Trials in which participants rated their confidence as 4 were removed because most were trials in which participants failed to adjust the scale. The gray bands around each trend line indicate 95% confidence intervals.
Our similarity-to-image-category measure indicates how much an activation pattern elicited during JOLs resembles the activation patterns of known image categories from the localizer phase. It does not take into account similarity to other categories, such as the cue category or the nonpresent category (i.e., the image category that was associated with neither the target nor the cue stimulus), as would be common in classification approaches. We took this approach because classification predictions are dependent among the stimulus categories: The greater the similarity to the cue, the lower the classification probability for the target, holding target similarity constant. Nonetheless, it can be useful to assess whether target-category classification, derived from relative similarity of the target category to all possible categories (Equation 3), would show the same results as a general similarity to target category. Target-category-classification probability showed the same interaction as that observed for similarity to target category: There was a stronger association between target-category classification probability and confidence for delayed JOLs than for immediate JOLs, b = 0.0005, t(22.8) = 2.31, p = .030, 95% CI = [0.00, 0.001]. These results suggest that not only is similarity-to-target-category activation more strongly associated with confidence for delayed JOLs than immediate JOLs but also at least some of the information driving this interaction is diagnostic of the target category, in the sense that the likelihood of classifying an activation pattern as an example of the target category increases with increasing confidence for delayed JOLs.
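The dependence among categories described above can be illustrated with a small sketch. Equation 3 is not reproduced in this section, so the softmax below is a stand-in assumption rather than the classifier actually used, and the similarity values are hypothetical. It shows that raising similarity to the cue category lowers the target-category probability even when similarity to the target is held constant.

```python
import math


def softmax(scores):
    """Convert a dict of similarity scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {name: math.exp(value - peak) for name, value in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


# Same target similarity in both cases; only cue similarity differs.
low_cue = softmax({"target": 0.30, "cue": 0.10, "nonpresent": 0.05})
high_cue = softmax({"target": 0.30, "cue": 0.50, "nonpresent": 0.05})

print(round(low_cue["target"], 3), round(high_cue["target"], 3))
```

Because the probabilities must sum to 1, any gain for one category comes at the expense of the others, which is why a raw similarity measure and a classification probability can behave differently.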
To test how the length of delay (number of intervening trials) affected the similarity-to-target-category results, we included an interaction between delay and confidence in the linear mixed-effects model testing the simple slope relating confidence to the similarity to target category for delayed JOLs. There was no interaction between confidence and delay in their effect on similarity to target category, F(1, 25.4) = 1.34, p = .257, nor was there a main effect of delay, F(1, 22.0) = 2.26, p = .147. However, there remained a significant effect of confidence: As confidence increased, there was greater similarity to target category, b = 0.04, t(18.8) = 2.90, p = .009, 95% CI = [0.01, 0.07] (see Fig. S1). These results suggest that the length of the delay does not matter for observing the greater association between activation of target-category information and confidence; what matters is only that there is some filled delay (at least one intervening study trial). This is consistent with behavioral studies suggesting that cue-utilization differences (e.g., use of information about past study trials) between delayed and immediate JOLs can be obtained with as few as one intervening study item (cf. Bui, Pyc, & Bailey, 2018; Kelemen & Weaver, 1997).
Although our results suggest that our similarity-to-target-category measure reflects activation of information about past study trials that is more strongly associated with delayed than immediate JOLs, it is useful to test whether we see similar effects in terms of similarity to the cue’s category and similarity to the nonpresent category. On the one hand, these measures were already included in the classification analysis showing that some similarity-to-target-category effects are driven by diagnostic information, and thus there has to be some effect of confidence that is unique to similarity to target. On the other hand, the classifier probability takes into account similarity to all three categories (target, cue, nonpresent) in making predictions, and thus it can be useful to separately verify that similarity-to-cue and similarity-to-nonpresent measures do not show the same pattern as similarity-to-target-category measures; if they did, it might suggest that some of the effects are due to more global signals related to confidence or memory as opposed to target-category information per se. Consistent with our hypothesis that similarity to target reflects activation of information associated with the target category and not just a global signal that would affect any similarity-based measures, our results did not reveal a difference in association between confidence and similarity to cue category for delayed and immediate JOLs, b = 0.01, t(23.2) = 0.34, p = .738, 95% CI = [−0.03, 0.05] (see Fig. S2 in the Supplemental Material). Likewise, we did not find a difference in association between confidence and similarity to nonpresent category for delayed and immediate JOLs, b = 0.00, t(27.55) = −0.03, p = .976, 95% CI = [−0.04, 0.04] (see Fig. S3 in the Supplemental Material).
Together, these results suggest that our observation of increasing similarity to image category as a function of confidence for delayed JOLs is unique to the similarity to target category, which is consistent with our overall hypothesis that people use retrieval of target information more for informing delayed JOLs than immediate JOLs.
Univariate results
The multivoxel results are consistent with our overall hypothesis that participants use information about past study episodes as a basis (cue) for delayed JOLs. In addition, we examined how brain regions associated with delayed JOLs and confidence in previous studies activated as a function of confidence and our similarity-to-target-category measure in the present study. These results allow us to connect our study to previous work on memory monitoring as well as to examine how the information measured by our similarity to target may be activated by the brain during delayed JOLs.
Given that previous memory-monitoring research suggests that the MTL becomes more activated to support JOLs after a delay, we selected the hippocampus—a key region for retrieval of information from long-term memory—as an ROI. We expected hippocampus activation to correlate with our trial-by-trial estimates of how much information was being retrieved about target items (i.e., similarity to target) for delayed JOLs but not for immediate JOLs.
This analysis is a form of multivoxel connectivity analysis in the sense that it shows how similarity to target in our target-category ROIs correlates with activation in other brain regions. Regions that correlate with similarity to target thus are sensitive to differences in the amount of target information retrieved but do not necessarily represent the target category themselves. Consistent with the hypothesis that the hippocampus should be associated with the amount of target-category information retrieved, in an a priori hippocampal ROI, our results showed bilateral clusters that tracked similarity to target more for delayed JOLs than for immediate JOLs (see Fig. 3; Cluster 1: voxels = 45, p = .021, x = 22, y = −24, z = −14; Cluster 2: voxels = 47, p = .022, x = −20, y = −20, z = −16). When contrasting immediate JOLs and delayed JOLs at the whole-brain level, we also found a cluster encompassing the dorsal anterior cingulate gyrus, paracingulate gyrus, and superior frontal gyrus (see Table S2 and Fig. S10 in the Supplemental Material). The dorsal anterior cingulate and premotor regions that tracked this measure are often involved in representing confidence and uncertainty; for the delayed JOLs, information about the target category is hypothesized to feed into these regions. Given these regions’ strong connection to confidence, it is interesting that they correlated with our hypothesized predecisional input into confidence ratings (information about target-category retrieval) and not participants’ actual confidence ratings, when both were modeled simultaneously (see below).

Fig. 3. Results from the univariate analysis. Activation in frontal pole and rostrolateral prefrontal cortex (a) is shown for the comparison between delayed and immediate confidence ratings (small-volume-corrected with frontal pole mask). Activation in bilateral hippocampus (b) is shown for the comparison between delayed and immediate similarity to target (small-volume-corrected with bilateral hippocampal mask). Values for x, y, and z are given in Montreal Neurological Institute coordinates. R = right.
In addition to the MTL, rostrolateral PFC is a key region underlying metacognitive judgments and is known to be associated with monitoring accuracy (Chua et al., 2014; Fleming et al., 2012; Heereman et al., 2015; McCurdy et al., 2013; Morales et al., 2018; Yang et al., 2015). We examined whether rostrolateral PFC tracked confidence more strongly for delayed than immediate JOLs. In an a priori frontal-pole ROI, a frontal-pole cluster including the rostrolateral PFC showed stronger associations with confidence in delayed JOLs compared with immediate JOLs (see Fig. 3; voxels = 248, p = .002, peak: x = −2, y = 60, z = 14). At the whole-brain level, significant clusters were observed in the inferior temporal gyrus as well as the frontal orbital cortex that tracked confidence more for delayed JOLs than immediate JOLs (see Table S3 and Fig. S11 in the Supplemental Material). These results help to connect the present study with previous metacognition studies that have focused on the rostrolateral PFC with respect to accuracy. Interestingly, given the lack of robust behavioral differences in accuracy between delayed and immediate JOLs, this difference in association between activation in rostrolateral PFC and confidence for delayed versus immediate JOLs is more likely to reflect potential strategic differences in the cues utilized to inform the JOLs, as opposed to accuracy per se. However, future research will be needed to firmly disentangle cue utilization from accuracy.
Discussion
Whether different processes or systems are used when JOLs are made at a delay has inspired research in both cognitive psychology and neuroscience. Here, we reframed this critical question as one of information or cue utilization: We used MVPA to test whether people utilize different information (cues) when making memory-monitoring judgments after a delay instead of immediately after initial learning. Stronger activation of multivoxel patterns for previously studied faces, scenes, and objects was associated with higher confidence during delayed JOLs but not immediate JOLs. Further, activation in the hippocampus tracked our multivoxel measure of information retrieval more for delayed JOLs than immediate JOLs, consistent with the idea that retrieval mechanisms underlying hippocampal activation may be harnessed more when making memory judgments at a delay. Our study is the first to show, with neuroimaging data, that the brain activates information associated with previous study episodes to inform delayed JOLs. These results provide key insights and clarifications for both the neuroimaging and behavioral literatures on metacognition.
Behavioral studies on metacognition have often focused on accuracy, the correspondence between judgments and actual memory, to test theories of how different types of memory judgments may utilize different processes or information. However, with respect to the present question of whether information about past study episodes is used to inform delayed JOLs more than immediate JOLs, accuracy has proven neither necessary nor sufficient for establishing process or information-based dissociations. Accuracy itself is only weakly tied to the JOL process; greater accuracy on its own can answer only whether some cue that people are using during JOLs is more informative about their upcoming test performance, but it cannot necessarily tell what this cue is. Further, accuracy depends heavily on correspondence between the processing that people are using to generate JOLs and the demands of later tests, or transfer-appropriate monitoring (Dunlosky & Nelson, 1997). If retrieving information about past study episodes informs the JOLs but is itself not inherently predictive of future test performance, as we saw with metacognitive accuracy in the present task (in which JOLs were made under cued-recall instructions but memory was tested using recognition), behavioral accuracy may not differentiate the cues used at all. Our MVPA measure of activation of target-category information, on the other hand, relies on brain activation during JOLs and thus is a potentially more straightforward test of whether people use different cues to make immediate versus delayed JOLs. Future studies might test whether our failure to observe greater accuracy for delayed JOLs was due to transfer-inappropriate monitoring or some other mechanism or whether there are more subtle effects of different levels of delay on cue usage (consistent with findings of previous behavioral studies, though our results suggest that a filled delay of even one study trial was sufficient to make a JOL delayed).
Nevertheless, our results lead to an important observation: Future test behavior is only indirectly related to the processes occurring during JOLs themselves, and thus measures of online information utilization, such as the ones we introduce here, will be critical for future studies of how different monitoring demands influence JOLs.
Our study also offers insights and questions for neurobiological memory-monitoring research. Previous studies on the neurobiology of memory monitoring have focused heavily on dissociations in regional activation for different JOL types. When made simultaneously with encoding, JOLs generally do not correlate with activation in memory regions such as the hippocampus (Kao et al., 2005), but such regions become activated when JOLs are made at a delay from encoding (Do Lam et al., 2012). As with accuracy, regional activation cannot be tied directly to differences in information usage, as it is not possible to infer, on the basis of activation of the hippocampus alone, that it is retrieving information about past experiences. However, by showing that the hippocampus is associated with our multivoxel measure of target information, we are better able to suggest that this region is involved in using such information to inform delayed JOLs. Importantly, this does not mean that retrieval is not happening during immediate JOLs, but at the very least it suggests that retrieved information is weighted less in immediate than in delayed JOLs. A key question for future research is not only whether this difference in utilization constitutes a process-level dissociation, as some JOL theorists have suggested, but also how exactly delay (or any other factor) leads people to use different cues to inform their JOLs.
In summary, activation of information about past experiences was associated with confidence for delayed but not immediate JOLs. This result helps to unite the behavioral- and neuroimaging-metacognition literatures by showing that neuroimaging data can be used to test fine-grained hypotheses about the information people use to make metacognitive judgments.
Supplemental Material
sj-docx-1-pss-10.1177_0956797620958004 – Supplemental material for Delayed Judgments of Learning Are Associated With Activation of Information From Past Experiences: A Neurobiological Examination
Supplemental material, sj-docx-1-pss-10.1177_0956797620958004 for Delayed Judgments of Learning Are Associated With Activation of Information From Past Experiences: A Neurobiological Examination by Timothy D. Kelley, Debbie A. McNeely, Michael J. Serra and Tyler Davis in Psychological Science
Footnotes
Transparency
Action Editor: Caren Rotello
Editor: D. Stephen Lindsay
Author Contributions
All authors contributed to the study concept and the study design. Testing and data collection, analysis, and interpretation were performed by T. D. Kelley and T. Davis. All authors were involved in the writing of the manuscript and approved the final version of the manuscript for submission.
