Abstract
We evaluated within-person variability across a cognitive test battery by analyzing the shape of the distribution of each individual’s scores within a battery of tests. We hypothesized that most healthy adults would produce test scores that are normally distributed around their own personal battery-wide, within-person (wp) mean. Using cross-sectional data from 327 neurologically healthy adults, we computed each person’s mean, standard deviation, skew, and kurtosis for 30 neuropsychological measures. Raw scores were converted to T-scores using three degrees of calibration: (a) none, (b) age, and (c) age, sex, race, education, and estimated premorbid IQ. Regardless of calibration, no participant showed abnormal within-person skew (wpskew) and only 10 (3.1%) to 16 (4.9%) showed wpkurtosis greater than 2. If replicated in other samples and measures, these findings could illuminate how healthy individuals are endowed with different cognitive abilities and provide the foundation for a new method of inference in clinical neuropsychology.
Introduction
Many investigators have examined the range of intra-individual variability (IIV) in cognitive test performance, beginning with the concept of test score “scatter” (Wechsler, 1939). Kaufman (1976) proposed the consideration of inter-subtest scatter as measured by the difference between a person’s highest and lowest Wechsler subtest scores, also referred to as “maximum discrepancy” (Schretlen et al., 2003). Others advocate computing an intra-individual standard deviation (ISD) based on a person’s z-transformed test scores (Lindenberger & Baltes, 1997). Still others have defined IIV in terms of the deviation of a person’s test scores from his or her own mean test score (McLean et al., 1989) or IQ (Rabbitt, 1993). Importantly, each of these approaches ignores skewness and kurtosis, parameters required to assess the shape of within-person test score distributions.
Cognitive IIV has been demonstrated in healthy aging as well as numerous clinical populations, including traumatic brain injury, HIV, and neurodegenerative diseases (Bangen et al., 2019; Dykiert et al., 2012; Vance et al., 2021). The literature has shown that increased cognitive IIV predicts cognitive decline, abnormal neuroimaging findings, and even mortality (Halliday et al., 2019; Hilborn et al., 2009; Holtzer et al., 2008; Vance et al., 2021). In a healthy aging sample, increased cognitive IIV was associated with reduced white matter volume and increased frontal lobe white matter hyperintensities (Bunce et al., 2007), as well as increased frontal lobe activation during cognitive task performance (Bellgrove et al., 2004). However, Matarazzo (1990) and Kline et al. (1993) provided convincing evidence that such cognitive IIV does not measure what we think it does. For example, Matarazzo found that greater variability among cognitive tests was associated with higher, not lower, IQ.
Despite the above-mentioned advancements, we still do not have established population norms for cognitive IIV. In clinical practice, neuropsychologists compare a person’s performance on any given cognitive test with the score distribution of a normative sample as opposed to an examinee’s own within-person distribution of test scores. This raises at least two questions: First, do most healthy people produce scores that are normally distributed on a battery of cognitive tests? Second, does the shape of a person’s test score distribution depend on his or her own overall mean performance? If healthy people produce normal within-person test score distributions regardless of their overall ability, then departures from normality might help diagnose cognitive dysfunction.
Consider the hypothetical scenario in which two people, X and Y, both earn T-scores of 52 on a verbal fluency test (test A). This places both their performances in the middle of a normative distribution (Figure 1A). However, Figure 1A sheds no light on how their verbal fluency performance compares with their other test scores. Figures 1B and 1C plot the same T-scores relative to each person’s own test score distribution, which we created for this purpose. Person X’s verbal fluency T-score of 52 represents a personal strength in a hypothetical within-person distribution that is positively skewed, more variable than usual, and has a low-average mean. Person Y’s verbal fluency T-score of 52 represents a personal weakness in a minimally negatively skewed distribution with typical variability and an above-average mean. This shows that identical population-based T-scores can have different clinical meanings in the context of a person’s own test score distribution. What remains unknown at present is whether the cognitive test scores of most healthy persons are normally distributed and whether parameters of within-person test score distributions vary as a function of a person’s overall ability, demographic characteristics, or some other factor(s). If most healthy adults produce normally distributed scores across a test battery, then measuring departures from within-person normality might prove useful for diagnosis and offer a novel approach to clinical inference.

(A) Normative Distribution. Scores on Test A for Two Hypothetical Participants (Persons X and Y) Are Starred. Note That Persons X and Y Earned the Same Score on Test A. (B, C) Score Distributions for Persons X and Y, With the Score for Test A Starred in Each. Note That Despite Having Identical Scores on Test A, the Participants’ Overall Distributions of Scores on the Entire Test Battery Differ From One Another. For Example, Test A Is a Relative Strength for Person X But a Relative Weakness for Person Y. Despite These Differences, Neither Distribution Is Markedly Abnormal.
In this study, we extend existing IIV analyses by measuring the overall shape of within-person (wp) test score distributions. We hypothesized that most healthy adults would produce test score distributions that are normal (i.e., with minimal within-person skew [wpskew] and kurtosis [wpkurtosis]) on a battery of cognitive tests (Schretlen & Sullivan, 2013). We previously found that healthy older adults differed from patients with cognitive impairment not only in terms of within-person mean (wpmean) scores and standard deviations (wpSD), but also in terms of wpskew and wpkurtosis (Reckess et al., 2014). In that cross-sectional study, the within-person test score distributions of patients shifted increasingly away from normality with the severity of their cognitive impairment (Reckess et al., 2014). In the current study, we examine the range of four parameters (wpmean, wpSD, wpskew, and wpkurtosis) of the within-person score distributions produced by healthy adults.
Finally, the current study also aimed to determine whether within-person distributions across the adult life span vary as a function of participant characteristics or calibration for them. Applying demographic calibrations can increase the precision of expected performance, but its impact on the properties of within-person distributions is unknown.
Method
Participants
The study sample consists of 327 neurologically normal participants in the Johns Hopkins Aging, Brain Imaging, and Cognition (ABC) study, which was used to derive normative data for the Calibrated Neuropsychological Normative System (CNNS) software program (Schretlen et al., 2010). The ABC study was approved by our institution’s Institutional Review Board, and all participants gave written informed consent. Data collection procedures and sample characteristics are described in detail elsewhere (Schretlen et al., 2008). Participants ranged in age from 18 to 90 (M = 54.8, SD = 18.8). As shown in Table 1, the sample included non-Latino White (80.2%), Black (18.0%), and Asian/Latino/Other (1.8%) participants, and slightly more women (56.6%) than men. Participants completed from 3 to 20 years (M = 14.2, SD = 3.0) of schooling. Exclusionary criteria included presence of neurological disease (e.g., Parkinson’s disease, Alzheimer’s disease, history of stroke or brain injury), severe mental illness (e.g., bipolar disorder, schizophrenia), or current substance abuse/dependence.
Demographic Makeup of the Sample.
Neuropsychological Measures
The current analyses were conducted using 30 measures (Table 2) derived from 19 tests included in the CNNS battery. These measures capture major cognitive domains while maximizing the number of participants with complete data. Where possible, summary scores (e.g., total letter-cued fluency) were used rather than scores for individual trials. Raw scores were converted to T-scores using CNNS software. For each measure, three T-scores were derived: (a) uncalibrated (i.e., direct conversion from scaled scores to T-scores for the entire sample), (b) age-calibrated, and (c) fully calibrated (i.e., adjusted for age, sex, race, years of education, and estimated premorbid IQ based on performance on the Hopkins Adult Reading Test [HART; Schretlen et al., 2009]). The methods used to derive the regression-based norms on which these T-scores were based are described elsewhere (Testa et al., 2009).
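As a rough illustration of how a regression-based, demographically calibrated T-score can be formed, the sketch below compares an observed score with the score predicted from an examinee's characteristics and rescales the standardized residual to the T metric (mean 50, SD 10). This is a generic sketch and not the CNNS algorithm: the regression coefficients, the residual SD, and the function names are all hypothetical.

```python
def t_score(z):
    """Rescale a z-score to the T-score metric (mean 50, SD 10)."""
    return 50.0 + 10.0 * z


def calibrated_t(observed, predicted, residual_sd):
    """T-score for an observed score relative to a regression prediction."""
    z = (observed - predicted) / residual_sd
    return t_score(z)


def predict_from_age(age, intercept=60.0, slope=-0.25):
    """Hypothetical age-only norm: expected score falls 0.25 points per year."""
    return intercept + slope * age


# A made-up 70-year-old examinee with an observed score of 42.
t_age = calibrated_t(42.0, predict_from_age(70), residual_sd=8.0)
print(t_age)  # 49.375
```

A fully calibrated score would follow the same pattern with additional predictors (sex, race, education, estimated premorbid IQ) in the prediction step.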
Neuropsychological Measures.
a Folstein et al. (1975). b Benton et al. (1994). c Klove (1963), Reitan and Davison (1974). d Salthouse and Babcock (1991). e Army Individual Test Battery (1944). f Wechsler (1981, 1997). g Schretlen (1997). h Axelrod and Millis (1994), Shallice and Evans (1978). i Schretlen (2010). j Schretlen and Vannorsdall (2010). k Selnes et al. (1988). l Benton et al. (1994). m Corwin and Bylsma (1993), Osterrieth (1944), Rey (1941). n Wechsler (1987). o Benedict (1997). p Brandt and Benedict (2001). q Wechsler (1987). r Manning et al. (2007). WAIS = Wechsler Adult Intelligence Scale; WMS = Wechsler Memory Scale; R = Revised; III = Third Edition.
Analyses
For the 327 study participants, we first computed each person’s wpmean, wpSD, wpskew, and wpkurtosis based on his or her T-scores using the three levels of calibration described above. Using Pearson’s r correlations, we examined associations among the resulting 12 parameters (i.e., four parameters at each level of calibration). We also examined associations between each parameter and five participant characteristics, using Pearson’s r for age, education, and estimated IQ, and point-biserial (rpb) correlations for sex and race.
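The four within-person parameters can be sketched as follows. This is a minimal illustration, assuming the bias-adjusted sample skewness and kurtosis formulas that statistical packages such as SPSS commonly report; the function name `wp_parameters` and the toy scores are hypothetical, and the actual analyses used 30 T-scores per person.

```python
import math


def wp_parameters(t_scores):
    """Return (wpmean, wpSD, wpskew, wpkurtosis) for one person's T-scores."""
    n = len(t_scores)
    if n < 4:
        raise ValueError("need at least 4 scores for adjusted kurtosis")
    mean = sum(t_scores) / n
    dev = [x - mean for x in t_scores]
    sd = math.sqrt(sum(d * d for d in dev) / (n - 1))  # sample SD (n - 1)
    if sd == 0:
        raise ValueError("all scores identical; shape is undefined")
    z3 = sum((d / sd) ** 3 for d in dev)
    z4 = sum((d / sd) ** 4 for d in dev)
    # Adjusted Fisher-Pearson skewness and excess kurtosis (0 for a normal curve).
    skew = n / ((n - 1) * (n - 2)) * z3
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))) * z4 \
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return mean, sd, skew, kurt


# Toy example with five evenly spaced T-scores (symmetric and flat).
wpmean, wpsd, wpskew, wpkurt = wp_parameters([40, 45, 50, 55, 60])
print(wpmean, round(wpsd, 2), round(wpskew, 2), round(wpkurt, 2))
```

Applying the function once per participant at each calibration level yields the 12 parameters (four per level) that enter the correlational analyses.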
We next examined the 327 within-person test score distributions for departures from normality in two ways. We first conducted Kolmogorov–Smirnov (KS) tests of the normality of the wpmean, wpSD, wpskew, and wpkurtosis distributions at each calibration level. We then tallied the number of participants who produced wpskew and wpkurtosis coefficients that exceeded an absolute value of 2, as recommended by Hair (2021) to establish non-normality. To test the effects of score calibration on wpmean, wpSD, wpskew, and wpkurtosis, we also conducted three paired-samples t-tests (uncalibrated vs. age-calibrated, uncalibrated vs. fully calibrated, and age-calibrated vs. fully calibrated) for each parameter.
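The tallying step amounts to counting coefficients whose absolute value exceeds 2 and expressing the count as a percentage of the sample. The sketch below uses made-up coefficients and hypothetical helper names; only the sample size (327) and the counts of 10 and 16 come from the text.

```python
def flag_nonnormal(coefs, threshold=2.0):
    """Return indices of participants whose |coefficient| exceeds threshold."""
    return [i for i, c in enumerate(coefs) if abs(c) > threshold]


def percent(count, n):
    """Percentage of n participants, rounded to one decimal place."""
    return round(100.0 * count / n, 1)


# Made-up wpkurtosis coefficients for six hypothetical participants.
wpkurtosis = [0.1, -0.8, 2.4, 1.9, 3.1, -1.2]
flagged = flag_nonnormal(wpkurtosis)
print(flagged)           # [2, 4]

# Counts reported for the sample of 327.
print(percent(10, 327))  # 3.1
print(percent(16, 327))  # 4.9
```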
Finally, we sought to determine whether specific cognitive measures contributed to high levels of within-person variability, as this could indicate that within-person distributional properties depend on the specific tests used. To examine this, we tallied the percentage of participants for whom each measure appeared as the highest or lowest of each person’s score distribution. If a person obtained identically high or low scores on two measures, both were counted. All statistical analyses were conducted using IBM SPSS Statistics version 28.
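One way to express this tally, including the rule that tied scores are all counted, is sketched below. The data structure (one dictionary of measure names to T-scores per participant), the function name, and the example scores are hypothetical.

```python
from collections import Counter


def tally_extremes(scores_by_person):
    """Percentage of people for whom each measure is their highest or lowest.

    scores_by_person: list of dicts mapping measure name -> T-score.
    Measures tied for a person's highest (or lowest) score are all counted.
    """
    highest, lowest = Counter(), Counter()
    for scores in scores_by_person:
        hi, lo = max(scores.values()), min(scores.values())
        for name, t in scores.items():
            if t == hi:
                highest[name] += 1
            if t == lo:
                lowest[name] += 1
    n = len(scores_by_person)

    def to_pct(counter):
        return {name: 100.0 * count / n for name, count in counter.items()}

    return to_pct(highest), to_pct(lowest)


# Two made-up participants; person 1 has a tie for highest score.
people = [
    {"A": 60, "B": 60, "C": 40},
    {"A": 45, "B": 55, "C": 50},
]
pct_high, pct_low = tally_extremes(people)
print(pct_high)  # {'A': 50.0, 'B': 100.0}
print(pct_low)   # {'C': 50.0, 'A': 50.0}
```

Because ties are counted for every tied measure, the percentages across measures can sum to more than 100.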
Results
Table 3 shows mean, minimum, and maximum values for the wpmean, wpSD, wpskew, and wpkurtosis coefficients by calibration level, along with the results of KS tests for the normality of each distribution parameter. Regardless of calibration level, wpmean, wpSD, and wpskew were normal. While the KS test for age-calibrated wpskew was nominally significant (p = .047), it did not survive Bonferroni adjustment for multiple comparisons (p < .013). In contrast, KS tests showed that distributions of wpkurtosis departed from normality at every level of test score calibration (all ps < .001).
Within-Person Test Score Distribution Parameter Values for Three Levels of Test Score Calibration and Kolmogorov–Smirnov Tests of Normality for Each Parameter’s Distribution.
Note. wp = within-person. SD = standard deviation. For calibration, none = no calibration; age = age-calibrated; full = calibrated age, sex, education, race, and estimated premorbid IQ.
Table 4 presents the frequency distributions of wpmean, wpSD, wpskew, and wpkurtosis by calibration level. As shown, 95% of the 327 participants produced a wpmean between 38.9 and 58.4 for age-calibrated scores or 41.1 and 57.9 for fully adjusted scores, and 95% produced a wpSD between 5.6 and 11.3 for age-calibrated scores or 6.2 and 12.2 for fully calibrated scores. Notably, 95% of participants produced wpskew coefficients that ranged from –1.1 to .82 for age-calibrated or .78 for fully calibrated scores. No participant produced a wpskew coefficient that exceeded the absolute value of 2, the threshold conventionally accepted to signify departure from normality (Hair, 2013). In fact, just three persons produced a wpskew > 1 for age-calibrated scores and just one produced wpskew > 1 with fully calibrated scores. More participants showed negative than positive wpskew with age-calibrated (61.2%) and fully calibrated (57.8%) scores, but not with uncalibrated (48.3%) scores. Consistent with the KS results, a few participants produced wpkurtosis coefficients that exceeded 2. All of these were positive (leptokurtic), and the number of participants with wpkurtosis coefficients >2 decreased with level of test score calibration (uncalibrated, 4.9%; age-calibrated, 3.7%; and fully calibrated, 3.1%). In short, while KS testing revealed non-normal distributions of wpkurtosis, fewer than 4% of participants showed markedly leptokurtic within-person test score distributions with age-calibrated or fully calibrated scores.
Frequency Distributions of Within-Person Parameters by Score Calibration Level.
Note. wp = within-person. SD = standard deviation. For calibration, none = no calibration; age = age-calibrated; full = fully calibrated (scores calibrated for age, sex, education, race, and estimated premorbid ability).
We next examined associations among within-person distribution parameters. These analyses involved 18 Pearson’s r correlations (i.e., uncalibrated wpmean with uncalibrated wpSD, wpskew, and wpkurtosis, followed by uncalibrated wpSD with wpskew and wpkurtosis, followed by wpskew with wpkurtosis, and then repeating these for age-calibrated and fully calibrated scores). To maintain an α of <.05, we applied a Bonferroni correction (p < .003) to reject the null hypothesis of no association. Most correlations were small and non-significant, but three correlations among age-calibrated parameters survived Bonferroni correction (wpmean with wpSD: r = −.28, p < .001; wpmean with wpskew: r = .22, p < .001; and wpskew with wpkurtosis: r = −.35, p < .001). Thus, when using age-calibrated test scores, participants with higher wpmeans showed within-person distributions that were slightly less variable, more negatively skewed (i.e., with more scores clustering above and a few scores further below their wpmeans), and slightly more leptokurtic than those with lower wpmeans. Notably, while statistically significant, these three correlations were relatively weak. In sum, test battery wpmeans were nearly identical across levels of calibration, wpSDs increased with degree of calibration (from none to fully calibrated), and wpskew and wpkurtosis averaged close to zero across study participants.
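The count of 18 correlations and the corrected threshold follow from simple arithmetic: four parameters give six unique pairs, repeated at three calibration levels, and .05 divided by 18 is roughly .003. A short sketch (variable names are illustrative):

```python
from itertools import combinations

parameters = ["wpmean", "wpSD", "wpskew", "wpkurtosis"]
calibrations = ["none", "age", "full"]

# Unique parameter pairs within one calibration level: C(4, 2) = 6.
pairs_per_level = list(combinations(parameters, 2))
n_tests = len(pairs_per_level) * len(calibrations)

alpha = 0.05
threshold = alpha / n_tests  # Bonferroni-corrected per-test threshold

print(n_tests)              # 18
print(round(threshold, 4))  # 0.0028
```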
Next, correlations between within-person distribution parameters and participant characteristics are shown in Table 5. Using a Bonferroni correction of p < .001 to maintain an overall α < .05 for rejecting the null hypothesis, uncalibrated wpmean scores correlated inversely with age (r = −.62, p < .001) and positively with education (r = .40, p < .001) and estimated IQ (r = .54, p < .001). Black participants produced lower wpmean T-scores than other participants (rpb = −.28, p < .001), but wpmean T-scores did not vary by sex. Uncalibrated wpSD, wpskew, and wpkurtosis did not correlate significantly with any participant characteristic. When test scores were calibrated for age, wpmean did not correlate with age or sex but did correlate with years of education (r = .46, p < .001), race (rpb = −.42, p < .001), and estimated IQ (r = .23, p < .001). As expected, fully calibrating test scores eliminated significant correlations between wpmean and every participant characteristic shown in Table 5. Age-calibrated wpSD correlated with race (rpb = .20, p < .001), as Black participants showed slightly greater variability across tests than other participants. However, this was true only for age-calibrated scores. No other calibrated within-person parameter correlated with any other participant characteristic.
Correlations Between Distributional Properties and Participant Characteristics.
Point-biserial correlations (Men [Sex] and White [Race] coded as the smaller value).
p < .001 (two-tailed); based on Bonferroni correction for multiple correlations to maintain α < .05.
To test the effects of test score calibration on wpmean, wpSD, wpskew, and wpkurtosis, we conducted three paired-samples t-tests (uncalibrated vs. age-calibrated, uncalibrated vs. fully calibrated, and age-calibrated vs. fully calibrated) for each parameter. The first three t-tests revealed no differences in wpmeans as a result of calibration level. However, the next three revealed substantial effects of calibration on wpSD (uncalibrated vs. age-calibrated t(326) = −29.88, p < .001; uncalibrated vs. fully calibrated t(326) = −38.46, p < .001; age-calibrated vs. fully calibrated t(326) = −20.93, p < .001). These findings show that increasing the level of calibration tends to increase within-person performance variability across the tests administered. The third set of paired-samples t-tests showed that compared with uncalibrated scores, both age-calibrated (t(326) = 6.81, p < .001) and fully calibrated (t(326) = 5.82, p < .001) wpskew coefficients were larger. Age-calibrated and fully calibrated wpskew coefficients did not differ (t(326) = −.59, p = .557), and no level of test score calibration significantly altered wpkurtosis.
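For readers less familiar with the paired-samples t-test, the statistic is the mean of the within-participant differences divided by its standard error, with n − 1 degrees of freedom (327 − 1 = 326 here). The sketch below uses made-up wpSD values for four hypothetical participants; the function name is illustrative, and the p-value lookup is omitted.

```python
import math


def paired_t(x, y):
    """Return (t, df) for a paired-samples t-test on x minus y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and have at least 2 cases")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
    if sd_d == 0:
        raise ValueError("all differences identical; t is undefined")
    t = mean_d / (sd_d / math.sqrt(n))
    return t, n - 1


# Made-up wpSD values: uncalibrated vs. age-calibrated.
uncalibrated = [7.0, 8.0, 7.5, 9.0]
age_calibrated = [8.0, 8.5, 9.0, 10.0]
t, df = paired_t(uncalibrated, age_calibrated)
print(round(t, 2), df)  # -4.9 3
```

A negative t in this orientation indicates that the second (more calibrated) set of values is larger on average.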
Finally, we ranked all 30 test scores by the frequency with which each appeared as an individual’s lowest or highest score across the 327 participants to determine whether any individual measure(s) was especially likely to be high or low. There was large variability across individuals in terms of which test score was highest or lowest, with no specific test score being the lowest or highest for more than 12% of participants. This suggests that the patterns observed here are not unique to the specific tests administered.
Discussion
The primary aim of this study was to determine whether most healthy adults produce normally distributed scores regardless of their age, sex, race, years of education, IQ, and overall mean score on a battery of neuropsychological tests. To our knowledge, this and one other study of older adults with differing degrees of cognitive impairment (Reckess et al., 2014) are the first to examine IIV using this approach. As shown in Table 3, KS tests found that the within-person test score distributions of wpmean, wpSD, and wpskew were all normal, and this finding held regardless of how test scores were calibrated. The average wpmeans (49.8–50.1) and wpSDs (7.6–9.2) were close to their expected values, and the mean levels of wpskew and wpkurtosis were minuscule. No participant produced a wpskew coefficient that exceeded |2|, and 95% produced wpskew values between –1.1 and .9. Contrary to our hypothesis, KS tests showed that wpkurtosis distributions were not normal. However, just 10 to 16 of the 327 participants produced wpkurtosis coefficients that exceeded 2, and 95% produced wpkurtosis between –1.1 and 2.2 to 3.0, depending on test score calibration. The descriptive values shown in Tables 3 and 4 provide provisional markers of what to expect from reasonably healthy adults who complete a test battery with 27 to 30 cognitive measures. In short, 100% of healthy adults showed normal levels of wpskew and over 95% showed normal wpkurtosis. Thus, regardless of how their cognitive performance was calibrated, the overwhelming majority of study participants produced normal within-person test score distributions across the measures included in this test battery.
We found higher rates of slightly negative than positive wpskew, which averaged –.11 for both age-calibrated and fully calibrated test scores. This is consistent with prior research (Binder et al., 2009) and suggests that it is more common for healthy adults to produce a few very low scores than a few very high scores. Conversely, in a previous study, patients referred for dementia workups showed higher rates of positive than negative wpskew (Reckess et al., 2014). In that study, healthy older adults produced a mean wpskew of –.1 (which is nearly identical to the wpskew seen here), while patients referred for dementia workups produced wpskew coefficients that averaged +.1 to 1.2, and increased with dementia severity. Taken together, these findings suggest that healthy adults tend to produce mildly negative and lower wpskew coefficients than persons with dementia, whose wpskew coefficients turn increasingly positive as their illness progresses and their cognitive test performance worsens.
In the present sample, abnormal wpkurtosis was always positive, and no participant showed wpkurtosis below –1.3. Negative kurtosis denotes a flat (platykurtic) distribution. This occurs when most scores are represented an equal number of times. For example, if one plots the number of times a six-sided die lands on each side after 100 throws, the resulting distribution will be platykurtic. Such a pattern is hard to imagine for a cognitive test battery. None of the participants in this study showed marked platykurtosis, which suggests that doing so would be highly unusual. In the 3.1% to 4.9% of participants who showed abnormal wpkurtosis, it was always leptokurtic. This means that an unusually large number of their scores were close to their mean while a few were farther than expected from it.
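The die example can be checked directly. The sketch below computes the population (theoretical) excess kurtosis of six equally likely faces, which is about −1.27; this is the value for the idealized distribution rather than a bias-adjusted sample estimate from 100 actual throws, and the function name is illustrative.

```python
def population_excess_kurtosis(values):
    """Excess kurtosis (normal = 0) treating values as the full population."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0


# Each face of a fair six-sided die is equally likely.
die_faces = [1, 2, 3, 4, 5, 6]
k = population_excess_kurtosis(die_faces)
print(round(k, 2))  # -1.27
```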
Another goal of this study was to examine how within-person distribution parameters correlate with one another. These analyses revealed that most correlations among within-person parameters were weak and statistically non-significant. The three exceptions all involved age-calibrated scores and showed that wpmeans correlated inversely with wpSD and positively with wpskew, while wpskew correlated inversely with wpkurtosis. These findings suggest that participants who produced relatively higher wpmeans showed slightly less variability and slightly more negatively skewed test score distributions, and that negatively skewed distributions were more leptokurtic, although these correlations were weak and likely of little clinical significance.
Of greater clinical significance is the question of whether within-person distribution parameters vary by participant characteristics. These analyses showed that uncalibrated wpmean scores varied with age, race, years of education, and estimated IQ in the expected directions (Table 5). Fully calibrating test performance uncoupled the association of wpmean scores with every participant characteristic. Calibrating test performance for age alone strengthened the association between wpmean T-score and IQ, while fully calibrating test performance weakened the same association. This is consistent with expectation because IQ scores are calibrated only for age. White participants produced higher wpmean scores than Black participants using uncalibrated and age-calibrated scores, but the difference disappeared with full calibration. Importantly, wpSD, wpskew, and wpkurtosis showed no association with any participant characteristic, implying that these aspects of cognitive performance are reasonably independent of a person’s age, sex, race, educational background, and estimated premorbid ability. Thus, the within-person test score parameters that define distribution normality—wpskew and wpkurtosis—did not vary by any participant characteristic examined here.
We also assessed the impact of test score calibration on within-person distribution parameters with a series of paired-samples t-tests, which showed that degree of calibration had no effect on wpmean or wpkurtosis. Increasing the degree of calibration from none to age only to full calibration tended to broaden the within-person dispersion of test scores, as reflected by increasing wpSD from each level of calibration to the next. Also, both age and full calibration significantly lowered wpskew coefficients (from 0 to –.11) and increased the proportion of participants who showed negative wpskew compared with findings based on uncalibrated test scores. The impact of test score calibration on wpSD is interesting, but it does not affect within-person distribution normality, while the impact of test score calibration on wpskew does. However, Reckess et al. (2014) found that clinical patients tended to show higher rates of positive wpskew than healthy controls, that wpskew initially became increasingly positive as dementia severity worsened, and that average wpskew reached a zenith of 1.2 in the most severely demented patients tested in that study. Thus, using either age-calibrated or fully calibrated scores may serve to increase the sensitivity or specificity of using wpskew as a disease marker to the extent such calibration tends to decrease the wpskew coefficients of healthy adults.
In sum, the present findings indicate that a healthy adult’s performance across cognitive tests should approximate a Gaussian distribution in terms of within-person variance, skewness, and kurtosis, regardless of the person’s own test score mean, demographic background, and estimated premorbid IQ. If these results withstand cross-validation in other healthy samples and measures, then evaluating within-person test score distributions for departures from “normality” could offer a novel approach to clinical inference in neuropsychology. That is, computing an examinee’s test battery wpmean, wpSD, wpskew, and wpkurtosis could enable the clinician or researcher to easily determine whether the distribution of a person’s cognitive performance is normal or abnormal. Prior research has found IIV to be a sensitive marker of underlying pathology in dementia (Holtzer et al., 2008; Reckess et al., 2014), Alzheimer’s disease (Duchek et al., 2009; Tractenberg & Pietrzak, 2011), Parkinson’s disease (de Frias et al., 2007), HIV (Arce Rentería et al., 2019; Vance et al., 2021), schizophrenia (Cole et al., 2011), and other conditions. Looking at wpskew and wpkurtosis in these clinical populations would complement the prior IIV research and extend our understanding of their associated cognitive profiles.
One goal of the original ABC study (from which the current data were taken) was to establish norms for healthy adults across the adult age range. Volunteers were carefully screened for eligibility. Screening included physical and neurological examinations, psychiatric interviews, laboratory blood tests, and brain magnetic resonance imaging (MRI) scans. Thus, we are fairly confident in the composition of the sample. Of course, some participants could have had an undiagnosed disease or condition that affects cognitive function, possibly inflating the observed rates of abnormal wpskew or wpkurtosis. However, the finding that no participant produced a wpskew coefficient with an absolute value greater than 1.6 argues against this for within-person skew. The presence of unrecognized disease could have contributed to the abnormal leptokurtosis shown by a few participants, but the fact that fewer than 4% of participants showed this pattern suggests that any such impact was likely minimal. Neither can we exclude the possibility that some aspect of our subject selection process depressed rates of abnormal within-person distributions, although we used a rigorous community sampling framework. Both possibilities highlight the need to replicate these findings in other healthy samples and measures.
A second potential limitation of this study is that we used a single battery of standard co-normed neuropsychological tests. It is possible that our findings are unique to this specific test battery. We tried to evaluate this possibility by examining the frequency with which each measure appeared as a person’s highest or lowest score. The resulting frequencies ranged from 1% to 12%. This range seems relatively small, especially since a person could have more than one “highest” or “lowest” score (e.g., two or more test scores tied for either position). Indeed, the total number of highest and lowest scores exceeds 100% of cases for precisely this reason. Thus, we found no single measure that tended to pull test score distributions in one direction or another. For this reason, while a different test battery could yield within-person test score distributions that are less Gaussian in shape, if the tests are scored using the same norms, and the sample consists of healthy adults, then it is reasonable to expect the present findings to withstand replication. Future research will be needed to determine the minimum number of cognitive tests or domains that need to be administered to obtain reliably normal within-person test score distributions.
Third, we did not formally assess performance validity, and some participants may have put forth suboptimal effort. However, study participants agreed to a full day of assessments that also included blood testing, brain imaging, neurological examination, and structured psychiatric interview, and most brought a family member or friend as a knowledgeable informant. These procedures likely served to discourage those who were not highly motivated from joining the study, and this belief was supported by the comments of many participants. In addition, we found very low rates of failing embedded performance validity indices for the tests administered. Just one participant (0.3%) scored below 6 on the Brief Test of Attention (BTA) (Busse & Whiteside, 2012) and two (0.6%) scored below 6 on the Hopkins Verbal Learning Test—Revised (HVLT-R) delayed Recognition Discrimination index (Bailey et al., 2018; Sawyer et al., 2017). Slightly more scored below demographically calibrated T-scores (using CNNS) of 34 for the Trail-Making Test Parts A and B (4.3% and 3.4%, respectively) or 30 for the Grooved Pegboard Test with their dominant and non-dominant hands (1.7% and 1.4%, respectively) (Abeare et al., 2019; Erdodi et al., 2018; Erdodi & Lichtenstein, 2021; Jinkerson et al., 2023; Link et al., 2022). These rates of performance validity test (PVT) failure are all lower than expected based on the specificity rates for all six of these embedded performance validity measures. For these reasons, we doubt that rates of invalid cognitive performance contributed significantly to the findings reported here.
Finally, this study does not clarify whether the few participants who produced elevated wpkurtosis did so as a result of more trait-like or state-like factors. If a small percentage of healthy persons reliably produce leptokurtic within-person distributions on cognitive testing, this might denote a pattern of “splinter” strengths or weaknesses that is atypical but not abnormal. On the other hand, if a small percentage of neurologically healthy persons produce leptokurtic within-person test score distributions, but the individuals comprising this subgroup differ over time, this could indicate that doing so reflects clinically meaningless “noise” in testing. This again underscores the need for replication, ideally using a design in which participants are tested twice to examine the stability of any observed departures from normality.
Conclusions and Future Directions
To our knowledge, this is the first demonstration that most healthy adults produce normal within-person distributions on a standardized neuropsychological test battery. If this proves to be a replicable and generalizable phenomenon, examining parameters of within-person test score distributions could offer a new, complementary approach to the diagnosis of cognitive disorders, above and beyond traditional deficit measurement. Just as researchers must inspect the normality of data before proceeding with statistical analyses, the present findings suggest that it may be important for clinicians to look for departures from normality of the within-person test score distributions of patients referred for assessment before interpreting a few low scores. In the research setting, these findings raise the intriguing possibility that an older adult diagnosed with mild cognitive impairment (MCI) based on a few low scores in an otherwise normal distribution might be at much lower risk of converting to dementia than a person with the same low scores in a clearly abnormal within-person distribution. For example, if the hypothetical person depicted in Figure 1B produced a few T-scores between 25 and 35, this might represent a higher risk of conversion to dementia than the same low scores produced by the hypothetical person depicted in Figure 1C.
Finally, in addition to its potential as a novel method of inference in neuropsychology, the fact that most healthy adults produce Gaussian within-person test score distributions is fascinating in its own right. At its core, this finding speaks to the fundamental way in which human beings are endowed with mental abilities. Just as diverse mental abilities are normally distributed in the population at large, it appears that diverse mental abilities are distributed normally within each member of that population.
Footnotes
Acknowledgements
The authors gratefully acknowledge Dr. Jason Brandt for creating the archival database from which data for the clinical samples were drawn.
Declaration of Conflicting Interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Psychological Assessment Resources, Inc., D.J.S. is entitled to a share of royalties on sales of a test and software used in the study described in this article. The terms of this arrangement are being managed by the Johns Hopkins University in accordance with its conflict of interest policies.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Therapeutic Cognitive Neuroscience Professorship (B.G.); the Therapeutic Cognitive Neuroscience Fund (B.G.); the Benjamin and Adith Miller Family Endowment on Aging, Alzheimer's, and Autism (B.G.); the William and Mary Ann Wockenfuss Research Fund Endowment (B.G.); and United States Department of Health and Human Services, National Institutes of Health, National Institute of Mental Health Grant MH60504 (D.J.S.).
