Abstract
Background:
Coupling digital technology with traditional neuropsychological test performance allows collection of high-precision metrics that can clarify and/or define underlying constructs related to brain and cognition.
Objective:
To identify graphomotor and information processing trajectories using a digitally administered version of the Digit Symbol Substitution Test (DSST).
Methods:
A subset of Long Life Family Study participants (n = 1,594) completed the DSST. Total time to draw each symbol was divided into ‘writing’ and non-writing or ‘thinking’ time. Bayesian clustering grouped participants by change in median time over intervals of eight consecutively drawn symbols across the 90 s test. Clusters were characterized based on sociodemographic characteristics, health and physical function data, APOE genotype, and neuropsychological test scores.
Results:
Clustering revealed four ‘thinking’ time trajectories, with two clusters showing significant changes within the test. Participants in these clusters obtained lower episodic memory scores but were similar in other health and functional characteristics. Clustering of ‘writing’ time also revealed four performance trajectories where one cluster of participants showed progressively slower writing time. These participants had weaker grip strength, slower gait speed, and greater perceived physical fatigability, but no differences in cognitive test scores.
Conclusion:
Digital data identified previously unrecognized patterns of ‘writing’ and ‘thinking’ time that cannot be detected without digital technology. These patterns of performance were differentially associated with measures of cognitive and physical function and may constitute specific neurocognitive biomarkers signaling the presence of subtle to mild dysfunction. Such information could inform the selection and timing of in-depth neuropsychological assessments and help target interventions.
INTRODUCTION
Neurodegenerative processes underlying Alzheimer’s disease and other related dementias begin years, if not decades, before they are clinically evident [1, 2]. It is, therefore, imperative that clinical neuropsychologists have assessment tools that are able to detect emergent neurodegenerative change as early as possible so that research participants and patients referred for clinical evaluation can be correctly classified and diagnosed for needed services, including recruitment into possible clinical trials and future disease-modifying interventions.
The goal of identifying early neurodegenerative alterations is consistent with the Boston Process Approach (BPA) of neuropsychological assessment, which argues that the errors and strategies an individual uses to complete the task at hand are often more informative in elucidating underlying brain-behavior relationships than a simple, single summary test score [3, 4]. For example, a person with subtle brain changes may achieve a summary score that is statistically within the normal or acceptable range, but only with considerable effort and many self-corrected errors. Traditional, quantitative scoring would classify such performance as unimpaired. However, quantifying the process by which the task was completed could reveal subtle neurocognitive inefficiencies. The BPA explicitly emphasizes the multifactorial nature of neuropsychological tests in that most tests rely on multiple skills for successful completion. Therefore, impaired performance may have many different causes. As such, the BPA promotes meticulous recording of a person’s performance as a means to identify underlying affected cognitive processes [4]. Digital assessment technology facilitates the application of the BPA and thus has the potential to be a more sensitive and specific method of detecting subtle, emergent cognitive impairment, particularly among those presumed to be asymptomatic based on traditional measures.
The Digit Symbol Substitution Test (DSST) [5], also known as the Digit Symbol Test, is one of the original subtests used by Wechsler to construct his well-known intelligence scales. It is a timed test assessing graphomotor information processing speed in which participants are asked to fill in boxes with a symbol from a number-symbol key at the top of the page as quickly as possible. The traditional test score is the total number of correctly transcribed symbols in 90 or 120 s. This test is well known to be one of the most sensitive tests for detecting cognitive dysfunction [6, 7] because successful performance requires the coordination of multiple domains, including cognitive (e.g., attention, processing speed, episodic memory, and perception) and non-cognitive factors (e.g., motor speed and dexterity, visual scanning, and motivation). Deficits in any of these abilities could contribute to difficulty on the DSST [6, 8]. However, the multifactorial nature of the DSST also contributes to the test’s lack of specificity as to the source of impairment [6].
In line with implementing a process approach that might disambiguate the source of difficulty on this test, additional DSST subtests and testing procedures have been utilized [9]. For example, a symbol copy task asks participants to simply copy, rather than transcribe, a series of symbols to isolate the contribution of motor deficits. An incidental learning task is available where individuals write the digit symbol pairs that were learned after completing the original coding task to shed light on whether impaired associative learning may have hindered performance. Finally, to identify change in test performance, recording performance in 30-s intervals can capture deficits in establishing set (i.e., understanding and engaging in the task) and fatigue over the course of the test. Adding these subtests, however, increases testing burden and adds more time-consuming post-scoring procedures.
Interfacing standard DSST performance with digital pen technology obviates many of these problems (see [10] for review). The integration of recording output using a digital pen during neuropsychological test performance allows the examiner to capture data that is rich in process information that would otherwise be labor-intensive or impossible to collect. Digital pens can be used in place of an ordinary ballpoint pen to record time-stamped data throughout the task. To date, the majority of published work reporting use of the digital pen with traditional neuropsychological tests has been with the Clock Drawing Test (CDT). Software parses the total time to completion into time spent actually drawing (called ‘ink time’) versus time not spent drawing (called ‘think time’ [11]). Studies show that digitally captured ‘think time’ data can differentiate cognitively healthy adults from participants with mild cognitive impairment (MCI) or early dementia [12], cerebral small vessel disease [13], and depression [14]. Metrics extracted from the digital pen data are also able to capture additional facets of graphomotor production (e.g., number of pen strokes) and processing speed (e.g., decision-making latencies prior to individual components that are drawn) that differ by age [15] and clinical population [16, 17].
Here, we present an analysis of performance on the DSST captured using a digital pen among participants in a study of longevity. The DSST involves real-time learning of novel information whereas previous analyses of digital pen data have focused on the CDT which draws upon well-learned information (i.e., semantic knowledge of a clock). We hypothesized that digital DSST data would allow us to identify participants who get faster or slower during the test. Different thinking and writing time patterns of behavior may be associated with cognitive and physical functions that underlie test performance. Therefore, the aims of this study were: 1) to determine whether there are distinct patterns of performance for ‘thinking time’ and ‘writing time’ during the test; and 2) to assess whether and/or how these patterns correlate with measures assessing neurocognitive abilities versus health and functional outcome measures.
MATERIALS AND METHODS
Participants
Participants were from the Long Life Family Study (LLFS), a family study of healthy aging and longevity. Details of participant selection and study measures have been previously described (see [18, 19]). Briefly, participants were selected based on evidence of familial longevity as determined by a family longevity selection score which rated the survival exceptionality of all members of the proband generation [19]. Families selected to participate had at least two siblings in the proband generation and an individual in the offspring generation, all of whom had capacity to provide consent and were willing to participate in an in-home assessment. All interested family members from the proband and offspring generations were recruited. To date, 4,953 participants from 539 families have completed up to two in-person assessments approximately eight years apart: Visit 1 (performed 2006–2009) and Visit 2 (performed 2014–2017). At Visit 1, participants in the proband generation were 90 years of age on average (range 55 to 110 years) and those in the offspring generation were 61 years of age on average (range 25 to 88 years). All participants provided written informed consent prior to participation. This project was reviewed and approved by the Institutional Review Boards at all of the Field Centers (Boston University, Columbia University, and University of Pittsburgh) and the Data Management and Coordinating Center (Washington University, St. Louis) in the United States, and by the regional Institutional Review Board in Denmark.
Health and function data
At each visit, participants completed questionnaires to provide sociodemographic data, self-reported diagnoses of medical conditions, and ability to perform activities of daily living. The questionnaires also included the Pittsburgh Fatigability Scale [20, 21] (at Visit 2 and annually thereafter), a measure of perceived physical and mental fatigability in older adults with higher scores (0–50 for both subscales) reflecting greater fatigability. Health and function assessments included the Short Physical Performance Battery (comprising balance assessment, gait speed, and chair stands), grip strength, anthropometrics (e.g., height, weight, and abdominal circumference), blood pressure, spirometry, a carotid ultrasound (at Visit 2 only), a cognitive assessment, and the 10-item Center for Epidemiologic Studies Depression Scale (CES-D). Spirometry was used to measure lung function (i.e., forced expiratory volume during the first second of exhalation, FEV1). The carotid ultrasound examination was used to measure intima-media thickness (IMT) of the common carotid artery, a marker of subclinical vascular disease. Examiners also recorded the participant’s current medications. A blood draw was performed for genetic and laboratory analyses and biobanking at the central laboratory of the LLFS at the University of Minnesota. Participants were contacted every one to three years to collect information on vital status, changes in health, medications, and daily activities and to administer a brief test of cognitive function, a modified version of the Telephone Interview for Cognitive Status (TICS) [22, 23].
Cognitive assessment
At Visit 2, the cognitive assessment included the Mini-Mental State Examination (MMSE), Hopkins Verbal Learning Test–Revised (HVLT), the Clock Drawing Test (CDT) command and copy conditions, Trail Making Test Parts A and B, Logical Memory from the Wechsler Memory Scale–Revised, Number Span Test from the National Alzheimer’s Coordinating Center Uniform Data Set, Digit Symbol Substitution Test (DSST) from the Wechsler Adult Intelligence Scale–Revised, semantic fluency (animals), and phonemic fluency (i.e., the Controlled Oral Word Association Test; letters F, A, S). A Clinical Dementia Rating score was also assigned. Participants used a digital pen for all written responses to test items (e.g., MMSE pentagons and sentence, CDT, Trails A and B, DSST). The digital pen looks and writes like a wide-body ballpoint pen but has a small digital camera aimed at the tip of the pen that captures movements across specially formatted paper. The camera records x and y coordinates of the pen’s position on the paper at 13-ms intervals, allowing for playback of the participant’s process of drawing or writing. Digital pen data for the DSST were analyzed in this study. DSSTs with a score of zero, or noted to be affected by hearing, vision, inability to write, participant refusal, experimenter error, environmental distractions, or physical limitations (n = 62), were omitted from the analyses.
Extraction of digital metrics
An in-person Visit 2 assessment with digital pen data collection of the DSST was completed by 2,634 participants. A loss of 303 digital pen files occurred due to improper storage; therefore, a total of 2,331 unique DSST digital pen files were converted to .txt files that included separate records for the coordinates and time for each pen stroke, where a new stroke started each time the pen was placed on the paper. We prioritized the extraction of thinking time, which, based on evidence from other digitally administered tests, may be a valuable metric for detecting differences in cognition in aging [15] and early Alzheimer’s disease [12]. In line with the BPA, we aimed to capture behaviors that are elicited specifically by the DSST and therefore focused on the analysis of change in performance over a repetitive task, which may inform about set engagement, associative learning, and fatigue [3]. To assign each stroke to one of the boxes of the DSST exam grid, the exam grid was first traced using the digital pen to generate the x and y coordinates of the corners of each of the 93 boxes, and then these x and y coordinates were widened by 5% to account for pen strokes written slightly outside a box (Fig. 1). Starting from the first box, each pen stroke was then assigned to a box if at least 75% of the graphic output for a stroke (i.e., the x and y coordinates sampled at 13-ms intervals) was in the box. Participants who had strokes that could not be assigned to a box, or who filled in boxes out of order, were excluded from the analysis. A further 23 participants were excluded because they did not complete any segments (i.e., completed fewer than eight boxes). In total, 1,594 participant files were included in the analysis. As shown in Supplementary Table 1, participants whose pen files were excluded were older (mean age 74.7±12.3 versus 70.2±10.0 years, p < 0.0001) and more likely to be female (58.2% versus 53.4%, p = 0.03). Excluded participants also differed in terms of educational attainment (p = 0.02) and field center (p < 0.0001). Excluded participants did not differ significantly on any of the other health or functional characteristics examined in the following analyses.

Assignment of pen strokes to boxes.
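The box-assignment rule described above can be illustrated with a minimal Python sketch. It is not the study's software: the `Box` class, the function names, and the reading of "widened by 5%" as a 5% increase in each box dimension split evenly between the two sides are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned box traced from the exam grid (hypothetical representation)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def widened(self, frac=0.05):
        # Assumption: each dimension grows by `frac` in total, half on each side.
        dx = (self.x_max - self.x_min) * frac / 2
        dy = (self.y_max - self.y_min) * frac / 2
        return Box(self.x_min - dx, self.y_min - dy, self.x_max + dx, self.y_max + dy)

    def contains(self, x, y):
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def assign_stroke(points, boxes, min_fraction=0.75, widen=0.05):
    """Return the index of the first box holding at least `min_fraction`
    of the stroke's sampled (x, y) points, or None if no box qualifies."""
    if not points:
        return None
    for index, box in enumerate(boxes):
        wide = box.widened(widen)
        inside = sum(wide.contains(x, y) for x, y in points)
        if inside / len(points) >= min_fraction:
            return index
    return None


# Illustrative coordinates only, not study data.
grid = [Box(0, 0, 10, 10), Box(10, 0, 20, 10)]
stroke = [(1, 1), (2, 2), (3, 3), (10.2, 5)]  # last point strays just past the edge
print(assign_stroke(stroke, grid))
```

Scanning the boxes in order mirrors the "starting from the first box" wording; a stroke that returns `None` corresponds to the unassignable strokes that led to exclusion.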
Next, the ‘thinking’ or non-writing time prior to the transcription of the response into each box and ‘writing’ time for each box were determined. The ‘thinking’ time prior to each box was defined as the time from the lifting of the pen in the prior box to the placement of the pen in the next box. Thinking time was not assigned to the first box. The ‘writing’ time for a box was defined as the sum of the time writing all strokes assigned to the box.
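As a hedged illustration of these two definitions, the sketch below computes per-box times from a list of strokes that have already been assigned to boxes. The tuple layout `(box_index, pen_down, pen_up)` and the function name `box_times` are hypothetical. Under this reading, pauses between strokes inside the same box count toward neither metric, which follows the definitions as stated but is not confirmed in detail by the text.

```python
def box_times(strokes):
    """Compute writing and thinking time per box.

    strokes: list of (box_index, pen_down, pen_up) tuples in seconds,
             ordered in time, with boxes filled in ascending order.
    Returns {box_index: {"writing": float, "thinking": float or None}}.
    """
    writing = {}
    first_down = {}
    last_up = {}
    for box, down, up in strokes:
        writing[box] = writing.get(box, 0.0) + (up - down)  # sum of stroke durations
        first_down.setdefault(box, down)                    # first pen placement in the box
        last_up[box] = up                                   # last pen lift in the box

    result = {}
    previous = None
    for box in sorted(writing):
        # Thinking time: pen lift in the prior box to pen placement in this box.
        # It is undefined for the first box.
        thinking = None if previous is None else first_down[box] - last_up[previous]
        result[box] = {"writing": writing[box], "thinking": thinking}
        previous = box
    return result


# Illustrative timestamps only, not study data.
example = [(0, 0.0, 0.4), (0, 0.5, 0.8), (1, 2.0, 2.5), (2, 3.5, 3.9)]
for box, times in box_times(example).items():
    print(box, times)
```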
Statistical analysis
Bayesian model-based clustering was used to group participants based on 1) median ‘thinking’ time over non-overlapping segments of eight consecutive boxes, and 2) median ‘writing’ time over the same segments of eight consecutive boxes. The number of completed segments varied across participants (range 1–10) (distribution shown in Supplementary Table 2). The advantage of Bayesian model-based clustering is that it allows for analyzing trajectories of differing lengths (e.g., number of completed segments of consecutive boxes), it provides parametric forms of the cluster-specific trajectories, and inference on the parameters is based on their marginal distribution rather than their conditional distribution that depends on nuisance parameters [24]. The outcome in this analysis was based on the median ‘thinking’ time or ‘writing’ time over all segments of multiple consecutive boxes to smooth the data. Segment lengths of four and eight boxes were considered, and eight box segments were chosen for analysis as they sufficiently smoothed the data. Outcome variables were normalized by subtracting the median time for the first segment from the median time of each subsequent segment to assess patterns of change independent of the initial time.
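The segmenting and normalization steps can be sketched as follows. This is an illustrative reading rather than the authors' code: the function names are hypothetical, only complete segments of eight boxes are kept, and the missing thinking time for the first box is simply skipped when taking the median of the first segment (the text does not say how that case was handled).

```python
from statistics import median


def segment_medians(times, segment_len=8):
    """Median time over non-overlapping segments of `segment_len` boxes.

    times: per-box times in box order; None marks an undefined value
           (e.g., thinking time for the first box).
    Trailing boxes that do not fill a complete segment are dropped (assumption).
    """
    n_segments = len(times) // segment_len
    medians = []
    for s in range(n_segments):
        chunk = times[s * segment_len:(s + 1) * segment_len]
        medians.append(median(t for t in chunk if t is not None))
    return medians


def normalize_to_first(medians):
    """Subtract the first segment's median so every trajectory starts at zero."""
    if not medians:
        return []
    first = medians[0]
    return [m - first for m in medians]


# Illustrative values only: 21 boxes give two complete segments.
thinking = [None] + [1.0] * 7 + [2.0] * 8 + [3.0] * 5
trajectory = normalize_to_first(segment_medians(thinking))
print(trajectory)
```

Because the first normalized value is zero by construction, it carries no information about change, which is why the regressions described next omit it.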
Linear regression was used to model the outcome as a function of the segment number, assessing both linear and quadratic models. No intercept term was included as the first value of the outcome was zero for all participants by design. Accordingly, the initial zero values were omitted from the regressions. In model-based clustering, the coefficients for each term in the model are allowed to differ by cluster and uninformative priors were assumed for all parameters. For a fixed number of clusters, the model was fit via Markov Chain Monte Carlo using 5,000 samples after discarding the first 1,500 samples. Model fitting started with both linear and quadratic models assuming two clusters, and then the number of clusters was increased until the deviance information criterion (DIC) began to increase. The best fitting linear model was compared with the best fitting quadratic model, and the model with the lower DIC was selected as the final model. DIC is ideally suited for selection of mixture models with hidden variables because of the correction by the effective number of parameters [25]. Summary distributions of the cluster membership suggest that cluster allocation was very accurate, see Supplementary Table 3.
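For orientation, the cluster-specific regression and the model selection criterion can be written out as below. The notation is an assumption introduced here, not taken from the source: y_{ij} is the normalized median time of participant i in segment j, t_j is the segment-number covariate (its exact coding is not specified in the text), z_i is the latent cluster label, and J_i is the number of completed segments. A single residual variance is assumed for simplicity. The DIC is given in its standard form; the implementation used in the analysis may compute the penalty term differently.

```latex
% Assumed notation; the linear model is the special case \beta_{2k} = 0.
\begin{align}
  y_{ij} \mid z_i = k &\sim \mathcal{N}\!\left(\beta_{1k}\, t_j + \beta_{2k}\, t_j^{2},\; \sigma^{2}\right),
    \qquad j = 2, \dots, J_i, \\
  D(\theta) &= -2 \log p(y \mid \theta), \\
  \mathrm{DIC} &= \bar{D} + p_D, \qquad p_D = \bar{D} - D(\bar{\theta}),
\end{align}
% \bar{D} is the posterior mean deviance and \bar{\theta} the posterior mean of the parameters.
```

The absence of an intercept reflects the normalization: every trajectory is zero in the first segment, so only the slope (and curvature, in the quadratic model) distinguishes clusters.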
Cluster membership was assessed for associations with sociodemographic characteristics, health and physical function data, APOE genotype, and neuropsychological test scores. All comparisons were adjusted for age, sex, education, and field center (Boston, Denmark, New York, or Pittsburgh). Continuous variables were compared by fitting linear models, and categorical variables by fitting logistic regression models. To account for family relatedness, these models were fit using generalized estimating equations with an exchangeable working correlation matrix. Time to death was compared by fitting a Cox Proportional Hazards model, adjusting for age, sex, education, and field center, and using a robust sandwich estimate of the covariance matrix. Change in TICS score (as scored in [26] which includes tasks of counting backwards, subtracting sevens, and immediate and delayed recall of a word list, score range 0–27) over follow-up after the Visit 2 assessment was modeled with a linear mixed effects model, adjusting for age, sex, education, field center, follow-up time, cluster, and an interaction between cluster and follow-up time. A random intercept and slope for follow-up time were included to account for the repeated measures. Finally, agreement between thinking time and writing time cluster membership was assessed using a weighted kappa statistic with linear weights. Analyses were conducted in RStudio, using the package rjags, and SAS v9.4.
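The agreement statistic can be illustrated with a short, dependency-free sketch of Cohen's weighted kappa with linear weights, computed from a square cross-tabulation of the two cluster assignments. The function name and the example table are hypothetical; the counts are not study data. Note that a weighted kappa treats the cluster labels as ordered categories.

```python
def linear_weighted_kappa(table):
    """Cohen's weighted kappa with linear agreement weights.

    table: square list of lists; table[i][j] counts participants in
           category i on the first rating and category j on the second.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]

    observed = 0.0
    expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = 1 - abs(i - j) / (k - 1)  # 1 on the diagonal, 0 at maximum distance
            observed += weight * table[i][j] / n
            expected += weight * row_totals[i] * col_totals[j] / n ** 2
    return (observed - expected) / (1 - expected)


# Illustrative 4 x 4 table of counts (rows: first clustering, columns: second).
counts = [
    [5, 2, 1, 0],
    [2, 6, 3, 1],
    [1, 3, 7, 2],
    [0, 1, 2, 4],
]
print(round(linear_weighted_kappa(counts), 3))
```

Values near zero indicate agreement no better than chance, and a value of one indicates perfect agreement.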
RESULTS
Table 1 shows demographic characteristics, DSST scores, and digital metrics for the 1,594 participants included in the analysis. Participants were 70 years of age on average, ranging from 45 to 103 years, 52% female, predominantly non-Hispanic white (99%), highly educated (61% with a college degree or advanced degree), and generally cognitively healthy (median MMSE score = 29.0, IQR 29.0–30.0). Average ‘thinking’ time and ‘writing’ time were strongly associated with DSST raw score (Pearson correlation coefficient –0.79 and –0.64, respectively, both p < 0.0001). Supplementary Table 3 shows that DSST raw score, average ‘thinking’ time, and average ‘writing’ time were generally associated with all of the health and functional characteristics, with the exception of grip strength and gait speed.
Participant characteristics of the Long Life Family Study sample, n = 1,594
*64 participants were missing ethnicity data. ∧Reported as Median (Q1-Q3). MMSE, Mini-Mental State Examination, maximum score is 30; DSST, Digit Symbol Substitution Test.
Thinking time trajectories
Based on DIC, a model with four clusters and a linear term for segment number was the best fit to the ‘thinking’ time data (Fig. 2). The four patterns of ‘thinking’ time performance trajectories across the 90 s of the DSST are shown in Fig. 3. Most participants had relatively stable trajectories with only minor within-task changes in ‘thinking’ time and fell into one of two clusters: individuals in cluster 2 (n = 476) had a slight increase in ‘thinking’ time and individuals in cluster 3 (n = 857) had a slight decrease in ‘thinking’ time (Table 2). In contrast, clusters 1 and 4 showed greater within-task changes in ‘thinking’ time. Cluster 1 (n = 168) comprised “slowing thinkers”, individuals who performed at an average rate at the beginning of the test and increased their ‘thinking’ time across subsequent intervals. Cluster 4 (n = 93), the “quickening thinkers”, showed the opposite pattern, with a significantly longer ‘thinking’ time in the first segment (1.9 versus 1.1 s, p < 0.0001) and a decrease in ‘thinking’ time across the intervals.

Deviance Information Criterion (DIC) values for number of model clusters.

Thinking time trajectory clusters. Clusters of participants with similar patterns of change in thinking time per segment during the Digit Symbol Substitution Test (color/black lines) and their fitted trajectories (black lines) are shown overlaid on the data for all participants (gray lines). The intercepts for the fitted trajectories were chosen so that the lines pass through the center of the data for each cluster.
Regression coefficients and 95% credible intervals from model-based clustering of thinking and writing time, n = 1,594
Regression coefficients from models of thinking time or writing time as a linear function of segment number.
Descriptive characteristics of the ‘thinking’ time trajectory clusters (Table 3) revealed similar health and functional parameters between the two stable clusters (clusters 2 and 3) as well as between the clusters of slowing and quickening thinkers (clusters 1 and 4). The clusters of slowing and quickening thinkers (clusters 1 and 4, respectively) were older (p < 0.0001), had a lower DSST raw score (p < 0.0001) and HVLT Total Recall score (p = 0.04), took longer to complete the CDT (command condition p = 0.01; copy condition p < 0.001, results not shown), and trended toward lower scores on HVLT Delayed Recall (p = 0.05). These clusters did not differ on measures of physical function. Over a median of 2.6 years of follow-up, the cluster of quickening thinkers (cluster 4) showed a significant decline in TICS of about half a point per year (–0.64, SE 0.29, p = 0.03), with the slowing thinkers (cluster 1) showing a similar trend (–0.45, SE 0.24, p = 0.06). Relative to cluster 3 (i.e., the stable-quickeners), the quickening thinkers (cluster 4) had a significantly lower mortality rate (adjusted hazard ratio [aHR] 0.33, 95% confidence interval [CI] 0.12–0.94, p = 0.04), whereas the other clusters had similar mortality to cluster 3.
Health and functional characteristics of thinking time trajectory clusters
Mean±SD for continuous variables; counts (%) for categorical variables. p-values for comparisons adjusted for age, sex, education level, and field center. ∧Median (Q1–Q3). p-value based on log-transformed variable. *Modeling time to death, adjusted for age, education level, and field center and stratified on sex. **Modeling TICS score as a function of age, sex, education level, field center, follow-up time, cluster, and an interaction of cluster and follow-up time. The p-value in each column is for a test that the slope is equal to zero. N = 1,014, as the Denmark field center did not perform the TICS exam. †p-value for cluster × follow-up time interaction. PFS, Pittsburgh Fatigability Scale; FEV1, forced expiratory volume during the first second of exhalation; IMT, intima-media thickness; MMSE, Mini-Mental State Examination; DSST, Digit Symbol Substitution Test; HVLT, Hopkins Verbal Learning Test–Revised; CDT, Clock Drawing Test; CES-D, Center for Epidemiologic Studies Depression Scale; TICS, Telephone Interview for Cognitive Status.
Writing time trajectories
Similar to ‘thinking’ time, a model with four clusters and a linear term for segment number was the best fit to the ‘writing’ time data based on DIC (Fig. 2). Most participants demonstrated relatively stable ‘writing’ time across the test, represented by clusters 2 (n = 181) and 3 (n = 1,142), as shown in Fig. 4. A small number of individuals (n = 26, cluster 1) demonstrated slowing ‘writing’ speed across the test, spending more time writing in the last segment of eight completed boxes than during the first segment. Cluster 4 (n = 245) showed a quickening trajectory in which individuals spent more time writing symbols in the first segment of eight boxes than individuals in the other three clusters and then spent progressively less time writing symbols as the test progressed.

Writing time trajectory clusters. Clusters of participants with similar patterns of change in writing time per segment during the Digit Symbol Substitution Test (color/black lines) and their fitted trajectories (black lines) are shown overlaid on the data for all participants (gray lines). The intercepts for the fitted trajectories were chosen so that the lines pass through the center of the data for each cluster.
Health and functional characteristics were generally similar between the two stable ‘writing’ time trajectory clusters (clusters 2 and 3), except that cluster 2 contained a greater proportion of females than cluster 3 (67% and 52%, respectively; Table 4). Both cluster 1, the slowing writers, and cluster 4, the quickening writers, were older and had poorer DSST raw scores (both p < 0.0001), yet there were no differences between clusters on the other neuropsychological tests performed at Visit 2. Perceived physical fatigability was the only health and functional characteristic that differed between all clusters (p = 0.03). Analysis of health outcomes found that cluster 1 had a greater annualized decline in general cognitive ability (i.e., TICS score) over follow-up (–1.83, SE 0.74, p = 0.02), and cluster 4, the quickening writers, had a significantly higher risk of mortality over follow-up (aHR 1.70, CI 1.03–2.81, p = 0.04). As physical function values were notably lower for cluster 1 but may have been obscured by the small sample size of this cluster in the primary analysis, secondary analyses were performed to compare all health and functional characteristics for cluster 1 to clusters 2–4. Cluster 1 had poorer performance in several domains, including grip strength (p = 0.02), gait speed (p = 0.02), and perceived physical fatigability (p = 0.008).
Health and functional characteristics of writing time trajectory clusters
Mean (SD) for continuous variables; counts (%) for categorical variables. p-values for comparisons adjusted for age, sex, education level, and field center. ∧Median (Q1–Q3). p-value based on log-transformed variable. *Modeling time to death, adjusted for age, education level, and field center and stratified on sex. **Modeling TICS score as a function of age, sex, education level, field center, follow-up time, cluster, and an interaction of cluster and follow-up time. The p-value in each column is for a test that the slope is equal to zero. N = 1,014, as the Denmark field center did not perform the TICS exam. †p-value for cluster × follow-up time interaction. PFS, Pittsburgh Fatigability Scale; FEV1, forced expiratory volume during the first second of exhalation; IMT, intima-media thickness; MMSE, Mini-Mental State Examination; DSST, Digit Symbol Substitution Test; HVLT, Hopkins Verbal Learning Test–Revised; CDT, Clock Drawing Test; CES-D, Center for Epidemiologic Studies Depression Scale; TICS, Telephone Interview for Cognitive Status.
Cluster concordance
Figure 5 shows concordance between ‘thinking’ time and ‘writing’ time clusters. Overall, 47% of individuals had concordant ‘thinking’ and ‘writing’ time trajectories (weighted kappa 0.02, CI –0.01 to 0.05). Concordance ranged from 0.1% for individuals classified as both slowing thinkers and slowing writers (cluster 1 for both trajectories) to 41.5% for individuals who were classified as stable-quickening for both ‘thinking’ and ‘writing’ time trajectories (cluster 3). Out of the 261 individuals in the clusters that had the lowest DSST raw scores and the greatest amount of change in ‘thinking’ time (clusters 1 and 4), only 78 (30%) also fell into clusters of ‘writing’ time that showed significant increases or decreases during the test. Notably, discordant change patterns were common; 33% of slowing thinkers (cluster 1) were quickening writers (cluster 4) and 46% of slowing writers (cluster 1) were quickening thinkers (cluster 4).

Participant overlap between thinking time and writing time clusters.
DISCUSSION
Novel digital metrics revealed differences in performance patterns (i.e., changes in ‘thinking’ time and ‘writing’ time) during a brief neuropsychological test that have distinct cognitive and physical function correlates. We identified subsets of individuals who showed significant changes in ‘thinking’ speed and who also had lower episodic memory test scores but did not differ in physical function. In contrast, individuals who had decreasing ‘writing’ speed did not differ in cognitive scores but instead had greater perceived physical, but not mental, fatigability as well as poorer gait speed and grip strength in secondary analyses. Taken together, these findings suggest that, when digital technology is coupled with traditional neuropsychological test administration, metrics measuring ‘thinking’ time versus ‘writing’ time capture the processes underlying summary test scores and provide clinically meaningful information beyond those scores.
The low concordance of ‘thinking’ time and ‘writing’ time is consistent with the tenets underlying the BPA and underscores the added value of digital DSST metrics. Poorer DSST scores were associated with clusters that demonstrated significant within-task changes in ‘thinking’ time or ‘writing’ time, yet only 30% of individuals who had an unstable ‘thinking’ trajectory also had an unstable ‘writing’ trajectory, and many had opposing trajectories (e.g., slowing thinking and quickening writing). This suggests that by parsing out ‘thinking’ time from ‘writing’ time we were able to uncover differential processes contributing to lower DSST scores. Some individuals have poorer scores due to differences in the information processing component of the test, whereas other individuals have poorer scores due to graphomotor components. This further strengthens the notion that overall scores for test performance on the DSST confound the source of process-specific deficits and omit information that may have clinical utility.
Poor DSST performance as measured by traditional scoring has been associated with a higher risk for future cognitive impairment and dementia [27, 28]. Yet, in this study we found that only certain ‘thinking’ time and ‘writing’ time trajectories were associated with incident cognitive decline of more than half a point per year on the TICS. Interestingly, it was the quickening thinkers (cluster 4) and the slowing writers (cluster 1) who showed significant cognitive declines over follow-up, with slowing thinkers (cluster 1) showing a similar trend. In comparison, a ten-point lower raw score on the DSST was associated with a decline of 0.2 points per year on the TICS (Supplementary Table 4). This suggests that impairments in engaging in cognitive tasks and maintaining motor speed may be early, subtle neurocognitive biomarkers of impending cognitive decline.
These ‘thinking’ time and ‘writing’ time trajectories were identified among healthy aging research study participants but may have utility for individuals presenting to a memory clinic. The development of a classifier to predict an individual’s ‘thinking’ time and ‘writing’ time cluster based on their digital DSST data could indicate their risk of related cognitive and physical deficits as well as incident cognitive decline. This information could signal to clinicians that in-depth testing is warranted and inform whether neuropsychological assessments or physical/neurological assessments may be of greater value. This increased specificity would also be valuable in identifying whether therapeutic interventions for each individual should be targeted toward cognitive or physical enhancement.
Neuropsychological correlates of performance trajectories
Digital DSST ‘thinking’ time clusters were associated with cognitive abilities that underlie performance on this test, including episodic memory and visuospatial function (CDT). Although prior research has concluded that episodic memory makes only a negligible contribution to overall DSST performance, estimated at 5–7% [29], our results suggest that for a subset of individuals who show change in ‘thinking’ time over the course of the test, poorer episodic memory function may be related to poorer DSST performance. Lower performance on the learning trials of the HVLT may indicate deficits in attention or in engaging with the DSST task, whereas performance on delayed recall may reflect difficulty remembering the task instructions or deficits in associative memory (i.e., learning the symbol-digit pairs). Interestingly, the association of cluster membership with differences in episodic memory was only evident on the list learning task and was not seen for paragraph recall. A notable difference between the two tests is that the HVLT word list is repeated three times, allowing additional opportunities to learn the stimuli and implement strategies to remember the word list more effectively, which may relate more directly to an individual’s ability to learn the symbol code on the DSST.
DSST ‘thinking’ time but not ‘writing’ time clusters were associated with digitally obtained measures from the CDT including total time, thinking time, and writing time. As the CDT is also a multifactorial test drawing upon cognitive functions including visuospatial function, working memory, conceptualization, and organization, it is difficult to ascertain which specific underlying constructs relate to slower performance on both tests. Additionally, one would expect slow graphomotor speed to affect writing and drawing tests similarly, yet the DSST ‘writing’ time trajectories were not associated with CDT writing time. One of the fundamental differences between the CDT and DSST is that the CDT accesses information already stored in semantic memory (i.e., shape of the outer circle, placement of the numbers, and time setting) and is completed without time restrictions. In contrast, the DSST is a timed test that requires establishing and maintaining mental set on a non-automatized test with greater dependence on visual scanning. Additionally, participants often self-correct or redraw components on the CDT to conform to their semantically stored image, which results in longer writing times and thereby contaminates writing time with thinking. This dissociation may also reflect differences in the constructs of writing speed as compared to change in writing speed. Additional studies are needed to deconstruct the multiple cognitive skills captured in writing time and how these constructs vary across tests.
The association of ‘thinking’ time clusters with neurocognitive, but not physical function correlates, with the exception of CDT writing time, suggests that the isolation of ‘thinking’ time (i.e., non-writing time) was able to reduce the contribution of motor function associated with DSST performance [29] and better indicate the contribution of cognitive processing. However, it may still be confounded by motor demands related to visual scanning [30], and incorporation of measures of eye tracking [31] could help refine the ‘thinking’ time metric as a measure of information processing.
Writing time trajectories may also provide better metrics of graphomotor speed than traditional DSST scores. Comparisons of DSST scores with performance on a simpler symbol copy task in other studies have shown that approximately 50% of the variance in DSST scores is related to psychomotor speed and that this correlation increases with age [29]. The dissociation of ‘writing’ time from ‘thinking’ time in the current analysis, rather than gross performance on a symbol copy test, might be a more direct measure of graphomotor speed as the symbol copy task is confounded by perceptual speed [29].
Strengths and limitations
The strengths of this study lie in the wide age range of participants (45 to 103 years), the availability of in-depth phenotyping across health domains, including physical function tests, measures of vascular health, and a multi-domain neuropsychological protocol, as well as the fact that this is a healthy aging cohort, which may be particularly valuable for identifying early, subtle markers of cognitive impairment. One limitation of the study is that the LLFS cohort is a selected sample of individuals with familial longevity and thus a predisposition for healthy aging. This may have biased our results toward negative findings for associations of ‘thinking’ and ‘writing’ time patterns with stroke, hypertension, diabetes, atrial fibrillation, or subclinical measures of vascular health (i.e., carotid IMT), which have been previously associated with DSST performance [32, 33]. Additionally, participants are highly educated and predominantly non-Hispanic white, and therefore analyses of DSST performance trajectories in other demographic and clinical cohorts are warranted. Secondly, the cognitive protocol does not include tests of visual scanning or sustained attention, which may have revealed differential associations with ‘thinking’ time clusters. Thirdly, although performance on the DSST and similar coding tests has been noted to be a better predictor of death among older adults than other cognitive tests [34–36] and chronological age [37], the differences in mortality risk for some clusters need to be interpreted with caution as the period of follow-up was relatively short (median 2.6 years), the number of deaths was low (n = 82), and other confounders of mortality risk should be considered. Finally, these analyses did not include adjustment for multiple comparisons and thus should be replicated in larger study populations.
CONCLUSIONS
Integration of digital technologies with standard neuropsychological testing protocols allows for the capture of high-precision process data that the examiner could not otherwise capture. This is particularly valuable for the administration of brief, highly sensitive, multifactorial tests, as the digital data may allow for differentiation of the contributions of dysfunction in specific underlying cognitive and motor processes, data that are confounded in overall test scores. Thus, digital metrics, such as the ‘thinking’ and ‘writing’ trajectories identified in this study, may specifically inform the selection and timing of in-depth neuropsychological assessments or could be a marker for other subclinical risk factors associated with the disablement pathway (e.g., fatigability) and may help target appropriate interventions.
ACKNOWLEDGMENTS
We would like to thank Ben Wasserman and the Framingham Heart Study for sharing PowerShell scripts that aided in the conversion of the digital pen files. We would also like to thank the extraordinary participants of the Long Life Family Study who make this work possible.
This work was supported by the National Institute on Aging (K01AG057798 to S.L.A., U01AG023749 to S.C., U01AG023755 to T.T.P., U01AG023712, U01AG023744, U01AG023746, U19AG063893); the National Institute of General Medical Sciences Interdisciplinary Training Grant for Biostatisticians (T32 GM74905) to B.S.; the Boston University School of Medicine Department of Medicine Career Investment Award to S.L.A.; and the Marty and Paulette Samowitz Foundation to T.T.P. Additionally, the Claude D. Pepper Older Americans Independence Center, Research Registry and Developmental Pilot Grant (NIH P30 AG024827), and the Intramural Research Program, National Institute on Aging supported N.W.G. to develop the Pittsburgh Fatigability Scale.
