Abstract
Mobile technologies are increasingly used to measure cognitive function outside of traditional clinic and laboratory settings. Although ambulatory assessments of cognitive function conducted in people’s natural environments offer potential advantages over traditional assessment approaches, the psychometrics of cognitive assessment procedures have been understudied. We evaluated the reliability and construct validity of ambulatory assessments of working memory and perceptual speed administered via smartphones as part of an ecological momentary assessment protocol in a diverse adult sample (N = 219). Results indicated excellent between-person reliability (≥0.97) for average scores, and evidence of reliable within-person variability across measurement occasions (0.41-0.53). The ambulatory tasks also exhibited construct validity, as evidenced by their loadings on working memory and perceptual speed factors defined by the in-lab assessments. Our findings demonstrate that averaging across brief cognitive assessments made in uncontrolled naturalistic settings provides measurements that are comparable in reliability to assessments made in controlled laboratory environments.
Traditional approaches to measuring cognitive function have relied on testing procedures administered by trained technicians, in laboratory or clinical settings, and usually on a single occasion. The artificial nature of standard testing environments and unmeasured sources of within-person variability can negatively affect the ecological validity and reliability of standard approaches to cognitive assessment (Allard et al., 2014; Timmers et al., 2014). Health and social scientists have addressed similar concerns by using ecological momentary assessment (EMA) methods to study people’s physical symptoms, psychological states, and behaviors in real-time and in people’s natural environment (Shiffman, Stone, & Hufford, 2008; Smyth & Stone, 2003). EMA is a particular approach to ambulatory assessment defined by several key features (Shiffman et al., 2008). First, data collection occurs in real-world environments as participants go about their daily activities, which is the “ecological” aspect of EMA. Second, assessments emphasize an individual’s current state, which is the “momentary” aspect of EMA. Third, EMA involves the strategic sampling of moments for assessment. For example, random sampling would facilitate characterization of a person’s “average” or typical level on a variable of interest (e.g., memory performance) through representative sampling of occasions, situations, and settings. And fourth, EMA involves multiple assessments which can elucidate how behaviors arise and vary over time and different situations.
Recent advances in mobile technology have enabled researchers to embed objective assessments of cognitive function into studies that use EMA methods. Important questions remain, however, regarding the reliability and validity of cognitive assessments administered using mobile technology in uncontrolled and naturalistic settings. The purpose of the present study is to evaluate the reliability and construct validity of a brief battery of cognitive tests administered on smartphones as part of an EMA study.
The Rationale for Ambulatory Cognitive Assessment
A primary motivation for ambulatory assessment is to enhance ecological validity. Ecological validity is especially important for the field of neuropsychology, which has become increasingly focused on understanding the relationship between assessment results and real-world cognitive function (Spooner & Pachana, 2006). Cognitive scientists and neuropsychologists have, for the most part, been concerned with the verisimilitude of cognitive assessments, which refers to the ecological validity of stimuli, responses, and other task demands (Spooner & Pachana, 2006), and veridicality, which refers to the predictive relationship between cognitive assessment and activities of daily living. Ecological validity, however, reflects more than face validity (verisimilitude) and external validity (veridicality). Ecological validity also, and perhaps most important, pertains to questions of which settings of daily life are represented in a particular study design and data collection procedure (Fahrenberg, Myrtek, Pawlik, & Perrez, 2007). Relatively little attention has been paid to this important facet of ecological validity which concerns the nature of the settings in which measurements of cognitive function take place (Timmers et al., 2014).
Traditional approaches to cognitive assessment take place in physical and social environments that are fundamentally dissimilar to the environments in which people perform cognitively demanding tasks in their daily lives. Although laboratory and clinical settings provide the benefit of experimental control over the testing environment, this control provides no assurance that cognitive performance observed in the laboratory is similar to how people would perform in the real world. A classic example of ecological validity concerns is “white-coat hypertension,” which refers to the observation that some people have higher blood pressure readings when taken in clinical settings compared with when measurements are taken in their natural environment. With regard to cognition, performance measured in clinic or laboratory research settings may—for example—be either elevated through social facilitation (e.g., Strauss, 2002) or hampered by unintended evaluative and stereotype threat (e.g., Schmader, Johns, & Forbes, 2008). Regardless of the specific differences, the clinical and laboratory settings in which researchers typically measure cognitive abilities are very unlike the situations in which people typically use those abilities.
The use of mobile technology in cognitive research can address ecological validity concerns through ambulatory assessment of cognitive function in people’s natural environment as they go about their daily lives. Ambulatory assessment can also improve the reliability of measuring individual differences in cognitive function. Traditional approaches to studying cognition typically rely on measurements obtained on a single testing occasion. Such “single-shot” assessments are influenced by both random and systematic within-person variability (e.g., sleep quality, recent exposure to stressors) that is difficult or impossible to control. Aggregating across repeated ambulatory measurements “cancels out” the effects of within-person variability and can improve measurement precision and reliability by estimating an individual’s average level of functioning (Shiffman et al., 2008; Sliwinski, 2008).
In addition to addressing ecological validity and reliability concerns, ambulatory methods also permit frequent cognitive assessments that can be made in conjunction with other measurements of a person’s environment and internal states. Repeated and proper sampling of situations can aid researchers in more accurately characterizing a person’s average level of cognitive performance over the variety of naturalistic contexts in their daily lives (Brunswik, 1943) as well as in exploring dynamic relationships between cognitive performance and other time-varying psychological, social, and biological processes.
There are only a few examples describing the use of ambulatory cognitive assessments in EMA designs. In an early study, Shiffman, Paty, Gnys, Kassel, and Elash (1995) compared the cognitive performance of regular and light smokers under naturally occurring conditions of abstinence and smoking by administering repeated cognitive tests via a handheld computer. In a later study, they used cognitive assessments embedded in an EMA design to examine the time course and natural history of the effects of nicotine withdrawal on cognitive function (Shiffman et al., 2006). Another study used EMA to demonstrate the acute effects of alcohol consumption at ecologically relevant doses on ambulatory working memory performance (Tiplady, Oshinowo, Thomson, & Drummond, 2009). In a study of older adults, Allard et al. (2014) found that ambulatory semantic memory was higher during the hours following engagement in specific intellectual activities such as reading and word puzzles compared with other activities. A recent study combined EMA, ambulatory cognitive assessment and passive physiological monitoring to demonstrate that age moderates the effects of cardiovascular activation on working memory performance (Riediger et al., 2014).
These examples describe how researchers have used naturally occurring behaviors, contexts, and psychological and biological states to predict ambulatory cognitive function. Other applications have used ambulatory cognitive assessments as sensitive indicators of individual differences and predictors of proximal psychological states and behaviors. For example, Allard et al. (2014) found that aggregated scores on an ambulatory semantic memory test were more sensitive indicators of individual differences in hippocampal volume than scores from the same memory test administered by conventional means (single-shot, in the lab). Waters and colleagues have used modified versions of the standard and emotional Stroop task to obtain ambulatory measurements of attentional bias in individuals being treated for substance dependency. They found that reliable Stroop effects could be obtained using mobile-based administration, and that elevated attentional bias, measured on a momentary basis (A. J. Waters & Li, 2008), may predict imminent temptation episodes among recovering addicts (A. J. Waters, Marhe, & Franken, 2012).
Apart from the studies reviewed above, there have been relatively few studies that have used mobile technology (e.g., smartphones) to assess cognitive function via repeated, momentary assessments. There is also a lack of evidence regarding the reliability and validity of scores obtained by brief cognitive assessments administered by smartphones in uncontrolled, naturalistic settings. One notable exception is a recent study by Dirk and Schmiedek (2015), in which they described the reliability and validity of smartphone-based ambulatory assessments of working memory in third- and fourth-grade students. They found that scores averaged across the 1-month period exhibited very high between-person reliability (>0.99) but also that there was evidence of systematic within-person variability across measurement occasions.
The Present Study
The purpose of the present study is to evaluate the reliability and construct validity of a brief battery of ambulatory cognitive tests administered on smartphones. The ambulatory tests were designed to measure two distinct and fundamental cognitive constructs: perceptual speed and working memory. Perceptual speed, which reflects the ability to quickly and accurately compare different types of stimuli (e.g., numbers, symbols, patterns), is important for understanding child and adolescent cognitive development (Rucklidge & Tannock, 2002; Stenneken et al., 2011) and cognitive aging (Salthouse, 2000), and is also predictive of real-world outcomes, such as job performance (Ackerman & Beier, 2007) and vehicular crash risk (Owsley et al., 1998). Working memory ability, which reflects the capacity to maintain information in active memory while simultaneously performing distracting or interfering activities (Just & Carpenter, 1992), is also predictive of developmental, aging, and real-world outcomes (Alloway & Alloway, 2010; Schneider-Garces et al., 2010). One perceptual speed task and two distinct working memory tasks were embedded in a signal-contingent EMA protocol, which involved prompting individuals to complete a brief assessment five times per day for 14 days using study-provided smartphones. Each assessment included some self-report questions (e.g., current mood, location, recent activity), followed by the cognitive tests.
Given the demands on participants imposed by an EMA design, each assessment, and therefore, each cognitive task must be kept very brief. For example, the ambulatory speed test used in the current study consisted of just 12 trials at each assessment, compared with laboratory assessments of speeded performance which may include 100 trials or more. We designed each of the three ambulatory cognitive tasks to be completed in less than 1 minute. The necessary brevity of ambulatory cognitive tests coupled with their unsupervised administration in uncontrolled, naturalistic settings raises legitimate questions about measurement reliability. That is, can scores obtained from brief ambulatory assessments capture systematic (i.e., reliable) sources of variance in cognitive performance?
The precise type of variance that counts as systematic (or error), however, depends on the purpose for which EMA data are being used. One way to use EMA data involves aggregating across measurement occasions in order to characterize between-person individual differences (Shiffman et al., 2008). For this purpose, only systematic between-person differences are of interest, and all sources of within-person variation (i.e., across occasions, across trials within occasions) are considered error. Another use of EMA data involves analyses of within-person variability to identify contextual associations and time-ordered effects across measurement occasions. For this purpose, within-person variance across occasions is of primary interest, and variability within measurement occasions (e.g., across trials) is considered error. Thus, one goal of the present study is to examine whether and to what extent scores obtained on brief, repeated ambulatory assessments capture systematic between-person and within-person (across occasion) variability in cognitive performance.
The second goal of the present study is to determine whether scores obtained from ambulatory cognitive tests measure the same constructs (i.e., perceptual speed and working memory) that are commonly studied in laboratory settings. This goal is especially important for comparing results obtained through ambulatory methods with those obtained through more traditional assessment approaches. Thus, we examined the pattern of correlations between cognitive tasks administered on smartphones in naturalistic settings and a battery of tasks administered in a traditional controlled laboratory setting that are valid indicators of perceptual speed and working memory.
Method
The current investigation is part of a larger project, The Effects of Stress on Cognitive Aging, Physiology and Emotion, a prospective measurement burst study (Nesselroade, 1991; Sliwinski, 2008) that tracks variability and change in daily experiences of stress and cognitive function across four 14-day EMA measurement bursts (Scott et al., 2015). The present study reports data from the smartphone-based and laboratory cognitive tests from the baseline measurement burst.
Participants
Participants were 219 adults (34% men, 66% women) aged 25 to 65 years (M = 46.99, SD = 10.74) recruited using systematic probability sampling of New York City Registered Voter Lists for the zip code 10475 (Bronx, NY). A total of 9% of participants self-identified as non-Hispanic White, 63% as non-Hispanic Black, 17% as Hispanic White, 6% as Hispanic Black, <1% as Asian, and 4% as other. Most of the sample had completed high school or some college (50%) or had a college degree (45%). During the data collection period (2011-2013), 51% of the sample was working, 12% was retired, 26% was unemployed looking for work, and 10% was unemployed not looking for work. The sample displayed marked diversity in income as well: 5% reported an annual income <$4,999, 17% between $5,000 and $19,999, 24% between $20,000 and $39,999, 20% between $40,000 and $59,999, 12% between $60,000 and $79,999, 7% between $80,000 and $99,999, 8% between $100,000 and $149,999, and 1% greater than $150,000 annually; 7% declined to report income. Married persons made up 31% of the sample; 9% was not married but was living with someone, 17% was divorced or separated, 34% was never married, 2% was widowed, and 6% described marital status as other. The sample slightly overrepresents women (66% of the sample compared with 58% of the population), but is representative of the area from which it was obtained with respect to age, race, and income.
Procedure
During recruitment, introductory letters were mailed to individuals from a sampling frame (obtained from the Registered Voter Lists), and a research assistant phoned to establish eligibility and to enroll and consent interested persons. Eligibility criteria were being aged 25 to 65 years, ambulatory, fluent in English, free of visual impairment that would interfere with the ability to use the study smartphone, and a resident of Bronx County. Participants were mailed paper survey batteries assessing demographic and individual difference characteristics, which they completed at home and brought to their first lab visit. At the first visit to the research offices, participants completed 1.5 hours of training on the use of study smartphones to complete surveys.
Beginning the day after their first visit, participants completed 2 days of the EMA protocol and then returned to the lab for a second visit to complete a battery of cognitive tests and to determine compliance with the EMA protocol. The EMA protocol involved carrying study smartphones that were programmed to beep at quasirandom times to alert participants to complete five ambulatory assessments per day. The scheduled interval between beeps was between 2 and 3 hours; the average time between scheduled beeps was 2 hours and 33 minutes. Participants who completed 80% of the EMA protocol were invited to the 14-day EMA study. At the end of 14 days, participants returned phones to the lab. Participants were generally highly adherent to the study protocol, responding to an average of 85% of surveys; most (79%) survey responses were completed within 10 minutes of the beep prompt. Participants who satisfactorily completed the entire study protocol could receive up to $160.
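As a rough illustration of the prompting scheme, the sketch below draws five beep times for one day with gaps of 2 to 3 hours between consecutive beeps. The uniform distribution, the function name `schedule_beeps`, the 9:00 start time, and the date are assumptions made for illustration only; the study phones' actual scheduling algorithm is not described here.

```python
import random
from datetime import datetime, timedelta


def schedule_beeps(first_beep, n_beeps=5, min_gap_min=120, max_gap_min=180, rng=None):
    """Return n_beeps prompt times; each gap is drawn uniformly (an assumption)."""
    if rng is None:
        rng = random.Random()
    times = [first_beep]
    for _ in range(n_beeps - 1):
        gap = rng.uniform(min_gap_min, max_gap_min)  # minutes until the next beep
        times.append(times[-1] + timedelta(minutes=gap))
    return times


# Hypothetical study day starting at 9:00; seeded so the example is repeatable.
day_start = datetime(2012, 1, 9, 9, 0)
beeps = schedule_beeps(day_start, rng=random.Random(0))
gaps = [(later - earlier).total_seconds() / 60 for earlier, later in zip(beeps, beeps[1:])]
for t in beeps:
    print(t.strftime("%H:%M"))
```

Under these assumptions the expected gap is 2.5 hours, close to but not identical with the reported average of 2 hours and 33 minutes.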
Measures
In-Lab Cognitive Assessments
Participants were individually tested by a trained technician in a quiet testing room in a single session. In addition to the tests of perceptual speed and working memory, tests of fluid and crystallized intelligence and associative learning were also administered. Because the present study focuses on the reliability and construct validity of ambulatory perceptual speed and working memory tasks, the present analyses used data from the in-lab perceptual speed and working memory tests. We also examined the correlation of fluid intelligence with in-lab and ambulatory cognitive tests because of its established association with speed and working memory.
Fluid Intelligence
Raven’s Progressive Matrices (Raven, 2000) was administered as a measure of fluid intelligence. Items in Raven’s Progressive Matrices tests consisted of a series of geometrical shapes and participants were asked to guess what the next shape should be among multiple options.
Perceptual Speed
Tasks consisted of a symbol match, a letter match, and a number match task. Each task required participants to make speeded perceptual comparisons among items present in a visual display. The dependent measure for each of the perceptual speed tasks was the median reaction time for correct trials.
Symbol match
Participants saw six symbol pairs at the top of the screen and were presented with a comparison symbol pair at the bottom of the screen. They had to decide as quickly as possible whether the target pair at the bottom of the screen exactly matched one of the pairs at the top of the screen. Participants completed 70 trials of this task.
Letter match
Two letter strings were presented and participants identified whether the strings were the same or different as quickly as possible. Strings were three, six, or nine characters in length and mismatched strings varied by one character. Participants completed 70 trials with one break at trial 35.
Number match
Participants saw two number strings and indicated whether the strings were the same or different as quickly as possible. Strings were three, six, or nine items in length and participants completed 70 trials with a break midway through the test.
Working Memory
Working memory tasks consisted of three tasks that previous work has demonstrated are valid indicators of working memory capacity: operation span, counting span (Conway, Cowan, Bunting, Therriault, & Minkoff, 2002), and a backward letter span task (G. S. Waters & Caplan, 2003).
Counting span
The counting span task was administered following procedures described in Engle et al. (1999). Participants were instructed to memorize the number of targets in a series of displays that included dark blue circles (targets) and dark blue squares (color distractors) and light blue circles (shape distractors). Each display was initiated by the tester. Participants counted the number of dark blue circles aloud and repeated the digit corresponding to the final count. The number of targets per display varied from two to six, with three trials of each. After two to six displays, a recall cue was presented, at which point participants reported the number of targets in each of the previous displays, in the serial order in which they occurred. The dependent measure was the total number of target counts recalled in the correct order.
Operation span
In the operation span task, participants verified equations aloud while trying to remember letters. As with the counting span, after a series of equations and letters was presented, participants were prompted to recall all the letters from that series. The number of equation and letter pairs per series varied from two to five, with three series of each length presented in a fixed random order. The dependent measure was the total number of letters recalled in the correct order.
Backward span
Participants saw a series of letters presented one at a time for 1 second each. At the end of each series, participants recalled all of the letters they saw in reverse order. The number of letters in each series varied from 3 to 8 and participants attempted 2 trials at each length for a total of 12 trials. The dependent measure was the total number of the items for trials that were recalled in the correct order.
Ambulatory Cognitive Assessments
Each ambulatory assessment consisted of a brief survey that assessed affective state and stress, as well as physical/social activities at the time of the prompt. After completing each survey, participants performed three ambulatory cognitive tasks in the following fixed order: symbol search, dot memory, and n-back. The ambulatory cognitive tasks were administered on a Droid X smartphone, which has a 4.3” display (480 × 854 pixels) and a 60 Hz refresh rate; response times were recorded in milliseconds.
Symbol Search
On each trial of the symbol search task, participants saw a row of three symbol pairs at the top of the screen and were presented with two symbol pairs at the bottom of the screen. Stimuli were presented until a response was provided, and there was an interval of 200 msec between each response and the following stimulus. Participants decided, as quickly as possible, which of the two pairs presented at the bottom of the screen was among the pairs at the top of the screen (see Figure 1). Participants completed 12 trials of this task. The dependent variable was the median response time of correct trials. Because this task requires speeded comparisons similar to standard laboratory tests, we reasoned it would be a viable indicator of perceptual speed.
Figure 1. Example of the symbol search test.
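The dependent variable for the symbol search task (the median response time of correct trials within one 12-trial assessment) can be sketched as follows. The trial representation as `(rt_ms, correct)` tuples, the function name, and the example values are hypothetical; handling of an assessment with no correct trials is also an assumption.

```python
from statistics import median


def median_correct_rt(trials):
    """Median response time (ms) over correct trials; None if no trial was correct."""
    correct_rts = [rt for rt, correct in trials if correct]
    return median(correct_rts) if correct_rts else None


# Hypothetical 12-trial assessment: (response time in ms, response correct?)
trials = [
    (1450, True), (1620, True), (980, False), (1710, True),
    (1530, True), (2240, True), (1390, True), (1180, False),
    (1875, True), (1600, True), (1495, True), (1760, True),
]
print(median_correct_rt(trials))  # 1610.0 (mean of the two middle correct RTs)
```

Using the median rather than the mean limits the influence of occasional very slow responses, which are plausible when testing is unsupervised.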
Dot Memory
Each trial of the dot memory task consisted of 3 phases: encoding, distraction, and retrieval (see Figure 2). During the encoding phase, the participant was asked to remember the locations of three red dots that appeared on a 5 × 5 grid. After a 3-second study period, the grid was removed and the distraction phase began, during which the participant was required to locate and touch Fs among an array of Es. After the participant performed the distraction task for 8 seconds, an empty 5 × 5 grid reappeared on the screen and participants were then prompted to recall the locations of the 3 dots presented initially and press a button labeled “Done” after entering their responses to complete the trial. Participants completed 2 trials (encoding, distractor, retrieval) with a 1-second delay between trials. The dependent variable was an error score with partial credit given based on the deviation from the correct positions. If all dots were recalled in their correct location, the participant received a score of zero. In the case of one or more retrieval errors, the Euclidean distance from the location of the incorrect dot to the correct grid location was calculated, with higher scores indicating less accurate placement and poorer performance (Siedlecki, 2007).
Figure 2. Example of the dot memory test: (a) The participants were asked to remember the locations of three red dots. (b) After the grid was removed, participants were required to touch Fs among Es for 8 seconds. (c) An empty 5 × 5 grid reappeared on the screen and participants were prompted to recall the locations of the initial three dots.
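The Euclidean-distance error score for the dot memory task can be illustrated with a minimal sketch. The text does not say how each recalled dot is paired with a target dot, so the sketch assumes the pairing that minimizes total distance; that rule, the `(row, column)` cell coordinates, and the function name are assumptions rather than the scoring procedure actually used.

```python
from itertools import permutations
from math import hypot


def dot_memory_error(targets, responses):
    """Summed Euclidean distance between targets and responses under the
    best one-to-one pairing (assumed matching rule); 0.0 means perfect recall."""
    if len(targets) != len(responses):
        raise ValueError("targets and responses must have the same number of dots")
    return min(
        sum(hypot(tr - rr, tc - rc) for (tr, tc), (rr, rc) in zip(targets, perm))
        for perm in permutations(responses)
    )


# Hypothetical trial on a 5 x 5 grid, cells given as (row, column).
targets = [(0, 1), (2, 3), (4, 4)]
perfect = [(4, 4), (0, 1), (2, 3)]   # same cells, entered in a different order
one_off = [(0, 1), (2, 3), (4, 3)]   # one dot placed one cell to the left

print(dot_memory_error(targets, perfect))  # 0.0
print(dot_memory_error(targets, one_off))  # 1.0
```

With only three dots there are six possible pairings, so the exhaustive search is inexpensive.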
The rationale for our use of this task as an indicator of working memory has both an empirical and theoretical basis. Previous research (Miyake, Friedman, Rettinger, Shah, & Hegarty, 2001) has demonstrated that a similar dot memory task loaded on a factor representing working memory. The authors of this study reasoned that the spatial dot memory task placed high demands on controlled attention—a hallmark of working memory tasks. Indeed, individual differences in working memory capacity arise “in situations where information needs to be actively maintained or when a controlled search of memory is required” (Unsworth & Engle, 2007, p. 123). The ambulatory dot memory task satisfies this requirement by using an interference task to prevent rehearsal and produce interference with encoded locations, which creates the demand for active maintenance and controlled retrieval of previously encoded location during the recall phase.
n-Back
Participants saw a series of three standard playing cards slide from one box on the right of the screen to the second box on the left of the screen. In the practice phase, cards were facing up and participants were asked to determine if the target card in the rightmost box matched the test card in the leftmost box. After each response, there was a 500 msec delay after which the cards shifted positions from right to left, with a new target card appearing in the rightmost box (see Figure 3). After 10 practice trials, participants started the 2-back condition in which each new target card would turn face down prior to shifting positions, requiring participants to retain the card’s identity in working memory. As the cards shifted positions, participants were asked to determine if the current target card matched the facedown card they saw two trials back while retaining the identity of the previous target card. Feedback on response errors was provided by turning all cards face up. Participants completed 12 trials and the dependent variable was the proportion of correct responses. We selected the n-back because of its widespread use as an indicator of working memory (Schmiedek, Lövdén, & Lindenberger, 2014).
Figure 3. Example of n-back test: (a) In the practice session, participants were asked to determine whether the two cards matched. (b) Cards shifted position from right to left, with each new target card turning facedown. (c) Participants were asked to determine if the current target card matched the facedown card they saw two trials back.
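The n-back dependent variable (proportion of correct responses over 12 trials) can be sketched with a simplified scoring function. It ignores the card animation and error feedback and only compares each match/non-match judgement against the 2-back rule; the card labels, the response coding, and the function name are hypothetical.

```python
def two_back_accuracy(cards, responses, n=2):
    """Proportion of correct match/non-match judgements.

    cards: sequence of card labels in presentation order.
    responses: responses[i] is True if the participant judged cards[i + n]
    to match the card shown n positions earlier.
    """
    expected = [cards[i] == cards[i - n] for i in range(n, len(cards))]
    if not expected or len(responses) != len(expected):
        raise ValueError("need exactly one response per scored trial")
    hits = sum(resp == exp for resp, exp in zip(responses, expected))
    return hits / len(expected)


# Hypothetical sequence: 14 cards yield 12 scored 2-back trials.
cards = ["7H", "2S", "7H", "KD", "7H", "KD", "3C",
         "KD", "3C", "9S", "3C", "QH", "5D", "QH"]
responses = [True, False, True, False, False, True,
             True, False, True, False, True, True]   # two errors (trials 4 and 11)
print(round(two_back_accuracy(cards, responses), 3))  # 0.833
```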
Results
Descriptive Statistics
Table 1 shows the means and standard deviations for each of the ambulatory and in-lab cognitive tasks. Because both age and fluid intelligence are related to working memory and perceptual speed, we also examined correlations of task performance with age and performance on the Raven’s Matrices. The correlations between age and two of the three ambulatory tasks (dot memory and symbol search) were comparable to the age correlations with the in-lab tasks of working memory and perceptual speed. Specifically, older age was associated with higher error scores on the ambulatory dot memory task (r = .16, p = .02) and worse performance on the in-lab counting span (r = −.12, p = .09), operation span (r = −.19, p < .01), and backward span (r = −.16, p = .02) tasks. Age was also correlated with slower reaction times on the ambulatory symbol search (r = .35, p < .01), as well as with the in-lab letter match (r = .35, p < .01), number match (r = .29, p < .01), and symbol match (r = .36, p < .01) tasks. Age was not significantly correlated with the accuracy on the ambulatory n-back task (r = −.04, ns). Higher scores on the Raven’s test were associated with better performance on both the ambulatory dot memory (r = −.42, p < .01) and n-back tasks (r = .47, p < .01) and faster response times on the ambulatory symbol search task (r = −.50, p < .01). These correlations were comparable in magnitude to the in-lab working memory (rs ranged from .27 to .38) and speed (rs ranged from −.41 to −.43) tasks.
Table 1. Descriptive Statistics.
Note. N = 219. Ambulatory tasks indicated by bold text.
Pearson correlation coefficients: *p < .05. **p < .01. ***p < .001. (b) Raven’s Progressive Matrices. (c) Unit: seconds. (d) Unit: Euclidean distance. (e) Unit: number of correct responses.
Reliability Analysis
Between-Person Reliability
Between-person reliability refers to the proportion of total variance in scores attributable to differences between individuals. Aggregation of EMA scores may boost reliability (due to averaging) for measuring individual differences to levels that exceed typical “single-shot” measurement approaches (Shiffman et al., 2008). We used SAS PROC MIXED to estimate multilevel mixed models (MLMs) in order to compute the between-person (BP) reliability of scores on the ambulatory cognitive tests using the following formula (Raykov & Marcoulides, 2006):
$$\text{Reliability}_{BP} = \frac{\mathrm{Var}(BP)}{\mathrm{Var}(BP) + \mathrm{Var}(WP)/n}$$
where Var(BP) is the total variance in scores that is between persons, Var(WP) is the total variance in scores that is within persons, and n is the number of assessments. To determine the reliability of a test based on a single measurement, n would equal 1, which defines one type of intraclass correlation (ICC). The ICC is the expected correlation between two randomly sampled measurements from the same person, and hence reflects the stability of measurements. To determine the reliability of aggregated scores based on the average of multiple measurements, n would be set equal to the number of observations on which the average is based.
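The between-person reliability computation can be sketched in a few lines, assuming the form Var(BP) / (Var(BP) + Var(WP)/n), which reduces to the ICC when n = 1. The function name is hypothetical. The example expresses the variance components on a standardized scale (Var(BP) = ICC, Var(WP) = 1 − ICC) and uses the ICCs and the n = 56 assessments reported for the three ambulatory tasks.

```python
def between_person_reliability(var_bp, var_wp, n=1):
    """Reliability of a score averaged over n assessments.

    With n = 1 this is the intraclass correlation (ICC).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return var_bp / (var_bp + var_wp / n)


# Reported ICCs; on a standardized scale Var(BP) = ICC and Var(WP) = 1 - ICC.
reported_iccs = {"dot memory": 0.39, "symbol search": 0.54, "n-back": 0.59}
for task, icc in reported_iccs.items():
    single = between_person_reliability(icc, 1 - icc, n=1)
    aggregated = between_person_reliability(icc, 1 - icc, n=56)
    print(f"{task}: single = {single:.2f}, average of 56 = {aggregated:.3f}")
```

Under these assumptions the aggregated reliabilities come out at roughly 0.973, 0.985, and 0.988, in line with the reported 0.97 and ≥0.98.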
Table 2 shows the between- and within-person variance and resulting reliabilities produced by fitting unconditional MLMs using restricted maximum likelihood to each of the ambulatory tasks. The ICCs were 0.39 for the dot memory task, 0.54 for the symbol search task, and 0.59 for the n-back. The ICCs indicate that between 39% and 59% of the variance in performance across the three ambulatory tasks was between persons. The reliabilities of average scores aggregated across all the ambulatory assessments were exceptionally high: 0.97 for the dot memory task and ≥0.98 for both the symbol search and n-back tasks. These reliabilities are based on scores that reflect the average of 56 (80%) of the ambulatory assessments, which was the approximate average number of assessments participants completed in this study.
Table 2. Reliabilities for Individual and Aggregated Ambulatory Cognitive Test Scores.
Note. ICC = intraclass correlation. There are five assessment occasions per day, so 1 day reflects 5 assessments, 2 days reflect 10 assessments, and so on.
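To make the variance decomposition concrete, the sketch below estimates between- and within-person variance with a one-way random-effects ANOVA (method of moments). This is a simpler stand-in, not the restricted maximum likelihood multilevel models the authors fit in SAS PROC MIXED, and it assumes balanced data (the same number of scores per person), which real EMA data rarely are. The data and names are hypothetical.

```python
from statistics import mean


def variance_components(scores_by_person):
    """Return (var_bp, var_wp) from balanced person -> scores data
    using one-way random-effects ANOVA estimators."""
    groups = list(scores_by_person.values())
    p = len(groups)                      # number of persons
    k = len(groups[0]) if groups else 0  # scores per person
    if p < 2 or k < 2 or any(len(g) != k for g in groups):
        raise ValueError("need at least 2 persons with the same number (>= 2) of scores")
    person_means = [mean(g) for g in groups]
    grand_mean = mean(person_means)
    ms_between = k * sum((m - grand_mean) ** 2 for m in person_means) / (p - 1)
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, person_means) for x in g
    ) / (p * (k - 1))
    var_wp = ms_within
    var_bp = max((ms_between - ms_within) / k, 0.0)  # truncate negative estimates at 0
    return var_bp, var_wp


# Hypothetical scores: 3 persons x 4 assessments.
data = {"p1": [10, 12, 11, 13], "p2": [20, 22, 21, 23], "p3": [15, 17, 16, 18]}
var_bp, var_wp = variance_components(data)
icc = var_bp / (var_bp + var_wp)
print(f"Var(BP) = {var_bp:.2f}, Var(WP) = {var_wp:.2f}, ICC = {icc:.2f}")
```

For unbalanced data or models with covariates, a mixed-model routine is the appropriate tool.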
We next conducted a series of follow-up analyses to determine how many days of ambulatory assessment would be required to obtain reliabilities of aggregated scores that exceeded 0.80 and 0.90. To do this, we fit a sequence of unconditional MLMs and calculated reliabilities of average scores using only observations from the first study day, from the first 2 study days, and from the first 3 study days. The bottom of Table 2 displays these results, indicating that 2 days of ambulatory assessments (10 total assessments) are required to attain a reliability >0.80 for the dot memory task, and 3 days of assessments (15 total assessments) are required for aggregated scores to attain a reliability of 0.90. Averaging across a single day of 5 assessments produced scores with reliabilities ≥0.90 for the symbol search and n-back tasks.
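A complementary way to think about this question is to solve the reliability expression for n. Assuming reliability = ICC / (ICC + (1 − ICC)/n), the smallest n reaching a target R satisfies n ≥ [R / (1 − R)] × [(1 − ICC) / ICC]. The sketch below applies this projection; it is not the authors' procedure, which refit the models to the first 1, 2, and 3 study days, so variance components estimated on those subsets can give different answers. Function names are hypothetical.

```python
from math import ceil


def assessments_needed(icc, target):
    """Smallest number of averaged assessments whose projected reliability reaches target."""
    if not (0 < icc < 1 and 0 < target < 1):
        raise ValueError("icc and target must lie strictly between 0 and 1")
    return ceil((target / (1 - target)) * ((1 - icc) / icc))


def days_needed(icc, target, per_day=5):
    """Whole study days required at per_day assessments per day."""
    return ceil(assessments_needed(icc, target) / per_day)


# Dot memory task, using its reported single-assessment ICC of 0.39.
for target in (0.80, 0.90):
    print(target, assessments_needed(0.39, target), days_needed(0.39, target))
```

For the dot memory task the projection gives 7 assessments (2 days) for a reliability of 0.80 and 15 assessments (3 days) for 0.90, which agrees with the number of days reported from the refitted models.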
Because our sample contained people who varied widely in age (25-65), and because age is related to cognitive function, we next conducted a sensitivity analysis to examine whether this age heterogeneity influenced the reliability results. First, we recalculated the ICCs and reliability of the average scores after statistically partialling for age. The age partial ICC for the symbol search task (0.50) was slightly lower compared with the unconditional ICC reported in Table 2. The ICCs for the dot memory and the n-back task were not changed by statistically partialling for age. The reliabilities for the average scores aggregated across the 14 days of EMA remained above 0.96 for all three tasks after partialling for age. Second, we categorized individuals into four equal width age bins (age 25-34, 35-44, 45-54, 55+) and conducted an age-stratified analysis to determine whether the tests exhibited similar patterns of between-person and within-person variability across ages. Table 3 indicates the ICCs exhibited an increasing trend for older ages, particularly for the symbol search and n-back tasks. Inspection of the variance components from the stratified analyses indicated that this was primarily due to more between-person heterogeneity among older compared with younger participants. For all age strata, reliability exceeded 0.85 for scores aggregated across 3 days, and 0.95 for scores aggregated across 14 days.
Table 3. Between-Person Reliabilities for Ambulatory Cognitive Test Scores by Age Group.
Note. ICC = intraclass correlation. Values in the table indicate reliabilities for a single assessment (ICCs) and average scores that are aggregated across either first 3 or entire 14 days of ambulatory assessments.
Next, we examined whether reliability remained constant across the 2-week ambulatory assessment period. Separate unconditional MLMs were fit to the data from the five ambulatory assessments made on each of the 14 study days. Then, reliabilities of the average scores for each day were calculated for each of the three ambulatory tasks, and plotted in Figure 4. Reliabilities were very stable across the 14 days: The standard deviation of daily reliabilities was 0.01 for the symbol search task and 0.02 for both the dot memory and n-back tasks. Thus, increased practice with the tasks did not alter the reliability of aggregated scores.

Figure 4. The between-person reliability of average scores obtained on each of the days of ambulatory assessments.
Within-Person Reliability
Next, we examined whether there was evidence of systematic within-person variation in cognitive performance across assessments using an approach similar to other studies of within-person cognitive variability (Brose, Schmiedek, Lövdén, & Lindenberger, 2012; Schmiedek, Lövdén, & Lindenberger, 2013; Sliwinski, Smyth, Hofer, & Stawski, 2006). This examination determined whether within-person differences in cognitive performance from one occasion to the next reflect systematic variance by decomposing within-person variation into two sources: across occasions and within occasions. Because our interest is in across-occasion variability, this source of variability is considered systematic and within-person variability within occasions (i.e., across trials) is considered error. The logic for quantifying the proportion of systematic within-person variance across assessment occasions is analogous to estimation of between-person reliability shown in Equation (1):
$$\text{Within-person reliability} = \frac{\mathrm{Var}(WP_{\text{occasion}})}{\mathrm{Var}(WP_{\text{occasion}}) + \mathrm{Var}(WP_{\text{trial}})/i} \qquad (2)$$
where Var(WPoccasion) is the within-person variance in scores across assessment occasions, Var(WPtrial) is the within-person variance in scores across trials (within occasions), and i is the number of trials. The symbol search task consists of i = 12 trials, and the dot memory task consists of i = 2 trials. As the n-back task consists of 12 binary trials (1 = correct, 0 = incorrect), we followed a procedure used by Brose et al. (2012): we calculated the proportion correct separately for odd and even trials and set i = 2 (scores from odd and even trials) for this task. We used SAS VARCOMP to quantify systematic within-person reliability by partitioning the within-person variance into variability across assessment occasions and variability across trials within occasions for each of the three ambulatory cognitive tests as described by Cranford et al. (2006, p. 925, Formula 5). Results indicated within-person reliabilities of 0.53 for the symbol search task, 0.50 for the dot memory task, and 0.41 for the n-back task.
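The odd/even split used for the n-back task can be illustrated with a short sketch. This is not the authors' code: the function name and the example response vector are hypothetical, and the only assumption taken from the text is that each occasion yields 12 binary accuracy scores that are reduced to two proportion-correct scores.

```python
def odd_even_scores(trials):
    """Return (odd, even) proportion correct for one n-back occasion.

    `trials` holds 0/1 accuracy codes in presentation order; "odd" means
    the 1st, 3rd, 5th, ... trial and "even" the 2nd, 4th, 6th, ... trial.
    """
    if len(trials) < 2:
        raise ValueError("need at least two trials to form both halves")
    if any(t not in (0, 1) for t in trials):
        raise ValueError("trials must be coded 0 (incorrect) or 1 (correct)")
    odd = trials[0::2]   # positions 1, 3, 5, ...
    even = trials[1::2]  # positions 2, 4, 6, ...
    return sum(odd) / len(odd), sum(even) / len(even)


if __name__ == "__main__":
    # Hypothetical responses for a single 12-trial occasion.
    occasion = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1]
    print(odd_even_scores(occasion))
```

The two half scores then play the role of the i = 2 "trials" within an occasion when the within-person variance is partitioned.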
We next conducted a sensitivity analysis to determine whether within-person reliabilities were similar across age. Table 4 shows that the within-person reliabilities for the symbol search were constant across age strata, but that the within-person reliabilities for both the dot memory and n-back tasks were slightly lower for the older age strata. Finally, we examined whether within-person reliabilities remained constant across the duration of the study or whether reliabilities changed with repeated practice on the ambulatory tasks. To accomplish this, we estimated within-person reliabilities separately for each of the 14 days of the study, based on the five daily assessments obtained on each of those days. As this approach treats each study day separately, the resulting reliabilities (displayed in Figure 5) only reflect systematic sources of within-person variance that transpire within a given day, and do not reflect systematic sources of within-person variability that transpire across days. Therefore, these study day–specific reliabilities will tend to be somewhat lower than reliabilities calculated from the entire data series, which reflect both within-day and across-day sources of systematic within-person variance. Inspection of Figure 5 indicates that the within-person reliabilities for the symbol search and dot memory tasks were relatively stable across study days: The standard deviation of within-person reliabilities was 0.04 for both the symbol search and dot memory tasks. The within-person reliabilities for the n-back were somewhat more variable (SD = 0.07). Inspection of the data indicated that the within-person reliabilities for the n-back were much more variable for the first several days compared with the remaining days: Excluding the data from the first 3 days reduced the SD of within-person reliabilities for the n-back to 0.04.
Table 4. Within-Person Reliabilities for Ambulatory Cognitive Test Scores by Age Group.

Figure 5. The within-person reliability estimated separately for each of the 14 days of ambulatory assessments.
Validity Analysis
Construct validity of the ambulatory tasks was assessed by first examining the pattern of correlations among the ambulatory and in-lab tasks, and then by conducting a confirmatory factor analysis (CFA) to assess formally the fit of our hypothesized measurement model. Table 5 shows the intercorrelations among all the ambulatory and in-lab cognitive tasks. The ambulatory symbol search task was correlated strongly with all of the in-lab speed tasks (rs from .61 to .74). The ambulatory dot memory task correlated significantly with in-lab working memory tasks (rs from −.39 to −.45), which were similar in magnitude to the correlations between the in-lab backward span task and operation span (r = .42) and counting span tasks (r = .48). The correlation between the operation and counting span tasks was higher (r = .60) likely because of shared method variance (i.e., they are both complex span tasks). The ambulatory n-back task exhibited significant correlations with in-lab working memory tasks (rs from .24 to .36) that were similar in magnitude to its correlations with in-lab speed tasks (rs from −.29 to −.34). These patterns of correlations are not surprising, given that the task demands of the n-back involve aspects of working memory (e.g., memory updating) but also speeded perceptual comparisons.
Table 5. Intercorrelations Among Ambulatory and In-Lab Cognitive Tasks.
Note. Pearson correlation coefficients: All correlations were significant, p < .01. Ambulatory tasks are in bold text. Scores on the n-back and the three complex span tasks reflect accuracy (higher values = better performance); the dot memory task reflects an error score (higher values = poorer performance). Scores on the Symbol search and the three matching tasks reflect response time (higher values = poorer performance).
We next fit a CFA using Mplus (Version 7.2) software to provide a formal evaluation of our hypothesized measurement model. The model specified two correlated latent variables: (a) a working memory factor, indicated by three in-lab tasks (operation span, counting span, backward span) and two ambulatory tasks (dot memory and n-back) and (b) a perceptual speed factor, indicated by three in-lab tasks (number, letter, and symbol match) and one ambulatory task (symbol search). The initial model had unacceptable fit (χ²[26] = 94.94, degrees of freedom [df] = 26, p < .001; comparative fit index [CFI] = .925; root mean square error of approximation [RMSEA] = .110, 90% confidence interval [CI] [.087, .134]; standardized root mean square residual [SRMR] = .057), and examination of modification indices suggested that model fit would be improved by adding covariances between pairs of indicators that shared common method variance. Specifically, we added a covariance between the operation and counting span variables (because they are both complex span tasks) and between the in-lab symbol match and the ambulatory symbol search (because they shared a common stimulus set). The resulting model, shown in Figure 6, had acceptable fit (χ²[24] = 45.82, p = .005; CFI = .976; RMSEA = .063, 90% CI [.035, .092]; SRMR = .045). Standardized regression weights were all significant and >|.56| for the working memory factor and >|.66| for the speed factor. The correlation between the two factors (−.53) is consistent with correlations between speed and working memory reported in the literature (Verhaeghen & Salthouse, 1997). Adding cross-factor loadings did not improve model fit, indicating good discriminant validity.

Figure 6. Confirmatory factor model and standard coefficients.
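The degrees of freedom reported for the two CFA models can be recovered by counting sample moments and free parameters. The tally below is a sketch that assumes a standard identification (factor variances fixed to 1, or equivalently one loading per factor fixed with the factor variances freed); the symbols p and q are introduced here for illustration only.

```latex
\[
\begin{aligned}
\text{Unique variances and covariances:}\quad
  & \frac{p(p+1)}{2} = \frac{9 \times 10}{2} = 45 \qquad (p = 9 \text{ indicators}) \\
\text{Free parameters, initial model:}\quad
  & q = \underbrace{9}_{\text{loadings}}
      + \underbrace{9}_{\text{residual variances}}
      + \underbrace{1}_{\text{factor covariance}} = 19 \\
\text{Initial model:}\quad
  & df = 45 - 19 = 26 \\
\text{Modified model (two residual covariances added):}\quad
  & df = 45 - 21 = 24
\end{aligned}
\]
```

Including a mean structure adds nine observed means and nine freely estimated intercepts, so the degrees of freedom are unchanged.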
We conducted two follow-up analyses to examine the sensitivity of these results to the effects of age. That is, because our sample varies considerably in age and because age is related to cognitive performance, it is possible that the correlations among the tests are driven by age effects. Therefore, we first reran the CFA partialling for age to determine the extent to which the factor loadings may have been driven by shared age differences in memory and speed. Specifically, we fit a CFA identical to the model depicted in Figure 6, but with each indicator variable additionally regressed on age, so that the factor loadings reflected effects residualized for individual differences in age. The residualized factor loadings on the working memory factor (dot memory = −.68, n-back = .57, counting span = .59, operation span = .54, and backward span = .66) and the speed factor (symbol search = .57, symbol match = .78, letter match = .83, number match = .87) remained high and statistically significant and were relatively unchanged compared with their loadings from the unadjusted model (see Figure 6). The fit for the age-partial CFA was also acceptable (χ²[23] = 43.24, p < .01; CFI = .979; RMSEA = .063, 90% CI [.033, .092]; SRMR = .040). These results indicate that the pattern of correlations between the ambulatory and in-lab cognitive tests is not a by-product of the age heterogeneity of the sample.
Next, we examined whether the CFA factor model applied to both the younger and older members of the sample. Accordingly, we split the sample into two subsamples at the median age (age 48) and tested for factorial age invariance between the younger and older subsamples. We compared the fit of a sequence of two-group confirmatory factor models that stipulated configural (same factor structure), metric (equal factor loadings), scalar (equal loadings, equal intercepts), and strict (equal loadings, equal intercepts, equal residual variances) invariance between the younger and older subsamples (Yoon & Millsap, 2007). Table 6 displays the results from these models. Model fit was not significantly worse for increasingly stringent invariance models, and strict invariance held between the younger and older subsamples.
Table 6. Model Fit Statistics for Factorial Age Invariance Models.
Note. df = degrees of freedom; CFI = comparative fit index; RMSEA = root mean square error of approximation; RMSEA 90% CI = 90% confidence interval for the RMSEA.
Discussion
Findings from the current study contribute to the literature by demonstrating both the reliability and construct validity of brief ambulatory cognitive assessments conducted by smartphones in people’s natural environments. Specifically, there were three main findings: (a) between-person reliabilities of ambulatory cognitive test scores averaged across the 14-day EMA protocol were ≥0.97, (b) within-person reliabilities (0.41-0.53) indicated the ambulatory tests captured systematic fluctuations in memory and speeded performance across occasions, and (c) performance on the ambulatory tests of memory and speed was correlated with in-lab cognitive assessments of the same constructs. In addition, supplementary analyses indicated that these findings were consistent across the duration of the study protocol (e.g., robust to practice effects) and invariant across age. We now discuss the implications of these findings and how they extend previous work on ambulatory cognitive assessments, as well as some important limitations and directions for future research.
Reliability of Ambulatory Cognitive Assessments
Between-Person Reliability
Our findings describe two different types of reliability: between-person and within-person reliability. With regard to between-person reliability, all three of the ambulatory cognitive tests produced highly reliable average scores (≥0.97), consistent with reliability values obtained from ambulatory working memory tests administered to school-aged children (Dirk & Schmiedek, 2015). The results also demonstrated that good between-person reliabilities can be obtained with fewer than 14 days of ambulatory assessments. For example, averaging across assessments from a single day produced a reliability of 0.75 for the dot memory task, and reliabilities ≥0.90 for the symbol search and n-back tasks. Averaging across the first 3 days produced scores with reliabilities of ≥0.90 for all the ambulatory cognitive tasks. The boost to reliabilities obtained by averaging across repeated assessments would also apply to cognitive tests repeatedly administered in more conventional laboratory or clinic settings. Obtaining many repeated measurements in a controlled laboratory setting that are distributed across a relatively narrow interval (e.g., within a day or week) would, however, present a number of logistical challenges, including personnel, space, and time constraints. Using mobile technology, such as smartphones, to obtain repeated assessments can increase the feasibility of conducting intensive longitudinal studies of cognitive function.
It is important to recognize, however, that these reliability values are not a property of a test per se, but depend on a specific assessment procedure. The reliability values provided for standard neuropsychological tests presume that those tests are administered by a trained tester and according to prescribed procedures. Similarly, the reliability values provided in Table 2 presume a certain set of procedures, which included pretraining of the participants on the tasks, as well as the specific assessment protocol, which distributed five measurements across multiple times of day for a 2-week period.
The reliability values reported in Table 2 could be used for planning purposes to determine the number of assessments required to obtain reliable cognitive scores; however, researchers must also consider the frequency and the time period across which measurements are distributed. This has significant implications for planning assessment protocols because administering 20 trials of the dot memory task, for example, on a single assessment occasion may not yield reliability equivalent to that of scores based on the same number of trials distributed across multiple assessments. Similarly, distributing 20 trials across five assessments in a single day may produce different results than distributing those 20 trials across a week. Scores based on the average of multiple temporally distributed assessments may boost reliability because they “average out” the effects of state-based influences, such as fatigue and stress, as well as other sources of natural variation that could affect performance on any given measurement occasion (Allard et al., 2014). There is still relatively little research on the psychometric properties of ambulatory cognitive assessments, and insufficient data to make clear-cut recommendations as to whether researchers should favor a few lengthy assessments (with many trials) or multiple brief and temporally distributed cognitive assessments.
Within-Person Reliability
The issue of planning the temporal spacing of assessments becomes even more complex if one is interested in studying within-person variability in cognitive performance. Decisions about how often, how frequently, and over what time period to measure cognitive function should take into account the cadence of the underlying time-varying processes that may drive fluctuations in function (Ram & Gerstorf, 2009; Sliwinski, 2011; Sliwinski, Almeida, Smyth, & Stawski, 2009). For example, if most of the systematic within-person cognitive variability occurs over a relatively fast time scale (e.g., across hours), then a given ambulatory cognitive test may exhibit lower within-person reliability if assessments are made only once per day than if assessments are made multiple times per day (Schmiedek et al., 2013). Our results showed that the proportion of systematic within-person variance that transpired across occasions (separated by a few hours) relative to within-person variability transpiring within occasions varied between 0.41 and 0.53, depending on the task.
Very little work has examined within-person reliabilities of cognitive assessments. A few studies have done so in adult samples using measurements made in laboratory settings (Brose, Lövdén, & Schmiedek, 2014; Sliwinski et al., 2006) and in children assessed in the classroom and in their homes (Dirk & Schmiedek, 2015). The present results extend this research in two ways. First, our study examined the reliability of within-person fluctuations in ambulatory cognitive performance among adults (ages 25-65). And second, the ambulatory assessments were made at unpredictable (i.e., pseudorandom) times and took place in random locations and situations. Thus, the present study demonstrates the feasibility of embedding cognitive tests in EMA protocols for studying the systematic effects of time-varying and contextual influences on cognitive function among adults measured in naturalistic settings.
That said, the within-person reliabilities (0.41-0.53) were lower than the between-person reliabilities (>0.95 for all tasks). For comparison, previous laboratory-based intensive repeated measurement studies have reported within-person reliabilities in the range from 0.50 to 0.70 for response time tasks (Brose et al., 2014; Sliwinski et al., 2006) and in the range from 0.15 to 0.36 for accuracy measures (Brose et al., 2014). In an ambulatory study, Dirk and Schmiedek (2015) reported within-person reliabilities for working memory tasks (accuracy scores) in the range from 0.58 to 0.70 for third- and fourth-grade students assessed in the classroom and in their homes. Although the within-person reliabilities of the present set of brief ambulatory cognitive tasks are comparable to previously reported values from other tests, these values raise the question of whether the within-person reliabilities of cognitive tests are sufficient to detect the effects of contextual and time-varying variables, such as stress, mood or fatigue, on performance.
Indeed, a number of studies have reported significant effects of time-varying variables on dependent measures with low to modest within-person reliability. For example, the negative affect scale used in the National Study of Daily Experiences, which is part of the Midlife in the United States study, has a within-person reliability of only 0.55 (Charles, Piazza, Mogle, Sliwinski, & Almeida, 2013), yet there are dozens of publications from this data set showing detectable time-varying influences on negative affect. Other studies reporting low to modest (0.18-0.70) within-person reliabilities of cognitive measures have been able to detect time-varying effects of mood, motivation, and stress (Brose et al., 2014, 2012; Sliwinski et al., 2006). However, the effects involving lower reliability measures will be attenuated compared with effects involving measures with higher reliability. This attenuation decreases statistical power for detecting an expected effect for a given sample size and design. Researchers planning future studies using ambulatory cognitive assessments should power their studies to detect within-person effects that are much smaller than the expected effect sizes. Lower within-person reliabilities, however, may be tolerable for intensive repeated measures studies (e.g., daily diary and EMA designs) because these designs have many observations, which can partially offset the effects of unreliability.
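The size of this attenuation can be illustrated with the classical correction-for-attenuation formula. The worked example below is only a sketch: it assumes classical test theory with uncorrelated measurement errors, a perfectly reliable predictor, and a hypothetical true within-person correlation of .30, none of which is estimated in the present study. The outcome reliabilities are the lowest and highest within-person values reported here (0.41 and 0.53).

```latex
\[
r_{\text{observed}} = r_{\text{true}} \sqrt{\rho_{XX'}\,\rho_{YY'}}
\]
\[
\begin{aligned}
\rho_{YY'} = 0.41:&\quad r_{\text{observed}} = .30 \times \sqrt{1 \times 0.41} \approx .19 \\
\rho_{YY'} = 0.53:&\quad r_{\text{observed}} = .30 \times \sqrt{1 \times 0.53} \approx .22
\end{aligned}
\]
```

Under these assumptions, an effect that would be .30 with a perfectly reliable outcome shrinks by roughly a quarter to a third, which is why power calculations should target the attenuated rather than the true effect size.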
One obvious approach to increasing within-person reliability is to increase the number of trials made at each assessment. According to Equation (2), the within-person reliability of the dot memory task could theoretically be increased to 0.80 if the number of trials were increased from 2 to 8. Doing so, however, would increase the assessment time on each occasion, adding to participant burden and possibly having an adverse effect on compliance. Moreover, increasing the length of cognitive tests administered in an ambulatory setting might increase the likelihood of an external distraction disrupting performance. Finally, certain time-varying influences, such as stress, could increase the trial-to-trial variability on a given occasion (Sliwinski et al., 2006), effectively reducing the within-person reliability of across-occasion measurements. The systematic study of procedures for ambulatory cognitive assessment is in its very early stages and we know very little about how to design cognitive tests that have optimal psychometric properties to detect systematic within-person effects, particularly across short time scales. Additional work is required to determine how to optimize within-person reliability of ambulatory cognitive assessment procedures.
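The projection from 2 to 8 trials follows from Equation (2) once the ratio of trial-level to occasion-level variance is recovered from the observed reliability. The sketch below is illustrative: the function name is hypothetical, and it assumes that adding trials leaves both variance components unchanged, which the caveats above suggest may not hold in practice.

```python
def project_within_person_reliability(current: float,
                                      trials_now: int,
                                      trials_new: int) -> float:
    """Project within-person reliability for a different number of trials.

    Inverts reliability = V_occ / (V_occ + V_trial / i) to recover the
    ratio V_trial / V_occ, then re-applies the formula with the new i.
    """
    if not 0.0 < current < 1.0:
        raise ValueError("current reliability must lie strictly between 0 and 1")
    if trials_now < 1 or trials_new < 1:
        raise ValueError("trial counts must be at least 1")
    # Ratio of trial-level (error) to occasion-level (systematic) variance.
    ratio = trials_now * (1.0 - current) / current
    return 1.0 / (1.0 + ratio / trials_new)


if __name__ == "__main__":
    # Dot memory task: reliability 0.50 with 2 trials, projected to 8 trials.
    print(round(project_within_person_reliability(0.50, 2, 8), 2))
```

For the dot memory task (reliability 0.50 with i = 2), the implied variance ratio is 2, and the projected reliability with 8 trials is 0.80, matching the value stated above.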
Validity of Ambulatory Cognitive Assessments
Two findings support the construct validity of the ambulatory working memory and perceptual speed tasks used in this study. First, the ambulatory dot memory and symbol search tasks exhibited similar correlations with age and fluid intelligence as the in-lab working memory and perceptual speed tasks, respectively. The ambulatory n-back task also correlated with fluid intelligence; however, it was not significantly correlated with age. The lack of association between the n-back and age likely resulted from the low difficulty level of 2-back procedures (compared with, e.g., 3-back tasks), and is consistent with other studies that have found no age differences on 2-back tasks administered in a laboratory setting (Schmiedek, Li, & Lindenberger, 2009; Wild-Wall, Falkenstein, & Gajewski, 2011). In addition, the ambulatory n-back task exhibited roughly equal first-order correlations with working memory and perceptual speed indicators. This may have been because the n-back is a speeded task and was not sufficiently difficult. Together, these two features of the ambulatory n-back task may have resulted in performance that was largely constrained by processing speed rather than working memory capacity because the working memory demands were relatively light. Despite its correlation with fluid intelligence and its high loading on the working memory factor, the ambulatory n-back task used in the current study did not exhibit good discriminant validity: it correlated as strongly with perceptual speed measures as it did with working memory measures. Therefore, the version of the ambulatory n-back task used in this study may not be a useful indicator of individual differences.
Second, a CFA showed that the ambulatory tasks loaded significantly and strongly on their hypothesized factors. Because these factors were also defined by standard in-lab cognitive tests, this result indicates that ambulatory working memory and speed tasks measure the same underlying constructs that researchers study in controlled testing environments. Although using different cognitive tests for the ambulatory and in-lab assessments precluded making direct comparisons between assessment modes, it also made the present demonstration of construct validity even more compelling. That is, evidence that the ambulatory and in-lab tests measure the same underlying constructs cannot be attributed to shared-method variance that could have resulted from using the same procedures for ambulatory and in-lab assessments. Supplementary analyses demonstrated that evidence of construct validity was not a by-product of the age heterogeneity of the sample and was age invariant.
It is also important to note, however, that these analyses only support the between-person construct validity of the ambulatory tasks for analyses of individual differences. We were not able to examine the construct validity at the within-person level since we did not have a sufficient number of ambulatory indicators (i.e., ≥3) of the working memory and speed constructs. Additionally, we did not explore possible validity threats to the ambulatory tasks, such as the impact of external distractions (e.g., noise). That said, our results provide a novel demonstration of a strong pattern of correlations between ambulatory cognitive tests administered in uncontrolled settings and standard working memory and processing speed tests administered in the laboratory.
Limitations and Considerations for Future Research
Although our results contribute to the literature on ambulatory cognitive assessment, they must be considered with respect to several limitations. First, the same procedures were not used in both the ambulatory and in-lab assessments, precluding direct comparison of performance differences between ambulatory and in-lab cognitive assessments. Some people might perform systematically worse (or better) in naturalistic environments compared with laboratory or clinical settings, perhaps due to features of their everyday environment, features of the testing environment, or their dispositions (e.g., test anxiety). In particular, ambulatory cognitive assessments might be more sensitive than in-lab assessments to contextual and time-varying influences, such as stress, physical symptoms, social activity, and health behaviors.
A second limitation is that each ambulatory cognitive assessment consisted of relatively few trials per task to minimize participant burden. One negative consequence is that the small number of trials per occasion may have reduced within-person reliability. In addition, the small number of trials precluded in-depth examination of variability in performance across faster time scales (i.e., within occasions), which other studies have shown to exist (Dirk & Schmiedek, 2015). Although systematic within-person variability may operate across time scales that we did not examine (e.g., across minutes), our findings demonstrated that fluctuations in memory and speeded performance throughout the course of a day (across hours) reflect systematic sources of intraindividual variability.
A third limitation is that we did not sample a broad range of memory functioning (e.g., secondary and primary memory) in either the laboratory or ambulatory setting. Secondary memory is strongly correlated with, and is an important component of, working memory (Unsworth & Spillers, 2010). Therefore, we cannot rule out the possibility that the ambulatory dot memory task is a better indicator of secondary than working memory.
A fourth consideration is that our protocol for sampling occasions prompted individuals to complete their cognitive assessments at unpredictable times during the day in order to obtain a representative sampling of situations. This type of repeated and random sampling of occasions allows ambulatory methods to characterize a person’s average (or typical) cognitive performance over a wide variety of contexts in their daily lives with a high degree of reliability. This approach to cognitive assessment, however, differs from traditional laboratory or clinic-based approaches, which seek to obtain an individual’s best or maximum rather than their average or typical performance. For some purposes, such as assessing scholastic aptitude, measuring an individual’s maximum performance might be ideal. However, for other purposes, such as detecting subtle cognitive impairment or the influence of social context on cognitive function, a person’s average performance may be more sensitive than their maximum performance.
Finally, the analyses reported do not yet demonstrate the added value of conducting ambulatory rather than in-lab cognitive assessments. Additional research is required to study the potential value added by obtaining intensive repeated measurements of cognition in ecological settings. Such demonstrations might involve comparing the predictive validity of ambulatory with in-lab cognitive assessments for some important health outcome (e.g., incident Alzheimer’s disease or mild cognitive impairment) or showing increased sensitivity to detecting cognitive change due to much higher reliability compared with single-shot assessments (Sliwinski, 2008). Other demonstrations could focus on identifying contextual and “real-time” determinants of cognitive function (such as emotional distress or rumination) as potential targets for time-sensitive interventions delivered through mobile technology (Heron & Smyth, 2010). We hope that these findings support and motivate future research to utilize ambulatory methods to study the influences of environmental, social, psychological, physiological, and behavioral factors on cognitive function in people’s natural environments.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by National Institutes of Health (NIH) grants R01 AG039409, R01 AG042595, P01 AG03949, CTSA 1UL1TR001073 from the National Center for Advancing Translational Sciences (NCATS), the Leonard and Sylvia Marx Foundation, and the Czap Foundation.
