Abstract
School systems across the country are transitioning from paper-based testing (PBT) to computer-based testing (CBT). As this technological shift occurs, more research is necessary to understand the practical and performance implications of administering CBTs. Currently, there is a paucity of research using CBTs to examine working memory (WM) performance, even though CBTs may negatively influence performance. The present study compared a WM CBT and PBT and found enhanced WM performance on the PBT across several verbal and visuospatial WM tests. This pattern was evident even after age was controlled, indicating that test mode effects were persistent across ages (4-11 years). WM CBTs may yield lower scores because of developmental WM differences, increased cognitive workload, test mode effects stemming from individual access to technology, and participant characteristics, such as developmental, biological, or gender differences. The divergence in WM performance between CBT and PBT indicates the need for additional options for children at risk of academic failure because of testing modality.
School systems across the country are transitioning their standard testing modality from paper-based testing (PBT) to computer-based administration (Boeve, Meijer, Albers, Beetsma, & Bosker, 2015). Technology that incorporates computer-based testing (CBT) is becoming increasingly prevalent in universities and secondary schools alike (Sedlacik & Cechova, 2016). These changes are driven not only by considerations of practicality, convenience, and efficiency but also by large-scale school computer integration (Randall & West, 2014). Since the early 2000s, several U.S. states have required that their high school students engage in primarily online learning (Jacob, Berger, Hart, & Loeb, 2016).
CBT is promoted as a way of revolutionizing education and improving student performance and test scores through unique computerized test-taking tools. Proposed benefits of CBT include improved security, accelerated result reporting, more efficient year-round testing, flexible scheduling, environmental benefits, and increased student focus and enjoyment (Boeve et al., 2015). In addition, students are increasingly adept with computerized technology, and the widespread integration of desktop computers, laptops, iPad devices, and smartphones in schools makes students today part of a “digitally native” generation (Katz & Elliot, 2016).
However, administering strictly CBTs may cause some students to perform poorly simply because of the shift from PBTs, not because of lower ability, a phenomenon referred to as the “test mode effect” (Leeson, 2009). Early theorists argued that for CBTs and PBTs to be equivalent, meaning similar in both content and necessary cognitive abilities, the two tests should produce identical results. The presence of test mode effects suggests that CBTs introduce confounding variables that negatively affect test scores. Test mode familiarity (TMF) deals primarily with participant issues, and holds that some students’ performance may suffer if they have limited experience with CBTs (Hanho, 2014).
Criterion Test Equivalency
Despite the shift toward CBTs and students’ increasing familiarity with technology, findings in criterion-based tests (e.g., math and reading) have been mixed. Some studies have found no difference between the two testing modalities (Hosseini, Abidin, & Baghdarnia, 2014; Wang, Jiao, Young, Brooks, & Olson, 2007) while others have found test mode effects (Jeong, 2012; Khoshsima & Toroujeni, 2017; Noyes & Garland, 2008). More relevant to the present article, recent research on cognitive assessments has found no significant differences between CBT and PBT performance (Collerton et al., 2007; Register-Mihalik et al., 2012).
Cognitive Assessment Equivalency
Collerton et al. (2007) compared the use of CBTs of cognitive function with PBTs. Eighty-one participants over the age of 85 were randomly assigned to the paper Wechsler Adult Intelligence and Memory Scale, or the computerized Cognitive Drug Research (CDR) verbal memory and attention assessment battery. These tasks were chosen on the basis that they assess the same cognitive areas as other 85+ cognitive studies, both take 30 min to complete, and have been previously validated to be accurate cognitive measures for people in this age group. Participants assigned to the CBT did not rate the test as difficult, unacceptable, or stressful. Participants who completed the PBT version finished the assessment at a higher speed, but did not score significantly better than those who took the CBT. The researchers concluded that TMF does not play a major role in these cognitive assessments and thus, CBTs and PBTs were equally feasible and valid measures of cognitive function.
In addition, format equivalence between paper-based and computer-based versions of cognitive intelligence scales (e.g., Wechsler Preschool and Primary Scale of Intelligence–Fourth Edition [WPPSI-IV]) has been examined using Q-interactive formats (i.e., iPads; Daniel, Wahlstrom, & Zhang, 2014; Raiford et al., 2016; Wechsler, 2008). The overall goal is to demonstrate raw score equivalence (e.g., equivalence across paper-based and digital administration) through reliability and validity analyses of the paper format and the subsequent application of the digital format. Raiford and colleagues (2016) tested this in 38 children aged 2 to 7 by comparing performance on an iPad-administered WPPSI-IV to a PBT WPPSI-IV. The effect sizes fell within the Q-interactive equivalence format, and the authors concluded that regardless of test format, the WPPSI-IV produces equivalent scores.
Working Memory (WM)
WM (i.e., the ability to process and recall information) is critical for a variety of activities at school, ranging from complex subjects such as reading comprehension, mental arithmetic, and word problems to simple tasks like copying from the board and navigating around school (Alloway & Copello, 2013). WM is important from kindergarten to the tertiary level and is an excellent longitudinal predictor of academic success (Alloway & Alloway, 2010).
There are multiple models of WM, some of which incorporate concepts of attention in memory (Engle, Kane, & Tuholski, 1999) and temporal duration in performing memory tasks (Barrouillet, Bernardin, & Camos, 2004), while others suggest that WM is an activated component of long-term memory (Cowan, 2005). The model used in the present study is Baddeley’s (2000) model, which consists of four components: the central executive, the phonological loop, the visuospatial sketchpad (VSSP), and the episodic buffer. The central executive is responsible for the high-level control and coordination of the flow of information through WM, including the temporary activation of long-term memory. It has also been linked with control processes such as switching, updating, and inhibition (Baddeley, 1996). The central executive is supplemented by two slave systems specialized for storage of information within specific domains. The phonological loop provides temporary storage for linguistic material, and the VSSP stores information that can be represented in terms of visual or spatial structure. The fourth component is the episodic buffer, responsible for integrating information from different components of WM and long-term memory into unitary episodic representations (Baddeley, 2000). This model of WM has been supported by evidence from studies of children, adult participants, and neuropsychological patients (Gathercole & Baddeley, 1993).
The key feature of WM is its capacity to both store and manipulate information. WM functions as a mental workspace that can be flexibly used to support everyday cognitive activities that require both processing and storage, such as mental arithmetic. However, the capacity of WM is limited, and the imposition of either excess storage or processing demands in the course of an ongoing cognitive activity will lead to catastrophic loss of information from this temporary memory system. In contrast, short-term memory refers to the capacity to store units of information, and is typically assessed by serial recall tasks involving arbitrary verbal elements such as digits or words. In a similar vein, recent models have been created to improve WM function and have been found to be effective for cognitive improvement (Jaeggi, Buschkuehl, Jonides, & Shah, 2011).
WM and CBTs
Some aspects of WM, such as the VSSP, may be more affected by CBTs than others. The VSSP consists of two subsystems: one that maintains spatial information, and the other for visual information (Pickering, Gathercole, Hall, & Lloyd, 2001). Some researchers argue that developmental fractionation, or the differential maturation rates of visual and spatial memory, provides evidence that the two components of the VSSP may in fact be unrelated constructs (Hitch, 1990). Logie and Pearson (1997) investigated this in children (5-12 years) using a memory test for sequences of movements to targets, or visual patterns. Results illustrated that memory for patterns was better than memory for movement sequences in the oldest group. This was counterbalanced in another condition and the same pattern emerged. These findings support the idea that visuospatial WM includes distinct forms of storage for visual and spatial material due to developmental differences across ages.
In a similar vein, Choi and Tinker (2002) found that developmental differences in third and 10th graders affected performance on CBT assessments. Specifically, there was a higher mode effect for tasks involving a textual focus (i.e., phrases or words) compared with a sentence-scanning task. The researchers concluded that when students are younger, it is easier for them to scan for words on PBTs versus CBTs, possibly because of developmental fractionation in the VSSP. The older students stored and processed information differently than the younger students, which may result in test mode effects of the CBT. More recently, Colbert and Bo (2016) examined change-detection tasks (i.e., WM assessment) and evaluated the relationship to the Wechsler Intelligence Scale for Children (WISC-IV). Results indicated significant age-related improvements on verbal and visuospatial WM, meaning that as students age, multiple subsets of WM improve.
An alternative theory is that the activation of the subcomponents of the VSSP depends on a dynamic/static distinction in the stimulus rather than a spatial/visual one. Pickering et al. (2001) conducted an experiment with 5-year-old, 8-year-old, and 10-year-old children who completed a mazes visuospatial WM task. Static and dynamic presentation formats were used to observe the differences in the two subcomponents. In the mazes static condition, a child was presented with empty mazes in a booklet and asked to recall the different routes denoted by a clear red line. The maze booklet was presented to the child for 3 s and then the child was asked to redraw the routes after the booklet was removed. In the mazes dynamic condition, the child was presented with the same booklet, but the experimenter demonstrated the route the child should draw by tracing it with their finger. The child was then prompted to draw the route the experimenter just traced. Testing continued in both the static and dynamic maze tasks until the child incorrectly drew the patterns. Results indicated that the 8- and 10-year-old children performed significantly better than the 5-year-olds on the static version of the maze tasks, but not on the dynamic version. This suggests that the two components of visuospatial memory may be dependent on the static/dynamic features of the stimulus and that VSSP development occurs earlier with static stimuli than with dynamic stimuli. These findings are related to mode effects in that CBTs may affect the VSSP more in older children and lead to higher performance in the static condition. In the present study, we integrated age effects on testing modality using similar age bands confirmed by past WM modes (Alloway, Gathercole, & Pickering, 2006).
Current Study
As schools across the country change their testing modalities from PBT to CBT assessments, researchers and educators need to ensure that this shift is beneficial for all students. To date, little research has been conducted on the comparability of WM performance on CBTs to PBTs, and given the implications of WM in education (Alloway & Alloway, 2010), further investigation is needed to ensure that CBTs are not negatively influencing WM performance. Researchers should investigate how CBTs affect WM performance to improve learning support, and develop accurate cross-modal measures of WM.
Method
Participants
Overall, 1,339 students (656 males, 683 females, Mage = 7.54 years, age range = 4-11) were recruited from several schools in England. For the participants aged 4 to 11, a nationally representative demographic sample was recruited on the basis of the national average of performance on national assessments in English, mathematics, and science that pupils sit for in the final year of primary school at the age of 10 or 11. Schools in England are ranked on the basis of a combined or “aggregate” score achieved in the three tests, the maximum possible being 300 (Department for Education and Skills). Schools located in both urban and rural settings were selected for the normative sample to represent a range of low, average, and high performance in the combined score of the national test results. Participant demographics can be found in Table 1.
Table 1. Demographics Inclusive of Age Bands and Test Modality.
Note. PBT = paper-based testing; CBT = computer-based testing.
Materials and Procedure
Participants were randomly assigned to a paper-based version of the Automated Working Memory Assessment (AWMA) or the computer-based AWMA. The AWMA (Alloway, 2007) consists of tests measuring verbal short-term memory, verbal WM, and visuospatial memory. Standard scores (M = 100; SD = 15) for individual tests and composite scores for each memory component were generated automatically by the AWMA for each child based on their age.
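The standard-score scale described above (M = 100, SD = 15) can be illustrated with the generic linear transformation from a raw score to a standard score. This is only a sketch: the AWMA derives its scores from its own age-based norms, which are not reproduced in this article, so the function name and the norm mean and SD used here are hypothetical placeholders.

```python
def standard_score(raw, norm_mean, norm_sd, scale_mean=100.0, scale_sd=15.0):
    """Convert a raw score to a standard score on an M = 100, SD = 15 scale.

    norm_mean and norm_sd stand for the raw-score mean and SD of the child's
    age band in a normative sample. They are hypothetical inputs for
    illustration, not values taken from the AWMA manual.
    """
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    z = (raw - norm_mean) / norm_sd      # distance from the age-band mean in SD units
    return scale_mean + scale_sd * z     # rescale to the standard-score metric


# A child scoring exactly at the (hypothetical) age-band mean receives 100;
# one SD above the mean receives 115.
print(standard_score(20, norm_mean=20, norm_sd=4))
print(standard_score(24, norm_mean=20, norm_sd=4))
```

Because the transformation is anchored to the child's own age band, a standard score of 100 means "average for that age" regardless of the raw score that produced it.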
The three verbal short-term memory measures were digit recall, word recall, and nonword recall. In each test, the child hears a sequence of verbal items (digits, one-syllable words, and one-syllable nonwords, respectively) and has to recall each sequence in the correct order. The stimuli were presented at a rate of one per second. The normative sample included individuals aged 4.5 to 22.5 years, and test–retest reliability is .88, .89, and .69 for digit recall, word recall, and nonword recall, respectively.
The three verbal WM measures were listening recall, backward digit recall, and counting recall. In the listening recall task, the child is presented with a series of spoken sentences, has to verify each sentence by stating “true” or “false,” and recalls the final word of each sentence in sequence. In the backward digit recall task, the child is required to recall a sequence of spoken digits in the reverse order. The numbers are presented at a rate of one per second. In the counting recall task, the child is presented with a visual array of red circles and blue triangles. He or she is required to count the number of circles in an array and then remember the tallies of circles in the arrays that were presented. Each visual array stayed visible until the child indicated that she or he had completed counting all the circles. For individuals aged 4.5 to 22.5 years, test–retest reliability is .88, .83, and .86 for listening recall, counting recall, and backward digit recall, respectively.
Two measures of visuospatial short-term memory were administered. In the mazes memory task, the child is shown a maze with a red path drawn through it for 3 s. She or he then has to trace the same path on a blank maze presented on the computer screen. In the block recall task, the child views a video of a series of blocks being tapped, and reproduces the sequence in the correct order by tapping on a picture of the blocks. The blocks were tapped at a rate of one block per second. For individuals aged 4.5 to 22.5 years, test–retest reliability is .86 and .90 for mazes memory and block recall, respectively.
Tests were administered in a fixed sequence designed to vary task demands across successive tests. The CBT version of the memory tests was presented on a laptop computer with the screen resolution set to 600 by 480 pixels. For the spoken presentation of stimuli, audio files were recorded using a minidisk player and then edited in the GoldWave program (2004). All picture files were created in Microsoft PowerPoint using the standard shape graphics. The test trials were presented as a series of blocks; each block consists of six trials. The CBT version used audio files, while the experimenter read out the stimuli in the PBT version to a group of participants. The timing for stimuli presentation was the same for both the CBT and PBT versions, as indicated above. Participants were required to respond immediately after stimulus presentation, though there was no set time limit for responding in either version. The instructions were presented as a sound file while the computer screen was blank. Practice trials followed the instructions. Participants in either version were not allowed to return to any questions (i.e., item review).
For scoring in the CBT version, the experimenter recorded the child’s response using the right arrow key on the keyboard (→) for a correct response and the left arrow key on the keyboard (←) for an incorrect response. The computer program automatically credits a correct trial with a score of 1. In the PBT version, the experimenter records the participant’s response in the score booklet. According to the “move on” rule, if a child responds correctly to the first four trials within a block of trials, they automatically receive credit for trials that were not administered. However, if three or more errors are made within a block of trials, the test is discontinued. In the CBT version, the scoring, move on, and discontinue rules are automated. The score for that test reflects the number of correct responses up to the point at which the test was ended. Additional information regarding the psychometric properties (reliability and validity) can be found in the manual (Alloway, 2007). Further details of test reliability and validity are reported in Alloway, Gathercole, Kirkwood, and Elliott (2009).
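The move on and discontinue rules described above can be sketched in a few lines of code. This is a minimal illustration rather than the AWMA's actual implementation: the function name, the input format, and the treatment of a block that ends with fewer than three errors (the test simply proceeds to the next block) are assumptions made for the example.

```python
TRIALS_PER_BLOCK = 6      # each block consists of six trials
MOVE_ON_AFTER = 4         # first four trials correct -> move on with full credit
DISCONTINUE_ERRORS = 3    # three errors within a block -> test is discontinued


def score_test(responses_by_block):
    """Return the total score for one test.

    responses_by_block is a hypothetical input format: a list of blocks, each
    a list of booleans giving the child's response (True = correct) to the
    trials of that block in administration order.
    """
    total = 0
    for block in responses_by_block:
        correct = 0
        errors = 0
        for index, is_correct in enumerate(block[:TRIALS_PER_BLOCK]):
            if is_correct:
                correct += 1
            else:
                errors += 1
            if errors >= DISCONTINUE_ERRORS:
                # Discontinue: count correct responses up to this point only.
                return total + correct
            if index + 1 == MOVE_ON_AFTER and errors == 0:
                # Move on: credit the trials in this block not administered.
                correct = TRIALS_PER_BLOCK
                break
        total += correct  # assumption: otherwise continue with the next block
    return total


# Block 1: first four trials correct, so all six are credited.
# Block 2: the third error occurs on the fifth trial, ending the test.
blocks = [
    [True, True, True, True],
    [True, False, True, False, False, True],
]
print(score_test(blocks))  # 6 credited + 2 correct before discontinuation = 8
```

In the PBT version the same bookkeeping is done by the experimenter in the score booklet, which is why identical rules can still be a source of administration differences between the two modes.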
Results
Age
Descriptive statistics are shown in Table 2, Table 3, and Table 4. To compare the effects of age and test type (CBT or PBT) on WM performance, three MANOVAs were conducted on the three sets of memory measures (verbal short-term memory, verbal WM, and visuospatial memory), with age band (4-6, 7-8, 9-11) and test type as the fixed factors. The overall group term associated with Hotelling’s t test is reported in all instances, with Bonferroni-corrected post hoc comparisons set at an alpha level of .0125.
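For readers unfamiliar with the Bonferroni correction, the arithmetic behind a per-comparison alpha can be sketched as follows. The article does not state the family-wise alpha or the number of comparisons; the values below (.05 and four comparisons) are assumptions chosen only because they reproduce the reported threshold of .0125.

```python
def bonferroni_alpha(familywise_alpha, n_comparisons):
    """Per-comparison alpha under a Bonferroni correction."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return familywise_alpha / n_comparisons


# Assumed inputs (not stated in the article): family-wise alpha of .05
# divided across four post hoc comparisons.
print(bonferroni_alpha(0.05, 4))
```

Dividing the family-wise alpha by the number of comparisons keeps the overall probability of at least one false positive at or below the family-wise level.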
Table 2. Descriptive Statistics as a Function of Test Types Separated by Age Band 4-6.
Note. PBT = paper-based testing; CBT = computer-based testing.
Table 3. Descriptive Statistics as a Function of Test Types Separated by Age Band 7-8.
Note. PBT = paper-based testing; CBT = computer-based testing.
Table 4. Descriptive Statistics as a Function of Test Types Separated by Age Band 9-11.
Note. PBT = paper-based testing; CBT = computer-based testing.
The first MANOVA performed on the verbal short-term memory tests indicated significant main effects for age group, F(6, 2660) = 123.23, p < .001,
The second MANOVA performed on the verbal WM tests indicated significant main effects for age group, F(6, 2480) = 175.47, p < .001,
The third MANOVA performed on the visuospatial memory tests indicated significant main effects for age group, F(4, 2482) = 273.70, p < .001,
Gender
To compare the effect of gender on performances on the WM tests (CBT & PBT), three separate MANCOVAs were conducted on the memory scores, with age as a covariate and sex and testing types as the independent factors. The overall group term associated with Hotelling’s t test is reported in all instances, with Bonferroni correction post hoc comparisons set at an alpha level of .05.
The first MANCOVA performed on the verbal short-term memory tests indicated a significant group difference for testing type, F(3, 1332) = 41.16, p < .001,
The second MANCOVA performed on the verbal WM tests indicated a significant group difference for testing type, F(3, 1242) = 63.77, p < .001,
The final MANCOVA performed on the visuospatial memory tests indicated a significant group difference for testing type, F(2, 1243) = 496.53, p < .001,
Discussion
The primary finding of this study was that WM performance was better on a PBT WM assessment than a CBT assessment across several WM tests. This pattern was evident even after age was controlled, meaning that the CBT/PBT effects were persistent across ages (4-11 years). Additional analyses that divided children into theoretically derived developmental bands (e.g., 4-6, 7-8, 9-11) yielded similar results. To date, little research has been conducted comparing WM performance on CBTs to PBTs, but the present results suggest that CBTs may yield lower scores in assessing WM. Possible reasons include increased cognitive load, technological mode effects due to socioeconomic status (SES), and developmental and biological gender differences. In addition, item review, or the ability to review and change answers, was not permitted on either the PBT or CBT AWMA, and may have influenced performance levels. WM assessments have an important temporal element, but item review may give underperforming students a chance to improve by taking more time to review their submitted answers.
Recall was significantly better on the PBT than the CBT across item type (digits, one-syllable words, and one-syllable nonwords) on verbal short-term memory. One explanation for the enhanced PBT performance may be the increased cognitive workload and subsequent decrease in WM capacity observed in past research on computer screens (Mayes, Sims, & Koonce, 2001; Noyes & Garland, 2003). If more resources are allocated to a WM task (e.g., rehearsal of a word list), the workload of the individual is increased, and their WM capacity is lowered. Mayes et al. (2001) investigated the effects of cognitive workload on WM capacity by comparing comprehension performance on a paper test to performance on a visual display terminal (VDT) test. Their findings indicated that VDTs increased reading time and cognitive load and negatively affected comprehension scores. Noyes and Garland (2003) speculated that the increased cognitive load may have decreased the amount of rehearsal processing, leading to decreased WM which affected comprehension.
The significant interaction effects for nonword recall may stem from the fact that verbal short-term memory does not operate in isolation, but in a complex cognitive system (Archibald, 2006). Nonwords high in “wordlikeness” (more similar to words that participants already know, known as the wordlikeness effect) may have played a role in the better performance of older children. The older the participant, the better their attention to nonwords that was developed through repetition accuracy for both low- and high-wordlike nonwords. Increased cognitive load from the WM task itself (i.e., high demand rehearsal activity) may have led to a decrease in ability to store information due to the technological mode effects of CBTs. Mayes et al. (2001) found that people required more time reading from computer screens than on paper to retain the same material, and argued that this effect was due to the higher cognitive workload CBTs placed on the information processing system. In this study, attending to the CBT may have been more difficult than attending to the PBT, which may explain the lower performance on this modality. In contrast, Prisacari and Danielson (2017) did not find differences in cognitive load between those taking a CBT and those taking a PBT assessment. Participants (n = 222) completed three chemistry CBT or PBT assessments, and the researchers found no significant differences between the testing modalities; however, students used scratch paper more for PBT assessments than for CBT assessments, especially for algorithmic questions. The researchers believed that if cognitive load and certain behaviors of students are different between testing modes, teachers might have reason to select one mode over another. Using scratch paper may be more difficult in a CBT because students have to copy down the question and refer to both the computer and the scratch paper, while on a PBT the question would already be near their scratch paper.
To examine cognitive load, a paper N-back task should be compared with a computerized version. To date, this test has only been administered in a computerized format, but may provide exceptional training for improving WM (Soveri, Antfolk, Karlsson, Salo, & Laine, 2017).
Computer fatigue (CF) from technological mode effects such as computer resolution can influence performance and create higher cognitive load. Although the resolution used in this experiment was high (600 × 480 pixels), Ziefle (1998) found through the examination of eye movement parameters that reading from paper produces less fatigue than computers, particularly computers with poor screen resolution. Higher levels of fatigue in the CBT condition could be attributed to the flicker of the screen, phosphorescence effects, and the glare of the screen (Krummenacher, 1996). Although access to high-resolution computers has improved in schools, Grey, Thomas and Lewis (2010) found that the ratio of high-resolution computers to students was 1 to 5. Currently, CF may be less pronounced because of improvements in screen resolution and school computer availability, but needs to be examined at state and city levels to determine geographical disparities in technology access (Hohlfeld, Rizhaupt, Dawson, & Wilson, 2017). Indeed, SES may be a large factor in which schools receive these types of computers. For individuals in schools with access to better technology, mode effects may not be an issue, while less affluent students may suffer from computer-induced deficits due to less access and overall practice, which decreases their familiarity with test mode and hence performance (Dolan, 2016).
Other technological mode effects including browsing and tactile influences may have led to lower WM scores on CBTs. Participants may have encountered spatial knowledge issues on the CBT verbal WM counting recall assessment in which students were asked to count the number of circles and triangles presented. This task may have been easier on paper because the participant could browse the exam, locate and point to the shapes using their hand, and easily remember their locations, as opposed to viewing them on a computer screen. Touching (i.e., tactile stimulation) information has also been found to lead to better memory (Bigelow & Poremba, 2014). Participants were not as able to use these tactile skills on the CBT visuospatial mazes task which required participants to recall a drawn path in a maze using their finger. It is more difficult to use your finger on a computer screen due to having to stretch your arm upward and not wishing to diminish the screen’s value.
In addition to mode effects, biological and gender differences may have played a role in performance differences across testing modalities. A meta-analysis by Maeda and Yoon (2013) summarized five key factors that may explain the differences in male and female performance: strategic, experiential, affective, biological, and test administration. Experiential refers to findings that males play more computerized video games than females, which provides them with an advantage on visuospatial tasks (Cherney, 2008). In the current study, males’ elevated performance on the mazes visuospatial task may be attributable to more experience with computer games increasing their visuospatial performance. This would also relate to Maeda and Yoon’s (2013) biological concept regarding gender differences, including hormonal and neuronal brain differences involved in different spatial awareness and mental rotation ability (Koscik, Leary, Moser, Andreasen, & Nopoulos, 2009), developmental differences for each sex, and different gender approaches to problem solving (Geiser, Lehmann, & Eid, 2006). Female participants in the current study may have had lower self-efficacy than males due to commonly held beliefs regarding visuospatial tasks, possibly creating the performance results (Moè, 2009).
CBTs may also differentially affect participants with lower automaticity on cognitive tasks. Automaticity refers to the ability to perform a task efficiently due to repeated practice, and thus streamlines information processing. People with lower levels of automaticity and WM capacity may have more difficulty transferring information from sensory memory to WM, decreasing efficiency of memory search and retrieval processes (Endres, Houpt, Donkin, & Finn, 2015). Coupled with having to take the test on the computer, having generally lower automaticity and rehearsal processes may have exacerbated the difference in performance.
Strategies for automaticity in information processing may also be underdeveloped in those who do not have access to computers, further compounding the TMF effect. Mayer and Moreno (2003) noted that by enhancing automaticity in WM, space could be devoted to other cognitive demands, which in turn reduced cognitive load and led to better learning outcomes. Instructors should activate existing knowledge prior to instruction to facilitate learning new information. Improving attentional control and enhancing automaticity for WM demands may lead to higher metacognitive control and improved WM, which ultimately could improve classroom performance.
Wallace and Clariana (2000, 2005) argued that computer-based practice exams should be readily available if CBTs are to be ubiquitously implemented, to better prepare students for computerized assessments. Students who are more competitive, lack access to computers on a regular basis, are not confident with computers, or perform poorly on WM assessments (possibly due to biological or gender effects) should have an option for a PBT. To understand the components of WM affected by test modality, future studies should include additional measures of specific WM processes such as phonological awareness, use testing modalities that include pointers to locate items on the screen and on paper, collect frequency data on their use, and administer WM assessments other than the AWMA to see whether they produce similar test mode effects.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
