Abstract
Although information is frequently paired with music to enhance recall, there is a lack of basic research investigating how aspects of recorded music, as well as how it is presented, facilitate working memory. Therefore, the purpose of this study was to determine the effects of visual and aural presentation styles, rhythm, and participant major on working memory as measured by sequential monosyllabic digit recall performance. We isolated visual and aural presentation styles and rhythm conditions during six different treatment stimuli presented on a computer screen in the study: (a) Visual Rhythm; (b) Visual No Rhythm; (c) Aural Rhythm; (d) Aural No Rhythm; (e) Visual + Aural Rhythm; (f) Visual + Aural No Rhythm. Participants’ (N = 60; 30 nonmusic majors and 30 music majors) task was to immediately recall the information paired with music within each condition. Analyses of variance indicated a significant difference between the visual and visual + aural presentation style conditions with the visual + aural condition having more accurate recall. While descriptive data indicated that rhythm tended to facilitate recall, there was no significant difference between rhythm and no rhythm conditions. Nonmusic major participants tended to have slightly more accurate recall than music major participants, although this difference was not significant. Participants tended to have higher recall accuracy during primacy and recency serial positions. As participants had most accurate recall during the visual + aural presentation style conditions, it seems that the multi-sensory presentation modes can be effective for teaching information to be immediately recalled as long as they do not contain too much information and overload the limited storage capacity of working memory. Implications for clinical practice, limitations, and suggestions for future research are provided.
Literature review
Researchers have studied working memory over several decades (Baddeley, 2010) to better understand this complex phenomenon that is vital for learning (Alloway, 2006). Cowan (1998) described working memory as a group of processes that holds information in a form that can be used to assist in cognitive functions. Highlighting the relevance of working memory and its importance in learning, Baddeley (1992) summarized that “Working memory stands at the crossroads between memory, attention, and perception” (p. 559).
In a classic and frequently cited paper, Miller (1956) proposed that working memory has a limited capacity that can be augmented by grouping (or “chunking”) information together. Moreover, in working memory tasks, people tend to recall items at the beginning and end of a list with greater accuracy than items in the middle. This is known as the serial position effect (e.g., Murdock, 1962). When input is subdivided into smaller groups, authors have noted the expected serial position effect, with higher recall for the beginning and end of each group and lower recall in the middle (Hartley et al., 2016; Hitch et al., 1996).
A plethora of researchers have studied the effects of musical training and musicianship on working memory (Fujioka et al., 2006; Hansen et al., 2013; Lee et al., 2007; Silverman, 2007, 2010, 2012; Silverman & Schwartzberg, 2014, 2019; Talamini et al., 2016, 2017). In a study comparing musicians’ and nonmusicians’ scores on audio, visual, and audiovisual digit span tasks, Talamini et al. (2016) found that musicians scored higher than nonmusicians overall, and specifically in the audio and audiovisual conditions. There was no difference between the two groups in the visual input condition. The authors also reported a correlation between scores on the audio and audiovisual digit span tasks and the melodic sub-score of a measurement of musicianship. Hansen et al. (2013) also found that musicians outperformed nonmusicians on a forward digit span task. However, the authors found that rhythmic, but not melodic, scores significantly correlated with performance on the forward digit span.
There is empirical support for connections between rhythm and working memory (Farrell, 2008; Saito, 2001) and between temporal processing, musicianship, and verbal recall (Jakobson et al., 2003). For example, Saito (2001) found a correlation between scores on a rhythm task and scores on digit span tasks presented both visually and auditorily. In this study, the visual input condition correlated more strongly with rhythmic scores than did the auditory input condition (Saito, 2001). Researchers have repeatedly demonstrated that consistent rhythmic or temporal grouping can enhance scores on short-term (Dowling, 1973; Hartley et al., 2016) and working memory (Bower & Winzenz, 1969; Fanuel et al., 2018; Plancher et al., 2018) tasks. More specifically, researchers have found evidence that rhythmic input increases scores on digit span tasks (Ee et al., 2015; Frankish, 1985; Hartley et al., 2016; Hitch et al., 1996; Silverman, 2007, 2010). In the first of a series of experiments on temporal grouping and digit recall, Frankish (1985) found that grouping improved scores for visually presented stimuli. The author then compared grouped sequences presented visually or both visually and auditorily and found that participants scored higher in the bimodal condition. A limitation of this study, however, was that the auditory input was accomplished by participants reading the digits that they saw aloud (Frankish, 1985). The act of speaking might have confounded results by improving memory through involvement of brain regions involved in speech. In another study on the effects of temporal grouping on working memory tasks with auditory or visual input, participants also performed significantly better with grouped stimuli (Hitch et al., 1996). In this study, the benefits of grouping were more pronounced in the auditory than in the visual condition.
Presentation style can also impact working memory and subsequent learning. Penney (1989) reviewed the effects of auditory compared to visual input on verbal short-term memory and hypothesized meaningful differences between how auditory and visual input are coded and processed. For example, there may be a tendency for auditory input to involve sequential thinking, whereas visual input can be used to take in multiple stimuli at once (Penney, 1989). Researchers have studied the effects of input modality on working memory (Hitch et al., 1996; Mishra et al., 2013; Powell & Hiatt, 1996; Silverman & Schwartzberg, 2019; Talamini et al., 2016; Yang et al., 2015). Notably, Mishra et al. (2013) found that scores on a digit recall task were higher when participants had both auditory and visual input instead of only auditory input when there was no concurrent executive control task. The authors also found a significant interaction between input mode and serial position, such that scores in the audio and visual condition were higher than those in the audio condition for numbers at the beginning and end of the sequence, but not in the middle of the sequence (Mishra et al., 2013). Similarly, Powell and Hiatt (1996) examined the effects of visual versus auditory input on scores for forward and backward digit spans. In this study, participants were able to remember more digits in the visual than in the auditory condition during the backward digit span, but there was no difference between modalities in the forward digit span.
Silverman and Schwartzberg (2019) studied the effects of auditory compared to auditory and visual presentation on digit recall. They further differentiated between three different types of auditory input: Chant, melody, and speech. In this study, the chant and melody conditions – but not the speech condition – included a repeating rhythmic phrase identical to the one used in the present study. During the visual conditions within this study, participants saw a woman who provided the information to be recalled. Silverman and Schwartzberg (2019) found that scores in the auditory conditions were significantly higher than their counterpart scores in the auditory and visual conditions for speech and melodic presentations. In the auditory and visual conditions, participants performed significantly worse in the speech condition than in the two other rhythmic input conditions. Finally, the authors reported primacy and recency serial effects in the data, and they also found that music majors’ scores were higher than those of nonmusic majors, although this difference was not significant.
The current study builds on Silverman and Schwartzberg (2019) by isolating the rhythmic element of information to be recalled and by continuing to study the differences between auditory and visual input for musicians and nonmusicians. However, in Silverman and Schwartzberg (2019), participants watched a video of a person delivering the information to be recalled. As there is a considerable amount of video-based learning wherein videos do not include visuals of the people delivering the information, the current study aims to address a gap in the literature by only displaying the information to be recalled (i.e., Arabic numerals but not a visual of the person delivering the information) on a computer screen. Therefore, the purpose of this study was to determine the effects of visual and auditory presentation styles, rhythm, and participant major on working memory as measured by sequential monosyllabic digit recall performance. Specific research questions included:
Are there between-group differences between music majors and nonmusic majors concerning recall? Do participants have more accurate recall when they see, hear, or both see and hear the information to be recalled? Do participants have more accurate recall when presented with information paired with rhythm? Do participants have more accurate recall during serial positions of primacy and recency?
Method
Research participants
Participants were 60 undergraduates from a large public university located in an urban area of the upper Midwestern United States. Participants had normal hearing and normal or corrected vision. Participants volunteered to take part in the study and were not paid. Of the 60 total participants, 30 participants were music majors (23 female and 7 male) and 30 participants were nonmusic majors (19 female and 11 male). Participants were recruited via word of mouth and various classes within the school of music and were tested individually in a small office without windows. There was no between-group difference concerning age (p > .05) but there was a between-group difference in years of musical training, F(1, 58) = 29.58, p = .001, partial η2 = .338. Music majors (M = 12.50, SD = 3.49) had more years of musical training than nonmusic majors (M = 6.77, SD = 4.60). The authors’ affiliated Institutional Review Board approved the study (1307S39441).
Digits test
To assess working memory, the researchers utilized six separate nine-digit sequences drawn from the monosyllabic digits one through 10 (Egner et al., 2016; Miller, 1956; van Merrienboer & Sweller, 2005). Monosyllabic digits were used to avoid any previously established paired-associate relationships (Chase & Ericsson, 1982). Each digit occurred once in each series and functioned as the text for each melody. For example, the digit sequence for one condition was 4, 3, 6, 9, 10, 8, 2, 5, 1. The serial position of each digit in each of the six sequences was determined randomly via a computer program. The range of potential scores on each of the six separate digit tests was zero (low recall) to nine (high recall).
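The sequence construction described above can be sketched as follows. This is a minimal illustration rather than the authors' program: the digit pool (one through 10 without the disyllabic seven, which is consistent with the example sequence), the function name `make_sequences`, and the seed are all assumptions.

```python
import random

# Assumed pool: the digits 1-10 minus the disyllabic "seven",
# which matches the nine digits in the example sequence.
MONOSYLLABIC_DIGITS = [1, 2, 3, 4, 5, 6, 8, 9, 10]


def make_sequences(n_sequences=6, seed=None):
    """Return n_sequences random orderings; each digit occurs once per sequence."""
    rng = random.Random(seed)
    sequences = []
    for _ in range(n_sequences):
        order = MONOSYLLABIC_DIGITS[:]  # copy so the pool is not mutated
        rng.shuffle(order)
        sequences.append(order)
    return sequences


if __name__ == "__main__":
    for seq in make_sequences(seed=1):
        print(seq)
```

Because each sequence is a permutation of the same nine digits, every digit test has the same zero-to-nine score range.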
The researchers isolated visual and auditory presentation styles and rhythm and thus utilized six different treatment stimuli in the study: (a) Visual Rhythm; (b) Visual No Rhythm; (c) Aural Rhythm; (d) Aural No Rhythm; (e) Visual + Aural Rhythm; (f) Visual + Aural No Rhythm. During visual presentation style conditions, participants were able to only see the Arabic numbers to be recalled on the computer screen. Numbers were presented in red font on a black screen. During the aural presentation style conditions, participants did not see the numbers but heard the numbers to be recalled via a computer program (http://www.apple.com/accessibility/osx/voiceover/). During the visual + aural presentation style conditions, participants were able to both see the numbers on the computer screen as well as hear them. During the rhythm conditions, information was presented in a format that repeated two eighth notes followed by a quarter note three times (quarter note MM = 60). During the no rhythm conditions, the duration value of the notes was a quarter note (quarter note MM = 60). In the conditions with visual presentation style and rhythm, the numbers to be recalled were visually depicted using rhythm in the same manner as the aural presentation style only conditions (i.e., numbers were presented on the computer with quarter and half notes at MM = 60). This method was adapted from Farrell (2008), Frankish (1985), and Vitulli and McNeil (1990).
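The timing of the two rhythm conditions can be worked out from the stated tempo. The sketch below assumes that each digit begins on a note onset and that the rhythm pattern is the eighth-eighth-quarter grouping described for the rhythm conditions; the function name `onsets` is hypothetical.

```python
BPM = 60                # quarter note MM = 60, as stated in the text
QUARTER = 60.0 / BPM    # seconds per quarter note
EIGHTH = QUARTER / 2    # seconds per eighth note


def onsets(durations):
    """Return the onset time in seconds of each digit, given its note duration."""
    times, t = [], 0.0
    for d in durations:
        times.append(t)
        t += d
    return times


rhythm = [EIGHTH, EIGHTH, QUARTER] * 3  # two eighths + quarter, three times
no_rhythm = [QUARTER] * 9               # nine equal quarter notes

print(onsets(rhythm))     # [0.0, 0.5, 1.0, 2.0, 2.5, 3.0, 4.0, 4.5, 5.0]
print(onsets(no_rhythm))  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
```

Under this reading, the rhythm pattern groups the nine digits into three chunks of three, whereas the no rhythm pattern spaces them evenly.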
Procedure
After a trained and compensated undergraduate test administrator explained and obtained informed consent, participants listened to instructions and were allowed to ask questions to ensure they comprehended the study, procedure, and task. Participants then heard a recorded voice say “Ready…Go” and saw, heard, or saw and heard the nine digits on a 21-inch Apple desktop computer. After each set of digits, the voice instructed “Record your answers” and participants were then allowed to immediately write the digits they had heard onto a response sheet. There was no wait time between the last presented digit and the instruction to “Record your answers.” This process was repeated for each of the six conditions and the digits test thus assessed immediate recall. Participants were instructed to leave the answer space blank if they could not recall a number and were allotted as much time as necessary to write their answers on the response sheets. Participants were allowed to ask questions after the instructions were provided but did not complete a practice example. Participants were allowed to adjust the computer amplitude to a level that remained constant throughout the session (Jellison, 1976; Jellison & Miller, 1982) and completed the study in approximately seven minutes.
Previous researchers have noted order, learning, and practice effects can exist during studies that use repeated-measures designs to assess recall (Steel et al., 1997). The researchers thus used a Latin Square Design in an attempt to control for learning, order, and carry-over effects (Heiman, 2002). Participants heard each of the six different sequences in one of six possible orders.
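One common way to build six counterbalanced orders is a cyclic Latin square, sketched below. The source does not state which square was used or how orders were assigned, so the cyclic construction, the rotation-based assignment, and the names `latin_square` and `order_for` are assumptions for illustration.

```python
CONDITIONS = [
    "Visual Rhythm", "Visual No Rhythm",
    "Aural Rhythm", "Aural No Rhythm",
    "Visual + Aural Rhythm", "Visual + Aural No Rhythm",
]


def latin_square(items):
    """Cyclic Latin square: row i is the item list rotated left by i."""
    n = len(items)
    return [[items[(i + j) % n] for j in range(n)] for i in range(n)]


def order_for(participant_index, square):
    """Assign orders in rotation so each order is used equally often."""
    return square[participant_index % len(square)]


square = latin_square(CONDITIONS)
for row in square:
    print(row)
```

In such a square every condition appears exactly once in each ordinal position. A cyclic square does not balance which condition immediately precedes which, so it controls position effects more fully than carry-over effects.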
Power analysis
The researchers conducted a power analysis to determine adequate sample size using G*Power 3.1.3 (Faul et al., 2007). Power analyses indicated 54 total participants were necessary in order to detect a medium partial η2 (.25) when α = .05 for a power of .95 with two independent treatment groups using a repeated measures analysis of variance consisting of within and between interactions (Kotrlik et al., 2011).
Analyses
A correct response was operationally defined as the correct digit written in the correct serial position on the response sheet. If a serial position was left blank, it was considered an incorrect response. The researchers assessed total digit recall performance in each of the six conditions. The researchers scored all results and performed data analyses using SPSS 24.0.
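The scoring rule above (correct digit in the correct serial position, blanks scored as incorrect) can be expressed as a short function. The function name and the example answer sheet are hypothetical and are not study data.

```python
def score_response(target, response):
    """Count digits written in the correct serial position.

    A blank (None) is scored as incorrect, per the operational definition.
    """
    return sum(
        1 for t, r in zip(target, response)
        if r is not None and r == t
    )


target = [4, 3, 6, 9, 10, 8, 2, 5, 1]       # example sequence from the text
response = [4, 3, 6, None, 10, 2, 8, 5, 1]  # hypothetical answer sheet
print(score_response(target, response))     # 6
```

In this example the digits 8 and 2 were both recalled but transposed, so neither earns credit under strict serial scoring.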
To determine if there were differences and interactions in the order and gender variables, these variables were included in the original repeated-measures analysis of variance (ANOVA). Order and gender were not significant and did not significantly interact with other variables, F(6, 78) = .560, p = .761, partial η2 = .041. Therefore, the researchers eliminated order and gender from subsequent analyses. To determine if there were significant differences in between-subjects variables (music majors versus nonmusic majors) and within-subjects variables of presentation style (visual versus aural versus visual + aural), and rhythm (rhythm versus no rhythm), a three-way repeated measures ANOVA was performed. Box’s Test of Equality of Error Variances was not significant, p = .951.
Results
The overall three-way interaction between major, presentation styles, and rhythm was not significant, F(2, 116) = 1.303, p = .357, partial η2 = .041. The two-way interactions involving major and presentation styles (F[2,116] = 0.423, p = .656, partial η2 = .007), major and rhythm (F[1,58] = .004, p = .951, partial η2 = .000), and presentation style and rhythm (F[2,116] = .321, p = .726, partial η2 = .006) were not significant. Therefore, the within-subject and between-subject factors did not significantly interact with one another. Table 1 depicts descriptive statistics.
Research question 1: Are there between-group differences between music majors and nonmusic majors concerning recall?
Although nonmusic majors (M = 5.69, SE = 0.24) tended to slightly outperform music majors (M = 5.10, SE = 0.24), this difference was not significant, F(1, 58) = 2.98, p = .092, partial η2 = .048.
Research question 2: Do participants have more accurate recall when they see, hear, or both see and hear the information to be recalled?
There was a significant within-subjects effect concerning presentation style, F(2,116) = 4.316, p = .016, partial η2 = .069. Post hoc pairwise comparisons with Bonferroni adjustments for multiple comparisons indicated a significant difference between the visual and visual + aural presentation style conditions, with the visual + aural presentation style condition having more accurate recall than the visual presentation style condition (mean difference [visual minus visual + aural] = −0.74, p = .023, 95% CI [−1.40, −0.08]).
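For readers who wish to check the reported effect size, partial η2 can be recovered from the F ratio and its degrees of freedom using the standard identity below. This is a worked check on the reported values and was not part of the original analysis.

```latex
\[
\eta_p^2 = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}}
         = \frac{F \cdot df_{\text{effect}}}{F \cdot df_{\text{effect}} + df_{\text{error}}}
\]

% Presentation style: F(2, 116) = 4.316
\[
\eta_p^2 = \frac{4.316 \times 2}{4.316 \times 2 + 116}
         = \frac{8.632}{124.632} \approx .069
\]
```

The result agrees with the partial η2 reported for presentation style.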
Research question 3: Do participants have more accurate recall when presented with information paired with rhythm?
Although the rhythm condition (M = 5.57, SE = 0.18) tended to slightly outperform the no rhythm condition (M = 5.22, SE = 0.21), this difference was not significant, F(1, 58) = 3.629, p = .062, partial η2 = .059.
Research question 4: Do participants have more accurate recall during serial positions of primacy and recency?
The researchers did not analyze recall performance by serial position due to the interactions that typically occur. Rather, means of these data are graphically depicted in Figure 1. Although data varied, participants generally tended to have more accurate recall in primacy and recency positions. Specifically, recall tended to be most accurate during serial positions one and nine and least accurate during middle positions of four, five, and six.
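Descriptive serial position means of the kind plotted in Figure 1 can be computed by tallying accuracy at each position across trials. The sketch below uses invented responses purely for illustration; the function name `position_accuracy` and the data are assumptions, not the study's analysis code or results.

```python
def position_accuracy(targets, responses):
    """Proportion of trials answered correctly at each serial position."""
    n_positions = len(targets[0])
    correct = [0] * n_positions
    for target, response in zip(targets, responses):
        for i in range(n_positions):
            # A missing or blank (None) entry never matches the target digit.
            if i < len(response) and response[i] == target[i]:
                correct[i] += 1
    return [c / len(targets) for c in correct]


# Hypothetical data: two trials on the example sequence from the text.
targets = [[4, 3, 6, 9, 10, 8, 2, 5, 1]] * 2
responses = [
    [4, 3, 6, None, 10, 2, 8, 5, 1],
    [4, 3, None, None, None, 8, 2, 5, 1],
]
print(position_accuracy(targets, responses))
# [1.0, 1.0, 0.5, 0.0, 0.5, 0.5, 0.5, 1.0, 1.0]
```

In this invented example accuracy is highest at the first and last positions and lowest in the middle, the same U-shaped pattern the authors describe.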

Figure 1. Interactions Between Serial Positions and Conditions.
Discussion
Due to the importance of working memory in learning, the purpose of this study was to isolate and determine the effects of presentation styles, rhythm, and participant major on working memory as measured by sequential monosyllabic digit recall performance. There was a significant difference between the visual and visual + aural presentation style conditions, with the visual + aural presentation style condition having more accurate digit recall than the visual presentation style condition. Although rhythm tended to facilitate recall accuracy, there was no significant difference between rhythm and no rhythm conditions. While nonmusic major participants tended to slightly outperform music major participants, this difference was not significant. Participants tended to have higher recall accuracy during primacy and recency serial positions and lower recall accuracy during middle serial positions.
In the current study, the visual + aural presentation style condition significantly outperformed the visual presentation style condition. This result is incongruent with a related investigation wherein the auditory presentation style condition outperformed the visual + auditory presentation style (Silverman & Schwartzberg, 2019). However, the current study was different in that participants only saw numbers on a computer screen but did not see a person delivering the information to be recalled as they did in the Silverman and Schwartzberg (2019) study. Drawing on Cognitive Load Theory (Chandler & Sweller, 1991; Sweller, 1999), perhaps the addition of the person in the Silverman and Schwartzberg (2019) study overloaded the limited storage capacity of working memory (Baddeley, 1998). However, in an unordered recall digit task presented through audio only or through an audio and visual input of a person reciting numbers, Mishra et al. (2013) found that scores in the audio and visual condition were higher than in the audio condition alone. The current study displayed only the data to be recalled and that – paired with the auditory presentation style – may have been the optimal amount of complexity to engage participants in the task without overloading working memory. Based on Cognitive Load Theory (Chandler & Sweller, 1991; Sweller, 1999), this study has implications for online teaching and learning.
In related studies using rhythm to chunk information for immediate serial recall, rhythm conditions typically have more accurate recall than no rhythm conditions (Ee et al., 2015; Frankish, 1985; Hartley et al., 2016; Silverman, 2007, 2010, 2012; Silverman & Schwartzberg, 2014; Tulving & Craik, 2000). In the current study, the rhythm conditions outperformed the no rhythm conditions but this difference did not reach significance. Perhaps this lack of significance highlights that rhythm facilitates chunking (and subsequent recall) regardless of presentation style (i.e., aural, visual, or aural + visual).
Although the difference was not significant, nonmusic majors tended to have slightly more accurate recall than music majors. Previous researchers, however, have found that musicians tend to have more enhanced left hemispheres than nonmusicians (Chan et al., 1998; Ohnishi et al., 2001; Schlaug et al., 1995). Moreover, the results from related recall studies indicated that music majors tend to have more accurate recall than nonmusic majors (Hansen et al., 2013; Silverman, 2010, 2012; Silverman & Schwartzberg, 2014; Talamini et al., 2016; Vandervert, 2015; Yang et al., 2015). The discrepancies between the current study and the existing literature warrant additional empirical investigation. However, this discrepancy may have been influenced by the nonmusic major participants specific to the current study, who were enrolled in an Introduction to Music Therapy class and had still received considerable music training (M = 6.77 years, SD = 4.60).
As depicted in Figure 1, participants tended to have more accurate recall during positions of primacy and recency, which is congruent with existing research indicating increased recall errors as a function of item position (Mishra et al., 2013; Schunk, 2004). This finding has clinical practice implications for working memory as learners may have more difficulty recalling information in middle serial positions. To facilitate learning, clinicians might consider placing the most important material to be learned in primacy and recency positions and less essential material in the middle positions. Moreover, repetition of information in the middle positions may enhance recall.
A limitation of the study concerns the high number of years of music training in the nonmusic major group. In future investigations, researchers might consider using participants without any music training or including years of private music lessons as a potential covariate in analyses. Moreover, consistent operational definitions concerning years of private music instruction must be provided for participants to include music training as a potential covariate. Another limitation of the current study includes the single presentation of the information for immediate recall. While not the case in the current study, learners can often go back and review material in online teaching and learning platforms.
Suggestions for future research include comparing presentation mode delivery formats, such as live versus recorded and live versus computer-based delivery (Kerr & Symons, 2006). These data are important as considerable instruction is occurring in online platforms. Investigators might test the impact of seeing the instructor and compare it to conditions without a visual of the instructor (i.e., the only visual would be the presentation slides). Researchers also might attempt to study potential differences between working memory and long-term memory, as different types of memory may be enhanced via multi-sensory presentation (Chen & Hsieh, 2008; Hulme et al., 1987). A related study measuring participants’ brain activation could potentially determine how visual, auditory, and music-based components influence brain activation and subsequent recall performance. As the results for the visual + aural presentation conditions were incongruent between the current study and the Silverman and Schwartzberg (2019) study, it would be interesting to further isolate and investigate how much visual information it takes to overload working memory. Future investigators could take a more translational approach and use similar recall tests with children or clinical populations who experience memory deficits, such as people who have had a stroke (Leo et al., 2019) or people with Autism Spectrum Disorder (Schwartzberg & Silverman, 2012, 2018, 2019).
Table 1. Descriptive Statistics.
Working memory is consequential for learning social and academic information. As information to be learned is often paired with music, the purpose of this study was to isolate and determine the effects of presentation styles, rhythm, and participant major on working memory as measured by sequential monosyllabic digit recall performance. As participants had most accurate recall during the visual + aural presentation style conditions, it seems that the multi-sensory presentation modes can be effective for teaching information to be immediately recalled as long as they do not contain too much information and overload the limited storage capacity of working memory. Future research to better understand how presentation style and music can influence working memory – as well as learning – is warranted.
