Abstract
Ensemble musicians must navigate a complex sensory environment to produce cohesive performances. The purpose of this study was to examine pulse alignment among performers of different experience levels facing increasingly asynchronous auditory and visual information. Musicians (N = 51) who were current members of large instrumental ensembles watched a video of a conductor outlining a 4/4 pattern while hearing a multivoiced instrumental ensemble soundtrack and were asked to tap the pulse on a tablet-based pad. Each of nine examples was presented in one of three experimental formats: control (steady audio and video), audio (ensemble) accelerating/video (conductor beat pattern) decelerating, and video accelerating/audio decelerating. Rate of pulse change was ±7.5% with initial tempos of 108, 127, and 146 bpm. Data consisted of (1) deviations (ms) from a consistent interonset interval (IOI; steady pulse) and (2) mean deviations from audio (ensemble) or video (conductor) pulse. In the asynchronous conditions, participants broadly adhered to auditory or visual information rather than to a steady rate of pulse. There was no significant difference in information stream preference. Experience was a significant factor in audio information deviations; more experienced performers found audio information to be a more salient reference point, consistent with results reported for less contextualized timekeeping tasks.
In most large ensemble musical contexts, musicians are expected to perform in synchrony with each other. Keller et al. (2014) described ensemble musicians as co-performers who coordinate their body movements to produce synchronous sounds and interlocking patterns, in which separate instrumental parts articulate different but complementary rhythms. To produce a cohesive ensemble sound, ensemble musicians attend to their own actions and those of others while concurrently monitoring the overall integrated ensemble output. These actions include the physical performance of an instrument, listening to the sounds produced by themselves and their co-performers, and watching the movements and expressions of the other members in the ensemble including those of the conductor. Ensemble musicians then evaluate the level of coordination between the various parts, and actively respond to this wealth of information with adjustments to the production of sound. Ensemble performance, therefore, is an active and dynamic process that requires the musician to be self-aware, as well as sensitive to the other performers, and to constantly monitor and adjust their own performance in response to surrounding activity.
Keller et al. (2014) stated, “temporally precise rhythmic interpersonal coordination requires three core cognitive-motor skills: anticipation, attention and adaptation” (p. 2). These skills are influenced by regulatory strategies that include entrainment (the synchronization of movement to externally perceived rhythm) (see Clayton, 2012), prioritization of the focus of attention between self and others, collective adaptation (adjustment of one’s performance), and phase correction. Phase correction is an automatic cognitive process that adjusts the alignment of pulses generated by an internal timekeeper in one individual relative to those generated by another individual.
In much research on group musical timekeeping, the relevant sensory information experienced by the performer—auditory, visual, and motor—is in agreement, therefore giving stable and consistent reference points to support ensemble synchrony. Most ensemble musicians can anecdotally attest that such a situation does not always occur. When multiple cues are available that define a single underlying beat, humans combine the sensory information to estimate the beat structure (Honisch et al., 2016). In other cases, the inherently complex nature of music performance can cause “sensory blocking,” which overrides some senses in favor of others (Fredrickson, 1994). For example, in a highly technical passage of music, a performer can become focused on the physical movements of their hands and body, effectively ignoring what else is being heard, and not “see” anything beyond the music on their stand. This can result in playing out of time with the rest of the ensemble. Or, in a passage where note durations are long and sustained, a situation that is neurologically not optimal for precise timekeeping (Hove, Fairhurst, et al., 2012), the performer may lose the sense of rhythmic subdivision and alter tempo.
In conducted ensembles, there is an expectation that the visual reference point outlined by the conductor is “correct.” This notion is so entrenched in contemporary ensemble performance practice that the instruction to “watch the conductor” is firmly embedded in principles of rehearsal and performance pedagogy. Colley et al. (2020) demonstrated visual information provided by the conductor improved synchronization of multiple performers with the music. They also determined compatible audio-visual cues can improve intentional synchronization. However, when faced with unstable ensemble co-performers who are prone to tempo drift, a musician’s optimal strategy may not involve watching the conductor (Fairhurst et al., 2014). Rather, the performer may adopt a localized leadership role by assuming responsibility for tempo and resisting adaptation to a co-performer’s irregular timing.
Neuroscientific research on perception has found that acquiring and maintaining action-outcome relations engages specific, integrative cognitive mechanisms for perceptual processing of external events (Desantis & Haggard, 2016; Maes et al., 2014). Due to differences in the physiological and neural processes and pathways associated with light and sound energy, sound and vision are processed by the brain at different speeds resulting in shorter reaction times to auditory stimuli than to visual (Repp & Penel, 2002). In the same study, they determined that auditory information was favored over visually presented stimuli in group synchronization tasks. These differences were also observed by Noel et al. (2015) in their examination of synchronization and beat perception. In their 2000 study, Jäncke et al. discovered auditory stimuli induce an internal rhythm that guides movement, whereas visual stimuli do not generate such an internal rhythm.
Thompson et al. (2018) uncovered an asymmetric sensitivity in aural interactions in large groups. They determined that people were more sensitive to events happening in one direction of temporal compression/decompression (before or after their own sound). In their study, which examined elements of group clapping, they found individuals responded more strongly to match neighbor claps that preceded their own clap than those that followed, suggesting a natural tendency to want to “catch up.” This created a multiplicative effect of inter-individual interactions which ultimately caused the overall tempo of the claps to speed up.
Socio-cognitive behaviors employed to maintain or achieve synchrony may also be influenced by experience. Neural networks are strengthened and become more complex through repeated experiences (Margulis, 2014), and expert musicians have been shown to have greater neural development in areas related to self-other constructs, empathy, and interpersonal awareness (Keller et al., 2014). Palmer (1997) stated that in music performance, motor systems construct the information for upcoming movements based on internal concepts of time. These temporal concepts can be trained through repetition; when musicians practice, they establish a connection between a steady pulse and the muscle coordination and sequence of activation required to perform their part. Practice, therefore, reinforces a musician’s understanding and physical sensation of the music and trains the muscles of the body to perform a particular movement in a particular manner. These physical experiences of tempo can then influence ensemble synchrony: in a study of piano duets, partners whose individual solo practices were more similar in tempo synchronized more quickly and better than those who had different concepts of tempo (Loehr & Palmer, 2011).
In ensemble performance, however, physical training is not isolated from visual and auditory informational associations. In studies on piano-piano, piano-violin, and violin-violin duos, Bishop and Goebl (2017) found the highest rate of synchrony between pairs of violinists, likely due to greater experience in ensembles. Goebl and Palmer (2009) determined head and finger movements of pianists performing duets became bigger and more synchronized as audio information was reduced. Although the overall performance accuracy was negatively affected as audio information was decreased, this reliance on visual cues to promote synchronous performance suggests the importance of a visual reference in group music-making, particularly when clear auditory information is lacking. Indeed, removing both visual and auditory information negatively affected a performer’s ability to align with other members of the ensemble, which indicates that motor information is insufficient to inform and maintain pulse in a large ensemble setting (Fredrickson, 1994).
Habets and colleagues (2017) determined that learning and experience resulted in a higher probability of perceiving audio-visual stimuli as simultaneous when an audio-visual combination had previously been encountered compared with non-learned combinations. This difference in perceived synchrony suggests repetition in music performance (i.e., rehearsal) trains the brain to expect a certain congruence of audio-visual information and sequence within multisensory temporal events. These actions may specifically bind the motor/sound information as the expected consequence of an audio-visual cue. This expectation of “how a piece always goes” may interfere with the brain’s ability to interpret and adjust to unexpected asynchrony.
Purpose
Conducted music ensembles are a context in which both auditory and visual information is available to support group synchronization. As such, the introduction of conflicting temporal information between these two information streams may provide insight into the relative salience of each. The purpose of this study was to examine the influence of conflicting auditory and visual information on an individual’s effort to synchronize their performance within the context of the conductor-ensemble-performer relationship, and the degree to which this effort varied according to previous music experience. Given that previous research offers evidence for the importance of repetition and the establishment of expectations of synchronous performance, and that the salience of audio or visual information was influenced by context, we tested for musicians’ responsiveness to audio, visual, or neither stimulus, and analyzed for associations between direction of stimulus change (increase or decrease from a steady tempo) and years of experience. Increasing asynchrony is defined as visual and auditory information becoming concurrently and oppositely farther from synchronous (auditory pulse information increasing while visual pulse information decreases and vice versa).
Our research questions were: (1) Given situations of increasing audio-visual asynchrony, do musicians demonstrate a tendency to align to either an audio or visual pulse, and (2) if so, does the accuracy of preferred pulse alignment vary with years of experience after controlling for direction of change?
Method
To test pulse alignment among instrumental musicians in the presence of asynchronous audio and visual information, we used a repeated-measures within-subject design. Participants (N = 53, 26 females, 27 males) were selected using convenience sampling of musicians in the metropolitan area of a large city in the Pacific Northwest region of the United States. All were adult musicians currently playing either a wind (n = 48), string (n = 2), or percussion (n = 3) instrument in a large, conducted ensemble (both band and orchestra n = 25, band only n = 28). We operationally defined a large ensemble as comprising 25 or more players. Participants ranged in age from 18 to 63 years (M = 24.96, SD = 9.94) and reported from 1 to 53 years of performance experience (M = 13.61, SD = 10.07).
The participants individually viewed a video of a conductor on a monitor while hearing a recorded ensemble through stereo speakers. The conductor faced forward, as would be seen from the perspective of the majority of study participants, and conducted the excerpt heard on the audio. Participants were instructed to tap along with the perceived pulse using a touch pad. This tapping task occurred across two varying audio-visual conditions: one in which audio gradually increased in tempo by 7.5% while video decreased in tempo by 7.5% (A+/V−), and a contrasting condition in which video gradually increased in tempo by 7.5% while audio decreased in tempo by 7.5% (A−/V+). The total offset of 15% is a lag rate consistent with that utilized in previous research on conductor/ensemble synchrony (Meals et al., 2019). In a third condition, unaltered fully synchronous excerpts were used as a control.
To ensure all selected participants could accurately perform the task, we included a screening item prior to the sequence of conductor/ensemble items. The screening item consisted of a MIDI generated audio-only click track followed by a silent video of a conductor conducting a simple, non-expressive pattern at the tempo of 120 bpm. Participants were trained on the use of the tap interface and given an opportunity to practice the tapping motion prior to the start of the screening task. This was to prevent errors introduced through long presses on the touch pad, or multiple taps from “jitters.” Two participants were unable to accurately complete the screening task and their data were subsequently not included in the analyses. All procedures were carried out with the approval of the university’s Institutional Review Board.
Experimental media
Audio
We used three different harmonized excerpts as audio stimuli. Each excerpt featured a five-part polyphonic texture in 4/4 meter adapted from Renaissance dances and folk melodies. Excerpts were 32 counts (8 measures) in length. This length of excerpt was chosen specifically for brevity; we anticipated participant fatigue over the course of the task, especially considering the cognitive demands of asynchronous conditions. It still allowed for common phrase lengths and provided enough data points for our statistical analysis. Excerpts did not include percussion and did not include an ongoing performed pulse (i.e., quarter note ostinato or long strings of eighth notes) among the ensemble parts. We entered the excerpts into the music notation program Finale 25 (see Supplementary Materials online for scores) from which MIDI audio files were generated. MIDI samples included flute, clarinet, alto saxophone, bassoon, trumpet, French horn, trombone, tuba, violin, viola, cello, and double bass, and were chosen to provide a balanced mix of string and wind timbres. Parts and doubling were assigned using traditional voicing practices: flute, clarinet, trumpet, and violin “played” the soprano part; clarinet, alto saxophone, and violin played the alto part; French horn, bassoon, and viola played the tenor part; trombone, bassoon, and cello played the baritone part; and tuba and double bass played the bass part. The starting tempos of the three excerpts were 108, 127, and 146 beats per minute. These tempos were chosen so they remained within Westergaard’s (1975) determined range of “useful tempos” (60–160 bpm) for beat perception in which listeners are able to resist grouping or subdividing beats once altered. Audio excerpts were altered to conform to each of the three conditions: synchronous (control), accelerated (+7.5% of tempo), and decelerated (−7.5% of tempo), resulting in a total of nine excerpts.
Audio used for the study was generated using Adobe Audition to gradually stretch or compress the playback rate over the course of the eight-bar excerpt. The first measure of each excerpt was unaltered to allow participants to establish an initial pulse. This was intended to “trigger” existing performance expectations and behaviors. Measures 2–7 of each excerpt were gradually altered in an evenly distributed manner, ±1.25% per measure. The final measure remained steady at the maximum ±7.5% alteration rate. Reference click tracks were also generated using this same method. Audio music and audio reference click tracks were then combined into a single “mixdown” track to ensure accuracy of the component alignment; this allowed us to determine with full confidence that the audio heard by the participants featured the identical rate of change as those of the reference click tracks that were used in the data analysis. The final audio track used for the stimulus had the reference click track muted—the embedded reference was used for interonset interval (IOI) detection in the analysis phase as its waveform was composed of discrete events (i.e., spikes) rather than the blunted waveforms of sustained sounds.
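The tempo schedule described above can be made concrete with a short sketch. It is illustrative only and rests on two assumptions that the text does not confirm: that the ±7.5% alteration scales the tempo in bpm, and that the change can be approximated as one step per measure (the actual stimuli were stretched gradually in Adobe Audition). The function names are hypothetical.

```python
def measure_tempos(start_bpm, direction, step=0.0125, n_measures=8):
    """Tempo (bpm) of each measure under a stepwise approximation.

    Measure 1 is unaltered, measures 2-7 each add one more `step`,
    and the final measure holds the maximum alteration (6 * step = 7.5%).
    direction: +1 accelerating, -1 decelerating, 0 control.
    """
    tempos = []
    for measure in range(1, n_measures + 1):
        steps_applied = min(measure - 1, 6)  # 0 for measure 1, capped at 6
        tempos.append(start_bpm * (1 + direction * step * steps_applied))
    return tempos


def beat_iois_ms(start_bpm, direction, beats_per_measure=4):
    """Expected interonset interval (ms) for every beat of an excerpt."""
    iois = []
    for bpm in measure_tempos(start_bpm, direction):
        iois.extend([60000.0 / bpm] * beats_per_measure)
    return iois


# Example: the 108 bpm excerpt in the accelerating condition
accelerating = measure_tempos(108, +1)
accelerating_iois = beat_iois_ms(108, +1)
```

Under these assumptions, the 108 bpm excerpt ends at 108 × 1.075 = 116.1 bpm when accelerating, and each excerpt yields 32 expected beat intervals.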
Video
Video media featured one of two experienced conductors (one male and one female, both of whom were active conductors of large bands and orchestras in the study area), each conducting all nine excerpts. Conductors beat a simple four-pattern to the reference click track for each condition: synchronous, accelerating, and decelerating. The reference click tracks were identical to the ones used to build the audio portion of the study media. Conductors were instructed to use the baton only, conduct on the beat (not behind or ahead), keep gestures clean and non-expressive, and maintain a neutral but pleasant facial expression. Video was recorded using a Sony HD handheld video recorder. We examined each conducting video alongside the reference click track to ensure clicks were aligned with the bottom of the gesture. This was a purposeful decision made for ease of analysis. Recognizing the concept of “beat bins” in which the precise perception of a pulse varies with context (Danielsen et al., 2019) and that maximum synchrony within conducted ensembles appears to take place as the baton is moving at “peak velocity” (Luck & Toiviainen, 2006), the accepted performance practice in both wind ensemble and orchestral settings places this point at the deepest part of the gestural curve. We included these concepts in the analysis of the video alignments, expecting to see a consistency of “lag” (Luck & Toiviainen, 2006) across participants’ responses.
We then transferred the video files into Adobe Premiere Pro and paired them with the audio files to create 18 audio/video pairings: 2 conductors × 3 excerpts × 3 conditions (synchronous, A+/V−, and A−/V+). From this set of stimuli, two experimental sets of 9 items were created to control for any potential bias arising from particular conductor/excerpt pairings; each experimental interface began with a different conductor and alternated between conductors across ensuing items.
We set the item order to start and end with an excerpt in the unaltered synchronous condition, with the remaining seven excerpts alternating between the different tempos in such a way that no two starting tempos were sequential, and no condition (i.e., increasing/decreasing video) was consistently associated with the gender of the conductor. The two sets of items were collected into two continuous presentation formats using Apple Keynote. Each item was separated by a 6-second rest break. Approximately half of the participants (n = 29) viewed the female conductor first and the remainder (n = 24) viewed the male conductor first. Order of the tempos and condition was identical within each presentation. We requested participants to tap along with what they perceived to be the pulse of each item on an iPad using the “acoustic drums” setting in GarageBand. Audio feedback from the iPad was muted to ensure participants were only hearing one source of audio information. Tap sessions were recorded using the internal recording setting of the GarageBand app, then downloaded into Audacity for analysis preparation at the conclusion of the test administration.
Data collection
Raw data consisted of timestamps for each participant’s taps. With nine test items, each eight measures in duration, each participant was expected to generate 32 data points (taps) per excerpt. We examined the data to identify dropped beats, excess beats (e.g., an extra tap or taps past the end of the excerpt), or other extraneous responses (e.g., stuttered taps, subdivisions, or taps that did not clearly register on the touch pad). Dropped beats were coded as empty data points to maintain accurate correspondence between participant taps and the audio/video reference clicks. We examined the location and amplitude of taps identified as “stutters” to extrapolate the participant’s intended response. If a clear point was determined, it was then entered as the data point for the corresponding stimulus pulse. If there was no clear point, we entered the point as indeterminate, the value of which did not go toward the final analysis. Extra taps, such as those that clearly indicated a doubling of time (i.e., subdivision) as well as those performed beyond the conclusion of a given excerpt, were flagged and removed from the data set. Of the 51 participants, only two required any significant manipulation. In one case, the participant clearly double-timed one excerpt, and in the other, the participant half-timed the initial excerpt. Data cleaning decisions were informed by our interpretation of the entire data set, real-time observations of the participant while they completed the task, and our expertise in the field. Although “stutter taps” and isolated incidences of subdivision may include information about a participant’s beat perception and sensorimotor entrainment (see Ruiz et al., 2011), the examination of such is outside the scope of this study.
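The coding scheme above (one slot per reference beat, empty slots for dropped beats, extra taps discarded) can be sketched as a simple nearest-tap matcher. This is a hypothetical automated approximation, not the procedure used in the study, which relied on manual inspection and researcher judgment; the function name, the matching window, and the toy timestamps are all assumptions for illustration.

```python
import math


def align_taps(ref_ms, taps_ms, window_ms):
    """For each reference beat, keep the nearest tap within +/- window_ms.

    Beats with no tap inside the window are coded as NaN (a dropped beat).
    Taps that match no beat (subdivisions, taps past the end) are ignored.
    """
    aligned = []
    for ref in ref_ms:
        candidates = [t for t in taps_ms if abs(t - ref) <= window_ms]
        if candidates:
            aligned.append(min(candidates, key=lambda t: abs(t - ref)))
        else:
            aligned.append(math.nan)
    return aligned


# Toy values (made up): the third beat is dropped, 760 ms is a stray
# subdivision tap, and 2600 ms falls past the end of the excerpt.
reference = [0, 500, 1000, 1500, 2000]
taps = [12, 495, 760, 1520, 2010, 2600]
aligned = align_taps(reference, taps, window_ms=125)
```

Keeping the window below half the local beat interval prevents a single tap from being assigned to two adjacent beats. A nearest-tap rule does not reproduce the amplitude-based handling of stutter taps described above.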
Using Adobe Premiere Pro, a single reference click track was extracted from each of the stimulus interfaces, then labeled video and audio, respectively. These complete reference tracks were then run through a Python script to determine exact timestamps of event onsets which then allowed for calculation of interonset intervals (IOI), in milliseconds, between each click within and across all nine excerpts. Using Audacity, we examined each participant’s tap data against the audio and video reference tracks. Knowing that synchronization is worst at the beginning of a piece (Bishop & Goebl, 2017), we looked for a single point within the initial synchronous control excerpt that had the most accurate alignment between both reference tracks. We determined this point to be Excerpt 1: bar 5 beat 4 (event 20). The average deviation across all participants was 23 ms. This “zero-point” functioned as a collective starting point for the analysis. Because Excerpt 1 was a synchronous control, and all tap data were collected without breaks in recording (i.e., one continuous soundtrack was collected for each participant), we believed this was an acceptable means to align reference and response tracks. Participant data and reference tracks were then trimmed prior to running through the Python script to reflect this new starting point, reducing the total number of data points per participant from 288 to 269. We acknowledge this point is likely to be idiosyncratic to these participants and further exploration of this phenomenon is warranted.
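The study's onset-detection script is not reproduced here. As a minimal sketch of how click onsets and IOIs might be extracted from a track made of discrete spikes, the following uses a simple amplitude threshold with a refractory period. The threshold, the refractory value, and the synthetic signal are assumptions chosen for illustration.

```python
import numpy as np


def detect_onsets(signal, sample_rate, threshold=0.5, refractory_ms=100.0):
    """Return onset times (ms) where |signal| reaches the threshold.

    Crossings that fall within `refractory_ms` of the previous onset are
    treated as part of the same click and skipped.
    """
    onsets = []
    last_onset_ms = -np.inf
    for idx in np.flatnonzero(np.abs(signal) >= threshold):
        t_ms = idx * 1000.0 / sample_rate
        if t_ms - last_onset_ms >= refractory_ms:
            onsets.append(t_ms)
            last_onset_ms = t_ms
    return np.array(onsets)


def interonset_intervals(onsets_ms):
    """Differences between successive onset times, in ms."""
    return np.diff(onsets_ms)


# Synthetic click track: three clicks 500 ms apart, sampled at 1 kHz.
sample_rate = 1000
signal = np.zeros(2000)
signal[[100, 600, 1100]] = 1.0
signal[[101, 601]] = 0.8  # decay samples that should not count as new onsets
onsets = detect_onsets(signal, sample_rate)
iois = interonset_intervals(onsets)
```

The same function could in principle be applied to a tap recording, which is consistent with the description of running participant files and reference tracks through one script, although real tap audio would likely need a more robust detector.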
Each trimmed participant file was then run through the same Python onset detector script as the reference tracks to produce a .txt file with the time point of each event (tap) generated by the participant. As each participant track (269 taps) was analyzed against both the audio and video reference tracks, this gave 538 data points per participant. We calculated IOIs for each excerpt for each participant; these data were then entered into Google Sheets and cleaned of transcription or translation errors (i.e., placeholder cells entered as 0). We calculated maximum/minimum values and mean deviations from the reference tracks for each participant, and members of the research team checked these values for outliers as a secondary verification strategy. We analyzed our data using frequentist statistics with a significance threshold of α = .05, using IBM SPSS Statistics (version 27).
Results
Audio and video responders
We conducted an initial analysis of the data to determine preference for audio or visual information, which we interpreted as the information stream (audio, visual) to which participants demonstrated a stronger (more frequent) positive relationship. Once timestamps and IOIs had been generated for the audio and video stimuli streams as well as for participant responses, we sought to examine the relationships between the stimulus and participant data. To do this, we needed to generate values that were both comparable across data sets (audio clicks, video clicks, participant taps) and reflective of the tempo manipulations (audio faster/slower, video faster/slower) within the research design. Thus, we compared actual IOI values (in milliseconds) against the presumed IOI value of a steady, synchronous performance. In other words, we calculated the time between audio, video, and participant beats, then calculated the difference between these time intervals and the intervals we would have expected had the audio and video information remained synchronized at a steady tempo.
Positive DevSync (Deviation from Synchronized) values indicated a shorter than expected interval, that is, a smaller IOI characteristic of a faster or accelerating tempo. Negative DevSync values indicated a longer than expected interval, that is, a larger IOI consistent with a slower or decelerating tempo. Although the location of tapped responses to isochronous musical beats has been found to vary by both participant and musical quality (e.g., tempo, duration), and tends to be slightly behind identified beat centers (Danielsen et al., 2019), in the absence of a specific non-zero predicted value and in the interest of analytical consistency we used a value of 0 as the expected difference between stimulus and participant IOI.
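A brief sketch of the DevSync calculation follows. The sign convention (expected steady IOI minus observed IOI, so that positive values mean faster) is inferred from the description above rather than taken from the study's own spreadsheet formulas, and the function name is hypothetical.

```python
def dev_sync(iois_ms, steady_bpm):
    """Deviation from synchronized: expected steady IOI minus observed IOI.

    Positive values mean the observed interval was shorter than a steady
    pulse would produce (faster); negative values mean it was longer (slower).
    """
    expected_ioi_ms = 60000.0 / steady_bpm
    return [expected_ioi_ms - ioi for ioi in iois_ms]


# At a steady 120 bpm the expected IOI is 500 ms.
example = dev_sync([500.0, 480.0, 520.0], steady_bpm=120)
# example == [0.0, 20.0, -20.0]
```

The same function applies to the audio clicks, the video clicks, and the participant taps, which places all three data sets on a common scale.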
For each participant, we calculated correlations between response DevSync values and the DevSync values for the audio and video stimuli for each of the nine excerpts. We then segregated participants according to the preponderance of the direction of the correlations: Those for whom more than 50% of their responses to six manipulated items correlated positively with the audio stimuli were labeled “Audio Responders” (n = 19) and those for whom the majority of their responses correlated positively with the visual stimuli were labeled “Video Responders” (n = 27). Using a significance threshold of α = .05, distribution of group membership was not significant, χ²(1, N = 46) = 1.39, p > .05. Five participants had an equal number (3) of positive audio and video correlations and were not included in either category.
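The grouping rule and the reported test statistic can be checked with a short sketch. The classification function is a hypothetical rendering of the majority rule described above; the chi-square portion assumes a goodness-of-fit test against equal expected group sizes, which reproduces the reported value from the counts 19 and 27.

```python
import math


def classify_responder(audio_rs, video_rs):
    """Label a participant from per-item correlations with each stream.

    More than half of the items correlating positively with one stream
    gives that label; otherwise the participant has no tendency.
    """
    n_items = len(audio_rs)
    n_audio = sum(r > 0 for r in audio_rs)
    n_video = sum(r > 0 for r in video_rs)
    if n_audio > n_items / 2:
        return "audio"
    if n_video > n_items / 2:
        return "video"
    return "none"


def chi_square_equal_groups(counts):
    """Goodness-of-fit statistic against equal expected frequencies."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def p_value_one_df(statistic):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(statistic / 2.0))


statistic = chi_square_equal_groups([19, 27])  # 32 / 23, about 1.39
p_value = p_value_one_df(statistic)            # well above .05
```

With expected counts of 23 per group, the statistic is (16 + 16) / 23 ≈ 1.39, matching the value reported above.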
We plotted mean DevSync values (ms) for Audio and Video responders alongside the corresponding measures of the audio and video stimuli (Figure 1(b) and (c)). Both groups demonstrated evident and consistent adherence to their preferred information stream. To examine and compare responses in greater detail, we plotted responses of both groups together for each of the six altered excerpts (Figure 2). Overall, an increase in the clarity of response was evident as the temporal gap between the audio and video information increased.

Figure 1. Deviations (in Milliseconds) From Steady Synchronized Tempo. Positive Values Indicate Faster Tempos: (a) IOI Deviations of Video (Conductor) and Audio (Ensemble) Stimuli for Each of Nine Excerpts Arrayed From L to R. Mean Deviations From Synchronized (DevSync) for Participants Categorized as (b) “Video” Responders and (c) “Audio” Responders Plotted Alongside Stimuli Deviations.

Figure 2. Mean Deviation From Synchronized (DevSync) for “Audio” (Gold Solid Line) and “Video” (Dashed Line) Responders by Condition and Excerpt Tempo.
Music experience
To examine whether responses were related to participants’ prior music ensemble experience, mean absolute deviations across all trials were calculated for each participant relative to both audio and video in each manipulation condition: audio increasing, audio decreasing, video increasing, and video decreasing.
Deviation scores were standardized (converted to z scores) prior to calculating mean scores to allow for comparable IOI magnitude scales across the three tempos. Because the audio and video deviation scores were mathematically related (i.e., audio deviations increased/decreased in the same proportion as the magnitude of video decrease/increase), we elected to use the two audio mean standardized absolute deviation scores as outcomes in a pair of multiple linear regression analyses with sequential predictor entry (α = .05). We examined normality, linearity, and homoscedasticity of residuals for each model confirming that linear regression assumptions were met. With 51 participants, results must be interpreted with caution and cannot be generalized beyond this non-random sample.
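One plausible reading of the standardization step is sketched below: absolute deviations are converted to z scores within each starting tempo, across participants, and each participant's z scores are then averaged over the three tempos. The text does not specify the exact grouping used for standardization, so this layout is an assumption, as are the function name and the toy values.

```python
import numpy as np


def mean_standardized_abs_dev(abs_dev_ms):
    """Average z score per participant.

    abs_dev_ms: array of shape (participants, tempos) holding absolute
    deviations in ms. Each tempo column is z-scored across participants
    (sample SD), then the z scores are averaged within participant.
    """
    abs_dev_ms = np.asarray(abs_dev_ms, dtype=float)
    column_means = abs_dev_ms.mean(axis=0)
    column_sds = abs_dev_ms.std(axis=0, ddof=1)
    z_scores = (abs_dev_ms - column_means) / column_sds
    return z_scores.mean(axis=1)


# Toy values (made up): three participants at three tempos.
toy = [[10, 20, 30],
       [20, 40, 60],
       [30, 60, 90]]
scores = mean_standardized_abs_dev(toy)
```

Standardizing before averaging keeps the slowest tempo, which has the largest IOIs in milliseconds, from dominating each participant's mean.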
We viewed prior music ensemble experience as a foundational factor upon which participants completed the present tasks, thus it was entered in Block 1. To account for general tendencies that participants exhibited throughout the research task, in Block 2 we added two effect-coded variables that indicated each participant’s overall response tendency (audio responder, video responder, no tendency), as well as standardized mean absolute deviation for synchronous (unaltered) items as a baseline indicator of general beat-keeping accuracy. Finally, in Block 3 we effect-coded the direction of participants’ mean deviations for the three audio increasing or audio decreasing items matching the direction of the model outcome. The models were as follows:
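Model 1 (Audio increasing):
$$\text{zAbsAudIncrDev} = b_0 + b_1(\text{YearsExp}) + b_2(\text{RespTendEffAudio/Video}) + b_3(\text{RespTendEffNone/Video}) + b_4(\text{zAbsBaseline}) + b_5(\text{AudIncrDirEff})$$
Model 2 (Audio decreasing):
$$\text{zAbsAudDecrDev} = b_0 + b_1(\text{YearsExp}) + b_2(\text{RespTendEffAudio/Video}) + b_3(\text{RespTendEffNone/Video}) + b_4(\text{zAbsBaseline}) + b_5(\text{AudDecrDirEff})$$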
where zAbsAudIncrDev = standardized absolute mean deviations from audio when audio tempo was increased; zAbsAudDecrDev = standardized absolute mean deviations from audio when audio tempo was decreased; YearsExp = years of ensemble experience; RespTendEffAudio/Video = categorical response tendency as more frequently positively correlated with audio (effect-coded as 1) or video (effect-coded as 0) stimuli; RespTendEffNone/Video = categorical response tendency as equally distributed between audio and video (effect-coded as 1) or more frequently positively correlated with video (effect-coded as 0) stimuli; zAbsBaseline = standardized absolute mean deviations on synchronous items (without tempo alteration); AudIncrDirEff = direction of standardized absolute mean deviations for items in which audio tempo was increased (effect-coded −1 = lag, 1 = ahead); AudDecrDirEff = direction of standardized absolute mean deviations for items in which audio tempo was decreased (effect-coded −1 = lag, 1 = ahead).
In the models above, the magnitude of standardized absolute deviations from audio in either the increasing tempo or decreasing tempo conditions (zAbsAudIncrDev or zAbsAudDecrDev, respectively) is equal to the conditional mean (b0), plus the unique effect of years of music ensemble experience (b1), the unique effect of overall response tendency of audio versus video (b2) and no tendency versus video (b3), magnitude of deviation for synchronous items (b4) and the unique effect of direction of deviation from audio increasing or decreasing items (b5).
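The sequential entry of predictors described above can be sketched with ordinary least squares. The analysis itself was run in SPSS; the code below is only an illustration of the block structure, the helper names are hypothetical, and every value is a randomly generated placeholder rather than study data. Categorical predictors are coded 1/0 and the direction variable −1/1, following the variable definitions given with the models.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (coefficients, R^2)."""
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefs
    r2 = 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)
    return coefs, float(r2)


def sequential_fit(blocks, y):
    """Enter predictor blocks cumulatively; return R^2 after each block."""
    entered, r2_steps = [], []
    for block in blocks:
        entered.append(block)
        _, r2 = fit_ols(np.column_stack(entered), y)
        r2_steps.append(r2)
    return r2_steps


# Randomly generated placeholder values (NOT study data), n = 51.
rng = np.random.default_rng(0)
n = 51
years_exp = rng.uniform(1, 53, n)
tendency = rng.choice(["audio", "video", "none"], n)
resp_audio = (tendency == "audio").astype(float)  # 1 = audio, 0 otherwise
resp_none = (tendency == "none").astype(float)    # 1 = no tendency, 0 otherwise
z_baseline = rng.standard_normal(n)
direction = rng.choice([-1.0, 1.0], n)            # -1 = lag, 1 = ahead
z_abs_dev = rng.standard_normal(n)                # placeholder outcome

blocks = [
    years_exp,                                             # Block 1
    np.column_stack([resp_audio, resp_none, z_baseline]),  # Block 2
    direction,                                             # Block 3
]
r2_steps = sequential_fit(blocks, z_abs_dev)
full_coefs, _ = fit_ols(np.column_stack(blocks), z_abs_dev)  # b0 ... b5
```

Because each step nests the previous one, R² cannot decrease as blocks are added; the change in R² between steps is what indicates the contribution of each block.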
Audio increasing
Previous music ensemble experience was not a significant factor related to accuracy of responses to excerpts during which the audio tempo increased. When factoring in baseline accuracy for synchronous items and response tendency for audio versus video in Model 2, participants who, in general, tended to focus on the audio stimuli demonstrated significantly greater accuracy. However, this was mitigated by the direction of their responses to each specific item, a factor we added in the final model, F(5, 45) = 3.76, p ⩽ .01. The final model, overall, accounted for 22% of total variance (Table 1). Participants whose responses to the specific items were ahead of the audio were significantly more accurate with a 200 ms mean decrease in absolute deviation scores.
Multiple Linear Regression for Standardized Absolute Deviations of Responses to (a) Audio Increasing (+) and (b) Audio Decreasing (−) Excerpts.
*p ⩽ .05. **p ⩽ .01. ***p ⩽ .001.
Audio decreasing
Among excerpts in which the audio tempo decreased, previous music ensemble experience emerged as a significant (p ⩽ .05) factor in participants’ accuracy and remained so with the subsequent addition of deviation from synchronous items, audio/video response tendency, and item response direction. Each year of music experience was associated with an 11 ms increase in response deviation. This suggests a resistance to slowing ensemble tempo, or “dragging,” and a reliance on other mechanisms of time keeping than what is being provided by fellow ensemble members. In addition, participants who, in general, tended to focus on audio stimuli were significantly (p ⩽ .001) more accurate by slightly under half a standard deviation (Table 1). Unlike excerpts in which audio tempo increased, specific item response direction for audio decreasing excerpts was not a significant factor; Model 2, in which this variable was not included, emerged as a slightly better fit than the full model, accounting for 35% of total variability, F(4, 46) = 7.76, p ⩽ .001.
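As a rough arithmetic check on the two fit statistics reported in this section, an F value and its degrees of freedom can be converted back into R² and adjusted R². The function names below are hypothetical, and reading the reported 22% and 35% as adjusted R² is an assumption of this sketch rather than something stated in the text.

```python
def r2_from_f(f_value, df_model, df_resid):
    """Recover R^2 from an overall regression F test: R^2 = F*df1 / (F*df1 + df2)."""
    return f_value * df_model / (f_value * df_model + df_resid)

def adjusted_r2(r2, df_model, df_resid):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    n_minus_1 = df_model + df_resid  # total df when the model has an intercept
    return 1 - (1 - r2) * n_minus_1 / df_resid

# Audio increasing, final model: F(5, 45) = 3.76
r2_incr = r2_from_f(3.76, 5, 45)
adj_incr = adjusted_r2(r2_incr, 5, 45)

# Audio decreasing, Model 2: F(4, 46) = 7.76
r2_decr = r2_from_f(7.76, 4, 46)
adj_decr = adjusted_r2(r2_decr, 4, 46)

print(round(adj_incr, 2), round(adj_decr, 2))  # 0.22 0.35
```

Under that assumption, the recovered values agree with the percentages of variance reported for the two models.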
Discussion
In this study, we explored the effects of increasing audio-visual asynchrony on ensemble musicians’ ability to synchronize with others, and the relationship between experience and pulse alignment. Participants were presented with visual information in the form of a video of a conductor and audio information in the form of a winds and strings soundtrack. During test conditions, participants were faced with increasing asynchrony between audio and visual information; in half of the cases the audio sped up while video slowed, and for the other half video sped up and audio slowed.
Although more participants demonstrated greater alignment with the video stimulus, meaning a smaller mean deviation from the visual pulse, the difference in numbers of participants who were more responsive to audio and those who were more responsive to video was not significant. This suggests a relative salience of both streams of information, unsurprising given the multisensory nature of ensemble music performance. Interestingly, when asked what they consciously followed (or thought they followed), 19 participants said they actively followed the audio stream, 26 said they followed the conductor, and 8 said they flipped between streams in response to which was going faster. This relative balance of responses seems to contrast with prior research reporting the primacy of audio information over that of visual information in contexts that require motor response (Hove, Fairhurst, et al., 2012). However, this result does align with the findings of Colley and colleagues (2020) where synchrony was improved when participants could see the conductor and with the convention of “watching the conductor” for pulse information. Participants who aligned more closely with the video pulse appeared to be similar in accuracy to audio responders in adapting to changing tempos, a surprising finding given the inherent variability of visual beat perception, relative differences between auditory and visual neurological processing speeds, the importance of auditory imagery, and anticipatory processes (e.g., Danielsen et al., 2019; Desantis & Haggard, 2016; Keller & Appel, 2010; Keller et al., 2014; Luck & Toiviainen, 2006; Pecenka & Keller, 2009).
We observed faster adaptation and realignment to the audio stimulus in increasing tempo conditions; although we did not include excerpt tempo in the analysis, response patterns suggest that this may have been particularly evident with the fastest condition (146 bpm) (Figure 2). This is consistent with Repp’s (2003) finding that stable synchronization is possible at much faster rates with auditory rather than with visual sequences. This also aligns with Thompson and colleagues’ (2018) research into group synchronization with audio cues and the collective instinct to “catch up.”
Participants in each experimental condition appeared not to align to a particular source of information until such a point as the two source streams (the conductor and the ensemble) presumably became noticeably distinct. Human tolerance to audio-visual binding, that is, the perception of sound and visual events taking place simultaneously, is narrow: binding thresholds, or just noticeable differences, are a mere 65 ms when audio precedes video and 112 ms when sound follows video (Lewkowicz, 1996). Among these participants, when audio stimulus was ahead of the video, the mean response did not appear to align to a particular stream until the difference was approximately 60 ms; when video was ahead of the audio, this was not evident until the difference was approximately 100 ms, consistent with the parameters described above. Research examining responses within, outside, and across these boundaries would provide additional evidence concerning how performers navigate the presence of one or more than one source of beat information.
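The asymmetric binding thresholds cited from Lewkowicz (1996) can be expressed as a simple check on the offset between the two streams. The sketch below is illustrative only: the function name and the sign convention (positive offset means audio precedes video) are assumptions, and a hard cutoff is a simplification of what is in practice a perceptual threshold.

```python
# Just noticeable differences cited in the text (Lewkowicz, 1996).
AUDIO_LEADS_JND_MS = 65.0   # audio precedes video
AUDIO_LAGS_JND_MS = 112.0   # audio follows video

def within_binding_window(audio_lead_ms: float) -> bool:
    """Return True if an audio-video offset falls inside the binding window.

    Assumed convention: positive values mean the audio event comes first,
    negative values mean the audio event comes after the video event.
    """
    if audio_lead_ms >= 0:
        return audio_lead_ms < AUDIO_LEADS_JND_MS
    return -audio_lead_ms < AUDIO_LAGS_JND_MS

# Example offsets in milliseconds.
for offset in (30.0, 80.0, -100.0, -120.0):
    print(offset, within_binding_window(offset))
```

The asymmetry means that an 80 ms offset counts as distinct when the audio leads but would still fall inside the window if the audio trailed the video by the same amount.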
This tendency to “split the difference” between audio and visual information before it was clearly un-bound is also in line with the findings of Honisch and colleagues (2016) in which cue integration resulted in reduced asynchrony variability: participants seemed to optimize their performance to minimize variability in timing errors between external cues and their own taps. However, this resulted in increased variability within an individual’s own movements. This strategy was employed until phases between the cues became so large as to make integration impractical at which point the participant switched strategy (i.e., made a choice to align with audio or video) to minimize their own asynchrony. Pecenka and Keller’s (2011) action simulation theory suggests that individuals draw on their own motor system to run internal forward models that simulate the timing of another individual’s actions and thereby predict its future outcome. Even when ensemble co-performers are “playing badly,” it appears not to disrupt the individual’s ability to synchronize with the conductor and music (Colley et al., 2020).
Of particular interest was the influence of years of experience in conducted ensembles on overall accuracy of pulse alignment. The average years of experience in this sample was 13 years, 8 months with a range of 1 to 53 years and a standard deviation of 10 years, 1 month. Low experience (−1SD) was approximately 3.5 years and high experience (+1SD) was approximately 23.5 years. Years of experience was a significant factor only in the case of decreasing audio stimulus rate. The more experienced participants were less accurate in the positive direction, essentially staying closer to the steady tempo indicated at the outset of each excerpt. This suggests musicians with more experience have developed a stronger sense of internal pulse and sensitivity to motor feedback which is more resistant to dragging (i.e., they were better able to maintain the initial pulse in spite of asynchronous conditions). This sensitivity to and reliance on “muscle memory” has been demonstrated to develop and improve with time (Pecenka & Keller, 2011). Experienced musicians may also employ a form of selective sensory blocking, a variation on the sensory blocking described by Fredrickson (1994), in this case purposefully disregarding conflicting sensory information in favor of what is neurologically primed: muscle memory associated with pre-established pulse, performance practice, and partially symmetrical entrainment (Clayton, 2012) that favors the conductor. We speculate, therefore, that more experienced musicians can quickly identify and combine different streams of sensory information, evaluate if this information meets what is anticipated/predicted, and consciously move between or ignore different sources of pulse information to initially maintain, then adapt their pulse to that of the ensemble (see Keller & Appel, 2010; Keller et al., 2014).
In general, the strongest predictor of overall pulse accuracy was the mean direction of the participants’ taps in the audio conditions: participants who were behind the beat were significantly closer to the reference (i.e., responding quickly to what they hear) than those who were ahead of the beat in increasing audio pulse conditions (i.e., anticipating the beat was less accurate than responding). Although being behind the beat in synchronous auditory conditions may be interpreted as “playing late,” in conditions of unexpected tempo changes, participants who were behind the beat ended up closer to the reference stimulus as tempo sped up. This aligns with the research of Thompson and colleagues (2018) who determined that humans instinctively speed up their pulse action to catch up with an increasing rate of audio stimulus. Likewise, Desantis and Haggard (2016) described shorter reaction times to auditory stimuli than to visual (Repp & Penel, 2002) and Jäncke and colleagues (2000) determined that auditory rhythms induce an internal rhythm that guides movement.
Equally, participants’ taps became farther from the reference when the audio was decreasing in tempo (i.e., maintaining the originally established pulse even as the current pulse becomes noticeably slower). As the experimental conditions were asynchronous and both stimuli were accelerating and decelerating in opposite directions, these results suggest the increased lag time in the audio decreasing condition may be a result of entrainment, cognitive dissonance associated with the experienced asynchrony, and reliance on practiced behaviors, that is, motor memory that is resistant to changing tempo, or the participant taking extra time to sort out the sensory information before making a performance decision (e.g., Fairhurst et al., 2014; Honisch et al., 2016; Hove, Iversen, et al., 2012; Keller & Appel, 2010; Keller et al., 2014).
Although our findings demonstrate that visual information does not elicit the same level of precision in performance as audio information, which is in line with neurocognitive research on audio processing speeds and visual pulse perception (Danielsen et al., 2019; Desantis & Haggard, 2016; Hove, Fairhurst, et al., 2012; Jäncke et al., 2000), they also suggest that the actual presence (conductor) or implicit presence (ensemble) of other individuals supported participants’ abilities to synchronize. This aligns with the findings of Colley and colleagues (2020), who determined that having a common visual cue that contains information about upcoming time intervals can benefit a group, and those of Miyata et al. (2017), who found that shared visual information between performers compensates for individual variations in coordination and promotes cohesive performances. As such, particularly among large groups of performers, audio information may not be maximally effective as the sole source of pulse if the performance is to remain synchronized. This ultimately suggests that conductors may be well served to attend to the manner in which their movements correspond to the musical pulse and, in the case of conductors and educators working with developing ensembles, to promote multisensory perception skills in their musicians that support accurate pulse alignment in changing conditions.
A limitation across much of the extant research into synchrony is that of the (necessary) lab setting. It is difficult to say definitively whether participants in these research settings would respond in identical manners in natural, ecological settings. In our study, we simulated the visual and auditory environment of a large ensemble rehearsal, but not the social one. Sociological research has shown that in cases where individuals within the group are out of sync with the majority of performers (e.g., they are the only ones responding to the conductor), the optimal action for group cohesion is to adjust their behavior to regain synchrony (Collins, 2004). Further study of multisensory focus of attention and corresponding action in ecological settings is warranted.
This data set reflects only the fastest of the intended tempos for the study. Due to a data corruption issue and limitations in finding new participants within the time frame of the study, a full analysis of tempos below 100 bpm was not completed. Repetition of this study with slower tempos will enhance the results and understandings of this study. The limited sample size may have obscured smaller effect sizes. As such, repeating this study with a larger sample size or a between-subjects design may allow for detection of more subtle effects and patterns of behavior. Although the sample did include a sizable variability in age and experience, it was drawn from among players in collegiate wind bands and, as such, likely included players whose type of musical experience and instrumental instruction was similar. These findings cannot be generalized to other populations including those whose music ensemble experiences occurred within different music styles or traditions. Further research into audio-visual asynchrony would be of benefit, particularly with different ensemble types, seating positions, age groups, and instructional experiences, to examine potential variability among mechanisms employed by musicians to navigate the multisensory environment of the large instrumental ensemble. Similarly, an exploration of sensory preference in differing conditions of technical or written difficulty would serve to enhance our understanding of the cognitive strategies utilized by ensemble musicians during live performances.
Supplemental Material
Supplemental material, sj-pdf-1-pom-10.1177_03057356231153064 for Focus of alignment and performance accuracy among wind band musicians in situations of audio-visual asynchrony by Taina Lorenz and Steven J Morrison in Psychology of Music
Footnotes
Acknowledgements
The authors thank Elizabeth McDaniel and Philip Tschopp for their invaluable assistance with the creation of the conducting videos. This research was carried out on the unceded land of the Coast Salish peoples, land which touches the shared waters of all tribes and bands within the Duwamish, Suquamish, Tulalip and Muckleshoot nations.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
