Abstract
BACKGROUND:
The effectiveness of music-based interventions (MI) in autism has been attested for decades. Yet, there has been little empirical investigation of the active ingredients, or processes involved in music-based interventions that differentiate them from other approaches.
OBJECTIVES:
Here, we examined whether two processes, joint engagement and movement, which have previously been studied in isolation, contribute as important active ingredients for the efficacy of music-based interventions.
METHODS:
In two separate analyses, we investigated whether (1) joint engagement with the therapist, measured using a coding scheme verified for reliability, and (2) movement elicited by music-making, measured using a computer-vision technique for quantifying motion, may drive the benefits previously observed in response to MI (but not a controlled non-MI) in children with autism.
RESULTS:
Compared to a non-music control intervention, children and the therapist in MI spent more time in triadic engagement (between child, therapist, and activity) and produced greater movement, with amplitude of motion closely linked to the type of musical instrument.
CONCLUSIONS:
Taken together, these findings provide initial evidence of the active ingredients of music-based interventions in autism.
Introduction
For decades, people have shared the observation that music-based interventions (MI) are effective in improving social communication and motor skills for a range of populations (Alvin, 2000). In the past few years, stronger empirical evidence of this has emerged, with studies showing improved social and sensorimotor skills in neurotypical individuals following music-making with another individual (Jancke, 2009; Kirschner & Tomasello, 2010). These findings have been supported by randomized controlled trials showing improved social communication in neurodevelopmental disorders such as autism (Geretsegger, Elefant, Mössler, & Gold, 2014), as well as in a number of neurological and psychiatric disorders (Särkämö, Altenmüller, Rodríguez-Fornells, & Peretz, 2016; Sihvonen et al., 2017).
Autism is a neurodevelopmental disorder characterized by difficulties with social communication and social interaction, as well as the presence of restricted and/or repetitive behaviors and interests (American Psychological Association, 2013). Autism includes heterogeneous profiles of clinical presentation, etiology and underlying neural connectivity (Muhle, Reed, Stratigos & Veenstra-VanderWeele, 2016). In autism, MI has been shown to be particularly effective because it appeals to the musical interest of many individuals with autism (Sharda et al., 2019) while being inherently interactive, alleviating core social impairments that are defining features of autism (Srinivasan & Bhat, 2013). Musical experiences, such as those found in music-based interventions, significantly influence the development of a wide range of skills and abilities in children with autism, including language and communication (e.g., Gold, Wigram, & Elefant, 2006), social-emotional understanding (e.g., Overy & Molnar-Szakacs, 2009) and motor imitation (e.g., Stephens, 2008), and enhance multimodal linkages within several brain networks (Phillips-Silver, 2009). In a recent randomized controlled trial (RCT) comparing a music-based intervention to a non-music, play-based control intervention, we showed that MI can significantly improve parent-reported social communication skills, as well as auditory-motor brain connectivity in school-age children with autism (Sharda et al., 2018). How are these changes taking place? What are the “active ingredients” or mediators of outcomes (Vivanti, Prior, Williams & Dissanayake, 2014) of music-based interventions that lead to these benefits, over and above other treatments of similar intensity?
Identifying the specific “active ingredients,” that is, the mechanisms or processes through which a treatment exerts its therapeutic changes, has become an important research goal for the development of both autism interventions and music-based interventions (Ballan & Abraham, 2016; Kasari, Freeman, Paparella, Wong, Kwan & Gulsrud, 2005; Vivanti, Kasari, Green, Mandell, Maye & Hudry, 2018). The current paper sought to identify key processes of music-based interventions that contribute to treatment gains in children with autism.
Previous research has indicated that factors such as therapeutic relationship and attitudes, level of engagement, musical reward and movement may mediate the benefits of music-based interventions (Särkämö et al., 2016). For instance, a report by Mössler et al. (2019) demonstrated that individual differences in the therapeutic relationship from an attachment theory perspective, or match between therapist and child’s mode of interacting, were a significant predictor of language and communication outcomes in children with autism. Further, engaging in musical activities has been shown to activate multimodal brain networks including those underlying reward and auditory-motor integration processes, and can lead to structural and functional changes in connectivity, explaining observed cognitive gains in several clinical populations including autism (Särkämö et al., 2016; Sharda et al., 2018).
We investigate two distinct processes, or theoretically-motivated potential ‘active ingredients’, of music-based interventions that may drive positive outcomes, specifically as they relate to individuals with autism. Prior work has indicated that joint engagement (Kim, Wigram, & Gold, 2008), on one hand, and the involvement of movement (Janzen & Thaut, 2018; Phillips-Silver, 2009), on the other, independently play a key role in the efficacy of music-based interventions. However, these investigations have only been conducted in isolation, although it is likely that multiple active ingredients are responsible for the benefits observed. We propose that both of these processes, in concert, may give rise to the positive outcomes of music-based interventions, and thus, we investigate them in parallel within the same dataset. In the current paper, we used the dataset obtained from our previous RCT (Sharda et al., 2018) to investigate how joint engagement (Analysis 1) and movement (Analysis 2) surface during a music-based intervention, compared with a control, non-music play-based intervention. In both cases, intervention was delivered one-on-one by a therapist to a school-age child, using improvisational approaches. Sessions were semi-structured across participants, while adapting activities and interaction to the individual’s needs and developmental level (see Geretsegger et al., 2015; Srinivasan & Bhat, 2013). The interventions were very well matched in format and intervention targets (see RCT Study Design section below), allowing us to address our objective of examining two theoretically-motivated active ingredients of the positive outcomes of music-based interventions. Specifically, we asked whether the processes of 1) joint engagement and 2) movement were evidenced to a greater degree in the music-based intervention than in a control non-music intervention.
While we did not have any specific predictions about changes over time, we also investigated whether engagement and movement increased from earlier to later in the intervention, and we examined whether these time-based outcomes varied depending on the type of intervention.
Analysis 1: Joint engagement in music interventions
There are numerous ways in which music has been described as engaging. The first describes the relationship between an individual and music. Playing and listening to music is widely recognized as a naturally engaging, interest-provoking, and intrinsically motivating and rewarding activity across cultures. Research has shown that engaging in music-making activates the reward networks of the brain (Sihvonen et al., 2017) and can modulate arousal and pleasure. In addition, music interventions allow for full participation and expression by populations with limited verbal abilities, since they do not rely on spoken exchange. These populations may respond to music-based interventions more positively than they do to conventional verbal communication-based therapies (Sharda et al., 2019), and may be able to establish rapport with a therapist more easily through music (Hodges & Haack, 1996). Individuals with autism in particular have shown a strong interest in music (Blackstock, 1978; Thaut, 1987), which has led to the common use of music in therapeutic applications with this population. For instance, parents of 50 out of the 51 children with autism who participated in our RCT (Sharda et al., 2018) indicated that their child responded to music positively, with 36 families indicating that their child was happiest when listening to music.
A second way engagement has been used in the context of music-based intervention is in reference to dyadic relationships, that is, increased interest in and communication with another person, often observed through joint attention behaviors, such as responding to eye gaze for communication bids, or initiating communicative bids using verbal or nonverbal (gaze, gesture) means. For instance, in a case study with children with autism, Carpente (2016) observed increased interest in caregivers following improvisational music therapy. Similarly, Vaiouli, Grimmet, & Ruich (2015) reported increased attention to the therapist’s face and joint attention behaviors in children with autism following a music intervention.
Here, we focus more specifically on a third sense of engagement, the theoretical construct of triadic joint engagement (attention to and engagement with a person while involved in a joint activity; Adamson, Bakeman, & Deckner, 2004), in a paradigm that directly compares a music-based intervention to a control non-music, play-based intervention condition (Kim et al., 2008).
Triadic joint engagement
Joint engagement is defined as triadic engagement involving two individuals and an event. Joint engagement can be seen as the combination of joint attention (the ability to attend to objects and partners) and joint action (a shared activity which involves an object or event; Girolametto, Verbey, & Tannock, 1994). Previous studies have documented a decrement in joint engagement in children with autism. For instance, Adamson, Bakeman, & Deckner (2009) examined joint engagement in toddlers with typical development, Down syndrome or autism. Their findings suggest that compared to both peer groups, children with autism were significantly less likely to coordinate engagement between an activity and social partner (Adamson et al., 2009). Crucially, music-based interventions may have unique properties that facilitate joint engagement in children with autism, that is observable during the intervention sessions. Children with autism who underwent a music-based intervention using improvisational techniques, as well as a control play-based intervention (n = 10, repeated-measures design), displayed longer durations of both eye contact with their therapist while engaged in activities, and turn-taking episodes during MI (Kim et al., 2008). Additional video analysis of sessions from this study (Kim, Wigram, & Gold, 2009) demonstrated that children spontaneously initiated engagement with the therapist more frequently during the music-based than the control play-based intervention, and displayed more joyful affect and sharing of emotional affect with the therapist during music than during the control play-based therapy. This study provided a critical first step in directly comparing music-based intervention to a control intervention with respect to joint engagement. The primary coder in these studies (Kim et al., 2008, 2009) was the first author, who was aware of the study hypotheses.
Given this evidence of the positive effect of music-based interventions, a larger sample size and independent raters blind to study hypotheses are called for to strengthen these findings.
How do we capture joint engagement?
A number of methods are available to measure dyadic interaction and joint engagement (for reviews see Funamoto & Rinaldi, 2015; Leclère et al., 2014). In this analysis, we based our coding on Adamson et al. (2004)’s joint engagement coding scheme given its range of mutually-exclusive engagement states, ranging from object-focused to dyadic and triadic engagement. Theoretical motivation to extend this scheme from reciprocal caregiver interactions in early development to music-based intervention contexts comes from established similarities between the two contexts with respect to attunement, or the interpersonal mirroring of actions in timing, form and intensity, thus regulating sensorimotor and emotional experience (e.g., Malloch & Trevarthen, 2009; Mössler et al., 2019). Adamson et al. (2004)’s state-based scheme was developed for toddlers (18 to 30 months), is sensitive to developmental change (Adamson et al., 2009) and has previously been used with children with autism (Adamson et al., 2009), for whom it has been shown to be sensitive to treatment effects (e.g., Kasari, Gulsrud, Wong, Kwon, & Locke, 2010). We adapted the engagement coding scheme to be appropriate for our music-based intervention setting and for school-age children (See Table 1 for codes employed), similar to Kasari et al. (2010)’s use of macro-categories to examine treatment change. Given the higher developmental level of our sample, we use a four macro-category engagement scheme (Coordinated Joint/Supported Joint/Object/Other) as well as the three macro-category distinction used by Kasari et al. (2010) that combines Coordinated Joint and Supported Joint to get an overall code for these two types of Triadic Joint engagement.
Joint engagement coding scheme (adapted from Adamson et al., 2004)
Note: The behaviours had to occur for a duration of at least 10 seconds in order to be coded for, with the exception of Coordinated Joint which had a minimum duration of 20 seconds due to its higher joint engagement ranking.
To investigate whether triadic joint engagement may serve as an “active ingredient” contributing to the positive outcomes of music-based intervention, we extended previous findings (Kim et al., 2008, 2009) by including a larger sample size of children with autism obtained using RCT methodology and using objective raters. A modified version of Adamson et al.’s (2004) coding scheme was applied in Analysis 1 to evaluate whether music therapy elicits higher levels of joint engagement than does a control non-music play intervention.
Participants
The present analysis included all 51 participants with autism from a prior randomized controlled trial (Sharda et al., 2018). Participants were between 6 and 12 years of age and had a clinical diagnosis of Autism Spectrum Disorder according to the Diagnostic and Statistical Manual of Mental Disorders, Fourth edition (American Psychiatric Association, 1994). See Supplementary materials for details on how diagnosis was established.
Participants were predominantly male (43 male and 8 female) and varied in their language, cognitive and motor functioning. Children were excluded from the RCT if they received music therapy (individual or in a group setting) or music lessons for 1 year or more prior to study intake, or had a hearing disorder and/or neurological disorder. Additional information on participants is provided in Table 2 and in the original report (Sharda et al., 2018).
Additional information on the participants
The data presented here came from video recordings of intervention sessions from our prior RCT (Sharda et al., 2018) which followed school-age children with autism who were randomly assigned to receive either the music-based intervention (MI, n = 26) or a non-music play-based control intervention (nonMI, n = 25) once a week for 8 to 12 weeks. Both interventions involved 45-minute weekly sessions (described in detail below). Both interventions were held at the same music therapy center and were led by the same accredited music therapist using established approaches. The music therapist had substantial experience administering both music and non-music interventions as well as experience working with diverse populations including children with and without autism. The music-based intervention used a child-centric approach and made use of musical instruments, songs and rhythmic cues (Bradt, 2012; Guerrero, Turry, Geller, & Raghavan, 2014; Mössler et al., 2017; Nordoff & Robbins, 2007). For an active comparison, the non-music play-based intervention used structurally matched play-based activities to control for treatment intensity, positive treatment expectancies, therapist support and emotional engagement. For both interventions, activities were selected by a music therapist and M.S. to target the same theoretically-motivated domains of communication, social reciprocity, sensorimotor integration (composed of fine motor skills and multisensory integration) and emotional regulation, with the main difference between the two interventions being the use of music. See Supplementary Materials for details on specific activities in each domain/intervention type.
Forty-five minute sessions followed an identical structure for both groups (MI & nonMI): beginning with an introduction (hello song for MI and greeting for nonMI, followed by the child placing pictograms of activities in their preferred order on a schedule board), followed by the completion of four preselected activities, and a conclusion (clean up and goodbye song for MI and verbal farewell for nonMI). This structure was adapted to participants’ needs and there was no difference in treatment fidelity across groups (See Fig. 1 for structure of each session; See Supplementary Materials for information regarding treatment fidelity).

Session structure used in the RCT. The structure of each intervention for the music and non-music interventions was identical except for the use of music in the introduction and conclusion.
The first, sixth and last (8th–12th session depending on the child) intervention sessions were coded for level of joint engagement by three independent raters who were not involved in the original RCT and were blind to the specific hypotheses of the analysis. Our total dataset included 144 videos from n = 48 participants (sessions from 3 participants with missing data from middle or last sessions were not included). A group of undergraduate research volunteers were given an overview of the recorded intervention sessions and the coding scheme. This group coded the same video independently; the three volunteers who coded the most accurately were selected to be raters. BORIS (Friard & Gamba, 2016), a free software for observational video and audio coding, was employed. Raters were provided with a manual covering BORIS, the coding scheme, and examples showing how to distinguish between codes. The three raters trained for approximately 30 hours, which included coding a training set of 11 videos independently, and then comparing and discussing differences, before moving on to code the dataset.
Approximately a third of the videos were assigned to each of the three raters. Each rater only coded one session for each participant. To blind raters to session number, they were asked to only watch the four activities and to skip the introduction and conclusion, which often contained information or discussion regarding the session number. Coding took place over 5 months, during which time there were 3 intermittent checks where the same session was coded by all three raters and was compared and discussed (4 videos total). Ninety-six of these videos (first and last sessions for each participant, n = 48) are included in the current analysis.
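The allocation constraints described above (three sessions per participant, three raters, each rater seeing a given participant only once) can be illustrated with a simple rotation. This is a minimal sketch only: the source does not report how videos were actually allocated, and the rater labels, session labels, and the `assign` function are hypothetical.

```python
# Hypothetical labels; the source does not name raters or describe the allocation rule.
RATERS = ["rater_A", "rater_B", "rater_C"]
SESSIONS = ["first", "middle", "last"]


def assign(participant_ids):
    """Rotate raters so each rater codes exactly one session per participant."""
    plan = {}
    for i, pid in enumerate(participant_ids):
        for j, session in enumerate(SESSIONS):
            # Shifting by the participant index also spreads session types across raters.
            plan[(pid, session)] = RATERS[(i + j) % len(RATERS)]
    return plan


# 48 participants x 3 sessions = 144 videos, as in the coded dataset.
plan = assign(range(48))
```

Under this rotation each rater receives a third of the videos and never codes the same participant twice, which is one way to satisfy both constraints at once.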
When four activities were delivered in a session, they lasted approximately eight minutes each. Certain activities were systematically not coded for joint engagement due to their turn-taking nature (which did not allow for clear, concurrent joint engagement with the activity; See Table 3 for a list of included and excluded activities).
Activity list used for engagement coding for music and non-music interventions
We examined how the music-based intervention compared with a well-matched non-music intervention with respect to the levels of engagement observed during the sessions. For all reliability and engagement coding analyses, percent time per activity (rather than raw duration) was the unit of measure. Linear mixed-effects models computed in R v.3.3.5 using the lme4 package were used to assess the percentage of time spent in four engagement states (Coordinated Joint, Supported Joint, Object, and Other) as a function of intervention group (MI vs. nonMI) and timepoint (first vs. last session). We also report interactions between intervention group and timepoint. Subject was included as a random factor in all models. Beta estimates, simple effect sizes (mean differences), 95% confidence intervals (CI) and p-values (considered significant at p < 0.05) are reported.
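To make the unit of measure concrete, the following sketch converts coded bouts for a single activity into percent time per engagement state. It is an illustration under stated assumptions: the bout format `(state, start, end)`, the timings, and the function name `percent_time` are invented here and do not reflect the actual BORIS export or the authors' scripts.

```python
from collections import defaultdict

# Hypothetical coded bouts for one activity: (state, start_seconds, end_seconds).
# The state names are the four macro-categories; the timings are invented.
bouts = [
    ("Supported Joint", 0.0, 240.0),
    ("Coordinated Joint", 240.0, 336.0),
    ("Object", 336.0, 432.0),
    ("Other", 432.0, 480.0),
]


def percent_time(bouts):
    """Return {state: percent of total coded time} for one activity."""
    totals = defaultdict(float)
    for state, start, end in bouts:
        totals[state] += end - start
    coded = sum(totals.values())
    if coded == 0:
        return {}
    return {state: 100.0 * secs / coded for state, secs in totals.items()}


shares = percent_time(bouts)
```

Expressing each state as a share of coded time, rather than as raw seconds, keeps activities of different lengths comparable.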
Coding reliability
See Supplementary Materials for details on how we established coding scheme reliability. Our inter-rater reliability results indicated that our adapted version of Adamson et al.’s (2004, 2009) joint engagement coding scheme for toddlers generally showed good levels of reliability when applied to school-age children. Moreover, ICCs for Supported Joint (approximately 0.80) and Coordinated Joint (approximately 0.70) engagement were similar to those reported by Adamson et al. (2004, p. 1178). Depending on the quality of video recordings and developmental level, future work may consider using the three macro-category coding scheme of Kasari et al. (2010) that we present as a secondary analysis.
Results
As seen in Fig. 2, in both intervention groups, the greatest percentage of time was spent in the Supported Joint engagement state (active involvement in the joint activity without acknowledging the therapist), followed by Coordinated Joint (active involvement in the joint activity while also initiating with and acknowledging the therapist), with Object and Other occurring for approximately 15% of the time or less.

Percentage of time spent in different engagement states in the music-based and non-music control interventions.
This four macro-category engagement coding scheme (Coordinated Joint/Supported Joint/Object/Other) was then used to examine group effects based on type of intervention.
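For orientation, the model structure described in the analysis section (fixed effects of intervention group, timepoint, and their interaction, with subject as a random factor) can be written out as follows. This is a sketch that assumes a random intercept per participant; the symbols are introduced here for illustration and the source does not state the contrast coding of the predictors.

```latex
% y_{ijk}: percent time in a given engagement state for participant i,
%          timepoint j, activity k (notation assumed, not from the source)
y_{ijk} = \beta_0
        + \beta_1\,\mathrm{Group}_i
        + \beta_2\,\mathrm{Time}_j
        + \beta_3\,(\mathrm{Group}_i \times \mathrm{Time}_j)
        + u_i + \varepsilon_{ijk},
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2), \quad
\varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)
```

If the two-level predictors were coded as -1 and +1, each coefficient would be roughly half the corresponding mean difference. The reported estimates appear consistent with this (for example, a group estimate of 6.96 alongside a mean difference of 14.079 for Supported Joint), but this is an inference rather than something the source confirms.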
The percentage of time spent in Supported Joint engagement state was significantly higher in the MI group compared to the non-music control group, independent of timepoint (β = 6.96, p = 0.0040). The simple effect size calculated as a mean difference between the MI and nonMI groups in the percentage of time spent in Supported Joint engagement was 14.079 (95% CI = 10.99 to 17.16). Conversely, percent time spent engaged with task-relevant Objects was significantly lower in the MI group compared to the non-music control group (β = –2.25, p < 0.001). The mean difference between the MI and nonMI groups in the percentage of time spent in engagement with the Object was –4.49 (95% CI = –12.24 to 3.25).
There was no significant group difference between the MI and nonMI groups in the percentage of time spent in Coordinated Joint (β = 1.66, p = 0.19), or Other (β = –0.57, p = 0.41) engagement states.
Did engagement differ between the two timepoints and did these timepoint differences vary between music-based intervention and non-music intervention?
The percentage of time spent in the Coordinated Joint engagement state was marginally higher at Timepoint 2 versus Timepoint 1, independent of group (β = –1.98, p = 0.056). The simple effect size calculated as a mean difference between Timepoint 2 and Timepoint 1 in the percentage of time spent in Coordinated Joint engagement was –3.91 (95% CI: –6.99 to –0.83). There was no significant difference between Timepoint 1 and 2 in the percentage of time spent in Supported Joint (β = 1.00, p = 0.59), Object (β = 0.13, p = 0.83), or Other (β = 0.59, p = 0.35) engagement states.
For the Other engagement state, there was a significant interaction between Intervention Group and Timepoint (β = –1.29, p = 0.043), reflecting a slight increase in the percentage of time spent in this engagement state in MI from Timepoint 1 to Timepoint 2, compared to a slight decrease in nonMI from Timepoint 1 to Timepoint 2. The mean difference between MI at Timepoint 1 and Timepoint 2 was –1.42 (95% CI: –12.29 to 9.44), whereas the mean difference between nonMI at Timepoint 1 and Timepoint 2 was 3.78 (95% CI: –7.35 to 14.91). There was no significant interaction for the remaining engagement states: Coordinated Joint (β = –0.63, p = 0.54); Supported Joint (β = 0.16, p = 0.93); Object (β = –0.27, p = 0.66).
How did engagement differ between music-based and non-music interventions when using a three macro-category scheme?
Following Kasari et al. (2010), we next analyzed engagement using a three macro-category coding scheme by combining the two highest forms of engagement, Coordinated Joint and Supported Joint, to reflect the time spent in any Triadic Joint engagement state (i.e., the child was engaged with the therapist in a joint activity). Good to excellent reliability was established for the combined Triadic Joint engagement state (ICC = 0.87; 95% CI = 0.81 to 0.91). Results are shown in Fig. 3. The percentage of time spent in any Triadic Joint engagement state was significantly higher in the MI group compared to the non-music control group (β = 8.69, p < 0.001). The simple effect size calculated as a mean difference between MI and nonMI in the percentage of time spent in Triadic Joint engagement was 17.39 (95% CI: 7.45 –240.16). There was no significant effect of Timepoint (β = –0.90, p = 0.66) or interaction between Intervention Group and Timepoint (β = –0.51, p = 0.80) for the percentage of time spent in Triadic Joint engagement. Results for the Object and Other codes were the same as those found using the four macro-category scheme above.
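The collapse from four to three macro-categories amounts to summing the two joint states. A minimal sketch, using invented percentages and a hypothetical `collapse` helper rather than the study's data or code:

```python
# Mapping from the four macro-categories to the three-category scheme:
# Coordinated Joint and Supported Joint both count as Triadic Joint.
TO_THREE = {
    "Coordinated Joint": "Triadic Joint",
    "Supported Joint": "Triadic Joint",
    "Object": "Object",
    "Other": "Other",
}


def collapse(percent_by_state):
    """Sum percent time over states that share a three-category label."""
    out = {}
    for state, pct in percent_by_state.items():
        key = TO_THREE[state]
        out[key] = out.get(key, 0.0) + pct
    return out


# Invented percentages for one activity, for illustration only.
four = {"Coordinated Joint": 20.0, "Supported Joint": 50.0, "Object": 20.0, "Other": 10.0}
three = collapse(four)
```

Because Object and Other pass through unchanged, only the Triadic Joint outcome differs between the two schemes, which matches the statement that results for those two codes were the same.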

Percentage of time spent in different engagement states in the music-based and non-music control interventions using a three macro-category coding scheme.
Music-based interventions increase joint engagement
In this analysis, our objective was to determine whether joint engagement was observed to a greater degree in a music-based intervention than in a non-music play-based control intervention in school-age children with autism. Participants randomly assigned to MI versus a well-controlled nonMI spent more time in Supported Joint engagement, and less time engaged solely with a task-relevant Object. In addition, when we combined Coordinated Joint and Supported Joint engagement states, we found that children in the music-based intervention group spent significantly more time in any Triadic Joint engagement state compared to the non-music control intervention. Moreover, the effects for Supported Joint, any kind of Triadic Joint, and Object engagement in MI were observed irrespective of timepoint, suggesting that they are inherent to the process of a music-based intervention and do not require time to develop. This reflects the fact that, while in traditional play-based interventions, children can focus solely on objects used in intervention activities, it is more difficult to do this in the context of a music-based intervention, which lends itself to the triadic activity of making music with the therapist. This provides an important advantage, particularly in the case of autism, where individuals have an increased focus on objects and non-social aspects of scenes, especially during interaction (Frazier et al., 2017). Multiple non-exclusive aspects of music-based intervention, such as (a) the intrinsic attraction of music, (b) the interpersonal synchrony and social bonding afforded by joint music making (e.g., Tarr, Launay, & Dunbar, 2014), and (c) relational “musical attunement” practices employed by music therapists (Wigram & Elefant, 2009), likely contribute to this intervention group difference.
Taken together, our findings replicate and extend (using RCT methodology and independent raters blind to hypotheses) previous reports from a small repeated-measures design study (Kim et al., 2008, 2009), indicating that a music-based intervention facilitates joint engagement in children with autism more so than a non-music play-based intervention. These prior findings, in fact, align best with Coordinated Joint engagement in our coding scheme, as the results focused on making eye contact with the therapist while engaged in activities (Kim et al., 2008) or spontaneously initiating interaction with the therapist (Kim et al., 2009). The fact that we observed an increase in Supported Joint and Triadic Joint engagement overall in the music-based intervention, but not specifically Coordinated Joint engagement, may stem from differences in how the data were sampled. While we coded the entire duration of the activities in selected (first and last) intervention sessions using a stationary camera, Kim et al. (2008; 2009) coded two 4-minute samples from four selected sessions (1, 4, 8, 12), where a camera operator kept the child and therapist in view, and all coding was done during joint engagement episodes involving triadic exchanges between child, therapist, and activity. Therefore, our coding was broader, covering all behaviors that occurred during sessions, while Kim et al.’s (2008, 2009) procedure benefited from better-quality recordings that were more focused and only included episodes that would be defined as Supported Joint engagement in our coding scheme.
As previously mentioned, most engagement states did not exhibit effects of Timepoint. The only exception was the percentage of time spent in the Coordinated Joint engagement state, which was marginally higher at Timepoint 2 versus Timepoint 1. This suggests that participants are shifting from following along in the activity to a more triadic form of play, where they actively acknowledge the therapist as they progress from the first to last sessions of either therapy. We observed one significant interaction between Intervention Group and Timepoint for percentage of time spent in the Other engagement state, which slightly increased over time for MI, while it slightly decreased for nonMI. With four different low-occurring codes included in the Other engagement state (Non-task relevant object, Person-only, Unengaged, and Breaks and Exceptions), all with poor inter-rater reliability, interpretation of this finding is limited.
Limitations
A limitation of the current coding scheme is that it is training- and time-intensive to implement, issues shared with other observational coding schemes. Although raters were naïve to the original study (Sharda et al., 2018) and were blind to the hypotheses for this analysis, they could not be blinded to the intervention group, as it was obvious whether or not musical activities were involved, and videos sometimes disclosed information about timepoint. In this RCT, the same therapist delivered all sessions to both groups of participants with the advantage that group differences cannot be attributed to different therapists. While we opted to use a single therapist to reduce the issues caused by heterogeneity in intervention implementation as reported in previous studies (Bieleninik et al., 2017), it is possible that these findings emerged because of the specific therapist’s biases when implementing the two different interventions or that these findings might not generalize across therapists. However, in our coding scheme, differences between joint engagement codes were driven by the child’s contribution (See Table 1), therefore it is less likely that the specific therapist’s implementation of intervention affected our child-based joint engagement ratings. Complementing this, Kim et al. (2008, 2009) employed different therapists for music- and play-based intervention groups, and multiple therapists for each type of intervention. Critically, they found similar effects to those we report, providing evidence that the increase in joint engagement in music-based intervention can indeed generalize across therapists. Future work should explore individual differences amongst therapists in more depth, as emerging research has demonstrated the consequential effects dyadic relationships between child and therapist have on intervention outcomes (Mössler et al., 2019).
Finally, since we evaluated joint attention in the context of therapy sessions, we do not know if the observed effects generalize to other interaction partners or other settings in daily life. Implementing baseline and post-intervention measures of engagement outside of the therapeutic context is a necessary next step in determining the factors that drive the generalizability of intervention gains.
A second limitation concerns inherent differences in activities across the two interventions that may have resulted in our engagement findings. While our design aimed to make the two interventions as structurally similar as possible (see RCT design section), confirmed through fidelity assessment, it is possible that some aspects of the activities were not matched. For instance, the fact that objects were focused on more in the non-music play-based intervention might be a result of the objects simply being more familiar or available to the children in their everyday play, rendering them more attractive and more likely to be engaged with. On the other hand, because children with prior experience with music interventions were excluded, they could have been less familiar with the instruments used in the music-based intervention, leading to reduced object-focused engagement. However, we compared engagement to a baseline and saw no timepoint effects on object-focused engagement. This suggests that even as children became familiar with the objects over the course of the intervention, their engagement with them did not change, and thus that familiarity was unlikely to be contributing to our findings. Further, another difference between activities may lie in the auditory environment; the objects used in the music-based intervention created sound while only a few of the activities in the play-based intervention did (e.g., egg shakers), potentially making the musical objects more engaging. However, while the play-based objects made less sound, by no means was the play-based intervention non-auditory since it involved a great amount of talking between the participant and the interventionist.
The fact that musical sounds and human vocal sounds engage connectivity in different neural networks (Sharda, Midha, Malik, Mukerji & Singh, 2015) suggests that future work is needed to characterize how these two types of auditory stimuli might engage participants differently and how they may be equated across the two types of interventions.
In Analysis 1, we demonstrated that one active ingredient, or process that leads to intervention outcomes, and distinguishes music from well-matched non-music control interventions, is an increased amount of time spent in high levels of joint engagement (Supported Joint and all Triadic Joint states), and a decreased amount of time spent engaged solely with activity-related objects. It has been established in a variety of contexts that time spent in triadic joint engagement is linked to improved communication outcomes. For instance, young children with and without developmental disabilities who spend more time in triadic joint engagement while playing with their parents have better language outcomes over one year (Adamson et al., 2009). More generally, we know that providing responsive language input that is aligned with a child’s focus of attention is a predictor of better long-term social communication and language outcomes, and this is especially the case for children with autism who have delays in spontaneously following others’ attention (Siller & Sigman, 2002; see also Nadig & Bang, 2016 for a review). Given the benefits gained through joint engagement, parent-training interventions have been developed specifically to increase joint engagement in children with autism, and have been shown to be effective in RCT designs (Kasari et al., 2010). Analysis 1 demonstrated increased triadic joint engagement during the process of music-based interventions relative to control non-music interventions. Based on the literature linking joint engagement to better communication outcomes, it is therefore not surprising that our RCT demonstrated increased communication outcomes specific to the music-based intervention group (Sharda et al., 2018). What is intriguing, however, is that the music-based intervention did not focus on language input. 
Indeed, there was less opportunity for verbal interaction in music-based than non-music intervention, since activities involved music-making. Our findings thus show that music-based interventions offer a potent package that is able to improve social communication in an indirect fashion, offering an easy-to-implement alternative to direct training to increase joint attention and joint action (Kasari et al., 2010), which are often impaired in autism. In ongoing analyses, we are examining response-to-intervention in the original RCT, specifically in children with autism with lower versus higher language abilities to explore whether those with lower language are more likely to benefit from music therapy (Crawford et al., 2017). Future work should examine the relationship between the process and outcomes of music-based interventions more directly, using mediation analysis, to further understand the role of joint engagement.
Analysis 2: Movement in music interventions
Music and body movement are closely related. Movement is intrinsic to music - whether listening, singing, or playing different instruments, the movements that accompany these events are inseparable (Phillips-Silver, 2009). The co-occurrence of music and movement plays a central role in social bonding, and may form an integral part of the perceptual, cognitive, and social-emotional experience in music-based interventions (Janzen & Thaut, 2018; Phillips-Silver, 2009). Music-based interventions are interpersonal and multimodal experiences (Ballan & Abraham, 2016), integrating and activating auditory, motor and multimodal regions of the brain (Zatorre, Chen, & Penhune, 2007). Neuroimaging studies have shown that auditory-motor networks are engaged during both music perception and production (Janata & Grafton, 2003; Zatorre et al., 2007). Developmental work demonstrates that these interactions emerge early in infancy (for review, see Phillips-Silver, 2009), with one study finding that infants produced more rhythmic movement to musical and other rhythmical stimuli than to speech, suggesting a predisposition specifically for rhythmic movement to music and metrical sounds (Zentner & Eerola, 2010). These auditory-motor interactions provide continuous auditory feedback (Zatorre et al., 2007), allowing individuals to move to music in an organized fashion, for example, by rhythmically synchronizing with the pulse of music, by either nodding the head, tapping the foot, or moving the whole body in various ways (Leman & Godøy, 2010).
Given that music and movement-based activities promote emotional, social, cognitive, and physical integration (Kolodziejski, Králová, & Hudáková, 2014), it is not surprising that they have been used therapeutically; there is considerable evidence supporting the use of music- and movement-based interventions for motor development in typically developing children and children with autism (for review, see Srinivasan & Bhat, 2013). Following a 2-month music and movement program, typically developing children showed significant improvements in their gross motor skills, compared to a non-music, physical education program (Zachopoulou, Tsapakidou, & Derri, 2004). Although music-based intervention in the context of autism has been shown to be effective (Sharda et al., 2018), studies have typically focused on outcomes related to social communication, social interaction, and emotional skills, all primary features of autism (Geretsegger et al., 2014). However, in addition to these core impairments, individuals with autism also display significant movement atypicalities (Cook, 2016; Janzen & Thaut, 2018; Srinivasan & Bhat, 2013), which have been shown to contribute to social communication impairments and behavioral features of autism (Janzen & Thaut, 2018). In fact, findings from our RCT show that, complementary to behavioral findings of improved social communication, children with autism showed improved auditory-motor brain connectivity after 8–12 weeks of music therapy (Sharda et al., 2018), similar to the effect of musical training in neurotypical populations (Zatorre et al., 2007). Given that music inherently leads to more movement, there may be something about movement specifically that makes music-based interventions more successful relative to other interventions which are play-based. Previous studies have reported that early motor skills predict later language abilities in autism, controlling for general developmental level (Bedford, Pickles, & Lord, 2016).
Thus, interventions that are music-based, where movement is central, may be able to indirectly improve communication outcomes.
In order to better understand the process of music-based interventions, in Analysis 2, we examined the effects of MI and a non-music control intervention on the amount of movement observed in school-age children with autism, as well as their therapist. Although it is often proposed and easily assumed that movement is inherent to music-making, to our knowledge, there is no empirical evidence of how different musical activities compare in the amount of movement they elicit. Knowledge of potential variability between activities in eliciting movement has important implications for their inclusion in intervention planning, in order to maximize motor outcomes. To investigate, we explored the amount of movement elicited by specific activities. Using a video-based optical flow analysis method, whole-body movement amplitude of the child and therapist was calculated separately. Therapist and child movement amplitude patterns were compared across both interventions. We hypothesized that movement amplitude would be greater in the music-based than in the non-music intervention and may therefore be an active ingredient contributing to the success of music-based intervention programs.
Analysis 2: Methods
Participants and clip selection
A sub-group of 34 children with autism (Age Range: 8–12, 28 males) from the music-based and the non-music control groups used in Analysis 1 were selected for subsequent video-based analysis of overall body movement. Children were included if they met the following criteria: 1) The video of their session provided at least a partial view of both the participant’s and the therapist’s motion for at least 1 minute within a given activity (with a maximum of five seconds of crossover, e.g., the therapist’s hand moving across the child’s body). 2) The child and the therapist remained within the same location for the entire 1-minute segment. 3) The child had at least one activity with usable video, with the same activity occurring at two timepoints (early vs. late) over the course of intervention, with the early and late sessions separated by a minimum of 4 sessions/weeks.
In addition to these general criteria, we systematically excluded activities where both the therapist and child were not simultaneously visible due to the camera angle (e.g., piano), where only one person was playing an instrument at a time (e.g., melodica) and where there was no movement (e.g., book-reading). Thus, a total of four activities from both the MI and the nonMI control intervention were included for analysis (See Table 4). It is important to note that for the vast majority of musical activities, the child and therapist were playing the same instrument, with one exception: handheld percussion. In most of these sessions, the child played some kind of handheld percussion, while the therapist played the ukulele.
Activities included in the movement analysis
Using these criteria, 19 children in the MI group and 15 children in the nonMI group could be included in the movement analysis (Table 5). Children in these groups did not differ in their autism symptoms, Performance IQ, motor skills, or the number of therapy sessions in which they participated. Although a significant difference in Verbal IQ between the two groups was found, this difference was not relevant to our analysis of movement.
Demographic and performance variables for the selected sample
a VABS: Vineland Adaptive Behavior Skills. Scores between 12 and 18 estimate performance in the average range.
For each video selected based on the above criteria, 1-minute clips were extracted from each of the early and late sessions. The starting point of each clip was when the following criteria were met: 1) when the therapist and child started playing together following instruction from the therapist, and 2) when the child, the therapist and their instruments/activities were at least partially visible. The overall movement of the child and the therapist was computed with optical flow analysis (OFA) using the software FlowAnalyzer (Barbosa, 2017; https://www.cefala.org/FlowAnalyzer/). OFA is a standard computer-vision technique that infers overall movement by comparing pixel intensities from consecutive frames of a video. Here, the greater the difference in pixel intensities between frames, the greater the amount, or amplitude, of movement (for technical details regarding OFA, see Supplementary Materials).
To examine movement during the MI and nonMI interventions, we identified regions of interest (ROIs) around the therapist and child (See Fig. 4) for each activity at the two timepoints. Motion was calculated from every pixel and summed across the entire ROI using FlowAnalyzer. This provided a measure of overall movement for the therapist and the child during the sessions.
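The idea of summing frame-to-frame change within an ROI can be illustrated with a minimal sketch. This is not the FlowAnalyzer algorithm: it uses plain frame differencing (absolute change in pixel intensity between consecutive frames) as a simplified stand-in, whereas a true optical flow method estimates per-pixel displacement vectors. The function names, the ROI format, and the toy clip are all hypothetical and chosen only for illustration.

```python
import numpy as np


def roi_motion_series(frames, roi):
    """Summed absolute intensity change inside an ROI for each pair of
    consecutive frames (a frame-differencing proxy, not optical flow).

    frames: array of shape (T, H, W) holding grayscale intensities.
    roi: (top, bottom, left, right) pixel bounds; bottom/right are exclusive.
    """
    top, bottom, left, right = roi
    patch = np.asarray(frames, dtype=float)[:, top:bottom, left:right]
    # Per-pixel change between each frame and the frame before it.
    diffs = np.abs(np.diff(patch, axis=0))
    # Sum over every pixel in the ROI: one value per frame transition.
    return diffs.sum(axis=(1, 2))


def mean_motion(frames, roi):
    """Average of the per-transition motion values for one ROI."""
    return float(roi_motion_series(frames, roi).mean())


# Toy clip (assumed data): 3 frames of 6x8 pixels. A bright 2x2 square
# moves one pixel to the right per frame in the left half of the image;
# the right half is static.
frames = np.zeros((3, 6, 8))
for t in range(3):
    frames[t, 2:4, t:t + 2] = 1.0

child_roi = (0, 6, 0, 4)      # left half of the frame
therapist_roi = (0, 6, 4, 8)  # right half of the frame

print(mean_motion(frames, child_roi))
print(mean_motion(frames, therapist_roi))
```

In this toy clip all of the change falls inside the left ROI, so the left ROI yields a positive value and the static right ROI yields zero. Keeping one ROI per person is what allows child and therapist movement to be summarized separately from the same video.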

Regions of Interest (ROIs) identified around the therapist (right) and child (left) in A) Music-based intervention and B) Non-music control intervention.
Linear mixed-effects models estimated in R v.3.3.5 using the lme4 package were used to assess movement amplitude as a function of Intervention group (MI vs. nonMI) and Timepoint (early vs. late session) for the child and the therapist, separately. Beta estimates, simple main effect sizes (mean differences), 95% confidence intervals (CI) and p-values (considered significant at p < 0.05) are reported.
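As a reading aid, a model of this form can be written out explicitly. The fixed effects below follow the description above (Intervention group, Timepoint, and their interaction). The random-effects structure is not specified in the text, so the per-participant random intercept shown here is an illustrative assumption, as is the 0/1 coding of the predictors.

```latex
y_{it} = \beta_0 + \beta_1\,\mathrm{Group}_i + \beta_2\,\mathrm{Time}_t
       + \beta_3\,(\mathrm{Group}_i \times \mathrm{Time}_t) + u_i + \varepsilon_{it},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2), \quad
\varepsilon_{it} \sim \mathcal{N}(0, \sigma^2)
```

Here $y_{it}$ is the movement amplitude for participant $i$ at timepoint $t$, $\mathrm{Group}_i$ is assumed to be coded 0 for nonMI and 1 for MI, and $\mathrm{Time}_t$ is assumed to be coded 0 for the early and 1 for the late session. Under that coding, $\beta_1$ is the group difference at the early session, $\beta_2$ is the early-to-late change in the nonMI group, and $\beta_3$ captures how that change differs in the MI group.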
Analysis 2: Results
Did movement in music-based intervention differ from non-music intervention?
We found that the children’s movement was significantly higher in the MI group compared to the nonMI control group, independent of timepoint (β = 2.21, p = 0.035; Fig. 5). The simple effect size calculated as a mean difference between MI and nonMI groups in movement amplitude of child at baseline was 2.68 pixels/s (95% CI: –2.17 to 7.53). For the therapist, no significant effect of group was found.

Average movement amplitude in music-based and non-music control interventions.
When considering the children’s movements, neither an effect of timepoint nor an interaction between group and timepoint was found. For the therapist’s movements, while no difference in timepoint was observed, a marginal interaction between group and timepoint was found such that the therapist in the MI group displayed greater movement at Timepoint 2 compared to Timepoint 1, with no such increase in the nonMI control group (β = –0.90, p = 0.066). The simple effect size calculated as a mean difference between the two groups in movement amplitude of therapist at baseline was 0.60 pixels/s (95% CI: –3.31 to 4.51).
How did movement vary between different activities during music-based intervention?
To explore the profile of movement for each activity, we conducted a secondary, exploratory analysis of movement for each musical activity included in our movement analysis as a function of timepoint for the child and the therapist. We found high variability in the overall movement amplitude as a function of activity, as would be expected given the typical kinematics of each activity. For example, playing egg shakers resulted in the greatest mean movement amplitude for the child at Timepoint 1 (Mean = 25.75, SD = 14.36 pixels/s), while playing the recorder resulted in the lowest mean movement amplitude for the child at Timepoint 1 (Mean = 8.03, SD = 4.3 pixels/s; see Fig. 6 for example), with no changes between timepoints.

Example movement across time of a child and therapist engaged in a high movement-producing and low movement-producing activity during the music-based intervention.
Playing the djembe also resulted in greater mean movement amplitude for the child at Timepoint 1 (Mean = 21.98; SD = 10.99 pixels/s) than the recorder and handheld percussion, where the amount of movement produced by the child was notably lower. As seen in Fig. 7, handheld percussion, the one activity in which the child and therapist played different instruments, was also the activity in which they differed in amount of movement.

Average movement for child and therapist for all music activities at early and late timepoints.
Music-based interventions result in greater movement
Our second objective was to determine whether movement was observed to a greater degree in a music-based intervention than a non-music play-based intervention. Our analysis revealed a significant difference in the overall amount of movement that was present in the two interventions.
Children with autism who participated in the music-based intervention produced significantly greater movement overall, although the amount of movement produced varied greatly across different musical activities. Further, the therapist showed a marginally significant increase in her overall movement, moving at an amplitude closer to the child’s, over the course of the music-based, but not the non-music, intervention. Our findings of greater overall movement suggest that the process of engaging in MI inherently results in the production of greater movement. Indeed, simply listening to music can lead to spontaneous body movements (e.g., tapping, dancing; Keller, 2009), with musical activities automatically integrating and co-activating auditory, visual, and motor systems simultaneously in both musical perception and production (Ballan & Abraham, 2016; Bangert et al., 2006; Phillips-Silver, 2009).
Another possible explanation for why children with autism might produce greater movement during MI is that music inherently possesses more structure and predictability than non-musical stimuli, allowing children to more easily engage with people and objects. Thus, since music-based interventions naturally embed such predictability, it is possible that an easier learning environment is created for children with autism (Dawson & Osterling, 1997). Further, the creative nature of musical movement allows any movements produced to be valid expressions, leading to more confident participation in music-based activities (Frank & Trevarthen, 2012; Trevarthen & Delafield-Butt, 2013). Since individuals with autism tend to have enhanced perception and cortical response to musical stimuli (Sharda et al., 2015; Stanutz, Wapnick, & Burack, 2012), it is also possible that they are generally more engaged with music-based activities. Understanding what gives rise to greater movement in MI can elucidate the role of movement in the beneficial effects previously reported for music-based interventions (Sharda et al., 2018).
In addition to indicating differences in movement between groups, our analyses also revealed a large variability in how much movement was produced during the different musical activities. Our finding thus suggests that the design of music-based interventions may benefit from selecting activities that maximize the amount of movement potential. While no studies have examined the efficacy of different music-based activities for children with autism, previous studies have shown that movement-based activities resulted in better participation in patients with Alzheimer’s, compared to singing- or rhythm-based activities (Hanson, Gfeller, Woodworth, Swanson, & Garand, 1996). Future studies could examine how music-based interventions may differ in their intervention outcomes as a function of the movement (and instrument) involved.
Limitations
An important point to consider and a limitation of the current study is that interactive settings such as that of MI do not simply involve overall amount of movement, but also coordination of that movement (i.e., how similarly two individuals move when engaged with each other). Indeed, previous work has suggested that greater coordination of interactive movement is related to broad social and communicative success (Fitzpatrick, Diorio, Richardson, & Schmidt, 2013), a key intervention outcome of MI for children with autism (for review, see Srinivasan & Bhat, 2013). However, studies examining coordination in these contexts involve both individuals performing the same task (e.g., swinging pendulums together or rocking chairs together; Fitzpatrick et al., 2013; Marsh et al., 2013). Given that the child and the therapist were not necessarily doing the same activity simultaneously (e.g., child plays handheld percussion while therapist plays ukulele), and given the large variability in overall movement across individual activities in the current study, it was not meaningful to evaluate movement coordination; activities matched for task and for the amount of movement required of the therapist and the child would be needed for such an investigation. Thus, determining whether greater movement coordination is a characteristic feature of MI compared to other interventions, and whether this potential difference might influence therapy outcomes in autism, would be an important research avenue to explore. Understanding these differences in amount of movement and movement coordination would be pivotal in determining the most appropriate activities for the development of even more effective music therapy programs.
As a final note, although our findings show that MI inherently produces greater movement than our non-music control intervention, the efficacy of MI for individuals with autism is usually assessed by improvements in areas other than movement ability itself (e.g., social or cognitive improvements; Hardy & Lagasse, 2013). However, despite movement disturbances being common in autism (Green et al., 2009), no systematic studies have examined the effect of MI on movement behaviour in autism. Previous studies have shown motor improvement in other disorders using MI (e.g., Parkinson’s; de Dreu, van der Wilk, Poppe, Kwakkel, & van Wegen, 2012). In this study, our sample was not impaired in motor abilities (as measured by the VABS). However, given previous work showing motor improvement in other populations as well as the noted co-activation of motor areas during MI (Bengtsson et al., 2009), future research should investigate how music-based interventions may also improve motor outcomes, in addition to previously reported social and communication improvements in autism.
In Analysis 2, we demonstrated, using a novel analysis of movement, that music-based interventions inherently differ from a non-music control intervention, in that greater overall movement is produced as a result of engaging in an intervention involving music-based activities. Thus, production of movement may serve as a process that generates intervention outcomes during music-based intervention, or is an “active ingredient” of MI, leading to its observed positive outcomes, particularly in children with autism. An important caveat to bear in mind, however, is that although our findings suggest a movement-based advantage to music intervention, our sessions were conducted by a single therapist. It is necessary to replicate these findings of increased movement and determine how they generalize to different therapists as well as to inherent variability in the implementation of music-based interventions.
Conclusions
Our findings show that participation in music-based interventions is characterized by greater joint engagement with the therapist and the activity at hand, and greater overall movement. These two “active ingredients” may be the key processes that mediate positive outcomes in such interventions for children with autism. These findings contribute to a new but growing line of research (e.g., Mössler et al., 2019) investigating the specific processes involved in music-based interventions that make them more effective than other treatments of similar intensity. These findings also illustrate that music-based interventions are complex, multimodal interventions which include both affective (intrinsic motivation and reward leading to greater engagement) and movement-related (sensorimotor modulation and interpersonal synchrony) processes, which are important building blocks of social communication. This evidence that music-based intervention has the potential to produce improvements in critical domains should further the development and enhancement of such intervention programs for children with autism, as well as a wide range of other populations.
Conflict of interest
No potential conflict of interest is reported by the authors.
Funding
This work was supported by a McGill University Internal Social Sciences and Humanities Research Council grant to AN.
Footnotes
Acknowledgments
We thank Alexane Doucet, Yuhui Huang and Nadia El Hallaoui for their dedicated work on joint engagement coding. In memoriam of Krista Hyde (4th author) who passed away in 2020.
In this article autism will be used to refer to the diagnosis of Autism Spectrum Disorder.
