Abstract
Social stories are a commonly used intervention practice in early childhood special education. Recent systematic reviews have documented the evidence base for social stories, but findings are mixed. We examined the efficacy of social stories for young children (i.e., 3–5 years) with challenging behavior across 12 single-case studies, which included 30 participants. The What Works Clearinghouse standards for single-case research design were used to evaluate the rigor of studies that included social stories as a primary intervention. For studies meeting standards, we synthesized findings on the efficacy of social stories using meta-analysis techniques and a parametric effect size measure, the log response ratio. Trends in participants’ response to treatment were also explored. Results indicate variability in rigor and efficacy for the use of social stories as an isolated intervention and in combination with other intervention approaches. Additional studies that investigate the efficacy of social stories as a primary intervention are warranted.
Addressing children’s challenging behavior has become a primary focus for practitioners, researchers, and policymakers (Hemmeter & Conroy, 2018; Ostrosky & Sandall, 2013). Limited social skills can result in challenging behavior which negatively impacts many areas of development, including children’s self-confidence, relationships with peers and adults, self-regulation, ability to follow directions, and problem-solving skills (Hemmeter, Ostrosky, & Fox, 2006). Several factors are correlated with increased incidence of challenging behavior, including poor communication skills, delayed social and emotional skills, health issues, and environmental variables (Darling-Churchill & Lippman, 2016; Shonkoff, 2016). Whereas approximately 10% to 15% of typically developing preschoolers exhibit mild to moderate levels of challenging behavior, this percentage is even greater among children from families living in poverty (Powell, Fixsen, Dunlap, Smith, & Fox, 2007). Young children exposed to multiple family risk factors are 2 to 3 times more likely to demonstrate aggression, anxiety and depression, and hyperactivity (Cooper, Masi, & Vick, 2009).
As the number of preschool children who live in poverty increases, more children will enter early childhood programs without the critical skills needed for successful school experiences. In fact, from 1995 to 2011 the percentage of children from low-income families enrolled in public or private preschools increased from 36% to 42% (Burgess, Chien, Morrissey, & Swenson, 2014). The failure to provide adequate social and emotional supports for children is not only costly for young children and their families, but for the community at large. In addition to the possibility of suspension and expulsion, preschoolers with challenging behavior often experience peer rejection and punitive interactions with adults, and they are at greater risk for school failure (Gilliam, 2005; Hemmeter & Conroy, 2018). Alarmingly, challenging behaviors that appear early on in a child’s life are predictive of adolescent delinquency, gang membership, and incarceration (Dodge et al., 2014; Huesmann, Dubow, & Boxer, 2009). Because of the developmental risks for children whose challenging behavior is not addressed early on (Losel & Bender, 2012; Tremblay, 2010), there is a need for interventions that support children who engage in challenging behavior.
Early childhood teachers need evidence-based intervention strategies to address behavioral issues. Social stories (Gray & Garand, 1993) are one type of intervention that has been applied to decrease challenging behaviors and increase prosocial behaviors in young children (e.g., Benish & Bramlett, 2011; Lorimer, Simpson, Myles, & Ganz, 2002; Rhodes, 2014). Several steps are required to create a Social Story™. The process involves identifying a problematic social situation and target behavior, as well as establishing a context for the social situation. This information is gathered from observations of a target child and interviews with caregivers. Social Stories™ include six types of sentences: a) descriptive, which identify the context of the target situation; b) directive, which describe a desired behavior in response to a social cue; c) perspective, which describe reactions or feelings in response to a social situation; d) affirmative, which express the value of a given context or culture; e) control, which provide analogies to promote understanding for the child; and f) cooperative, which include information about who will provide help and how that help will be made available for the child (Sansosti, Powell-Smith, & Kincaid, 2004, p. 195).
Gray and Garand (1993) recommended that a ratio of two to five descriptive, perspective, and/or affirmative sentences be used for every directive sentence in the story. The goal is to describe the social situation and appropriate behaviors, rather than to direct the child about how to behave. Stories should be written at the child’s comprehension level, clarity in print must be maintained, and vocabulary should be appropriate for the child (Gray & Garand, 1993). When stories follow these guidelines, children are able to grasp basic concepts and the social story is relevant to their needs.
To date, more than 15 reviews on the efficacy of social stories have been conducted (Garwood & Van Loan, 2019). Several reviews provided narrative syntheses on the effectiveness of this approach (e.g., Karkhaneh et al., 2010; Rhodes, 2014); others applied a systematic analysis of the literature using a methodological framework for evaluating quality, rigor, and effect size metrics for quantifying the effects of social story interventions (e.g., Karal & Wolfe, 2018; Leaf et al., 2015; Mayton, Menendez, Wheeler, Carter, & Chitiyo, 2013; McGill, Baker, & Busse, 2015; Qi, Barton, Collier, Lin, & Montoya, 2018; Zimmerman & Ledford, 2017). Despite the use of quantitative analyses, previous systematic reviews have reached discrepant conclusions about the efficacy of social stories. For example, several researchers noted uncertainty about the efficacy of social stories for children with autism spectrum disorder (ASD) due to weak treatment effects, confounding factors, inadequate participation, multicomponent interventions, and poor study designs and implementation (Karkhaneh et al., 2010; Sansosti et al., 2004; Test, Richter, Knight, & Spooner, 2011). Others described social story interventions as “questionably effective” due to the variability of intervention outcomes observed based on percentage of nonoverlapping data scores (Kokina & Kern, 2010). However, Karkhaneh et al. (2010) concluded that social stories were beneficial in modifying behaviors among high-functioning children with ASD. Similarly, Rhodes (2014) found evidence for the effectiveness of social stories and Wong et al. (2014) identified this approach as an evidence-based practice. It is important to note that previous systematic reviews have focused on children with ASD rather than solely on young children with behavioral challenges. Qi et al. (2018) examined the effects of social stories using several different overlap measures and their own independent visual analysis and determined that social stories were not evidence based. Using a quantitative analysis of effect sizes, Reynhout and Carter (2006) concluded that there is variation in the efficacy of social stories and that on average they are only marginally effective.
Although researchers have used existing frameworks such as the WWC design standards and evidence criteria (Kratochwill et al., 2013) and the Single Case Analysis and Review Framework (Ledford, Lane, Zimmerman, Chazin, & Ayres, 2016) to examine the quality and rigor of study designs, few studies have applied these methods in combination with parametric effect size measures and meta-analysis techniques. Compared with other techniques for summarizing evidence, parametric effect size measures and quantitative meta-analysis offer several potential benefits (Pustejovsky & Ferron, 2017). These statistical approaches provide a way to summarize findings about the magnitude of functional relations and examine variation in treatment responses across participants and studies—allowing researchers to distinguish consistently effective interventions from ones that produce variable responses across individuals—and to identify characteristics that explain variation in treatment responses.
One challenge for applying meta-analysis methods to synthesize single-case studies is identifying suitable effect sizes for summarizing the magnitude of functional relations. Widely used indices such as the percentage of nonoverlapping data (Scruggs, Mastropieri, & Casto, 1987) and the Tau-U index (Parker, Vannest, Davis, & Sauber, 2011) have shortcomings that make them poorly suited for use in meta-analysis, including lack of comparability across studies that use different measurement procedures (Tarlow, 2017) and unknown sampling distributions (Shadish, Rindskopf, & Hedges, 2008). In this review, we applied a recently developed parametric effect size index called the log response ratio (LRR; Pustejovsky, 2018b), which has several advantages for synthesizing social story intervention studies. Specifically, the LRR is suitable for use with behavioral dependent variables measured through direct observation, which comprised the majority of outcomes in identified studies. Moreover, the LRR is closely related to the concept of percentage change from baseline, an intuitive and readily interpretable way to quantify the magnitude of functional relations.
In summary, the goal of this systematic review was to evaluate the efficacy of social stories to decrease challenging behavior and increase prosocial skills in children below the age of 5 years by (a) assessing the quality of the available evidence using the WWC indicators, (b) synthesizing findings using a parametric effect size and meta-analysis methods that are suitable for behavioral outcomes, and (c) exploring potential moderators of treatment response.
Method
For this review, six online databases were searched (ERIC, Education Full Text, PsycArticles, PsycINFO, EBSCO, and CSA), and keywords included young children, preschool, Social Stories, and scripted stories. Following the online search, a hand search was conducted of the reference lists from key studies (Leaf et al., 2015; Qi et al., 2018; Wong et al., 2014; Zimmerman & Ledford, 2017). Studies had to meet the following criteria to be included in the review: (a) study was published in a peer-reviewed journal between 1995 and 2018; (b) study reported one or more single-case designs (SCDs); (c) social stories were used as the primary intervention to reduce challenging behavior and increase prosocial behavior; (d) at least one intervention participant was below the age of 5 years; (e) child outcome data were presented for at least one measure of challenging behavior; and (f) study was conducted in the United States. Several studies meeting these criteria reported multiple SCDs (e.g., ABAB designs replicated across several participants). PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines were followed when conducting the literature search.
Coding Procedures
Intact SCDs identified for inclusion were coded based on descriptive characteristics, strength of research design, and strength of experimental control. We coded participant demographics (i.e., age, disability, gender, race/ethnicity) as reported in the articles. We also coded study design characteristics, including setting, type of SCD, skills or behaviors targeted, presence of maintenance and generalization phases, and procedural fidelity defined as measurement in at least 33% of sessions and average scores higher than 80% across participants, conditions, and implementers (Barton, Meadan-Kaplansky, & Ledford, 2018).
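The procedural fidelity criterion described above can be illustrated with a minimal sketch. The function name, argument names, and example values are hypothetical and serve only to show how the two thresholds (measurement in at least 33% of sessions and an average score higher than 80%) combine; this is not the coding tool used in the review.

```python
from statistics import mean


def meets_fidelity_criteria(sessions_measured, total_sessions, fidelity_scores):
    """Check the two fidelity thresholds stated in the text (illustrative only).

    sessions_measured: number of sessions in which fidelity was measured
    total_sessions:    total number of sessions
    fidelity_scores:   fidelity percentages (0-100) from the measured sessions
    """
    if total_sessions <= 0 or not fidelity_scores:
        raise ValueError("Need at least one session and one fidelity score.")
    # Percentage of sessions in which fidelity was measured.
    coverage = 100.0 * sessions_measured / total_sessions
    # Both conditions must hold: coverage >= 33% and mean score > 80%.
    return coverage >= 33.0 and mean(fidelity_scores) > 80.0
```

For instance, hypothetical fidelity data collected in 4 of 12 sessions with scores of 90, 95, 100, and 85 would satisfy both thresholds, whereas data collected in 3 of 12 sessions would not.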
The WWC standards (Kratochwill et al., 2013) for SCDs were used to assess (a) strength of research design (i.e., internal validity) and (b) strength of evidence for experimental control (i.e., visual analysis). Only intact SCDs that met WWC research design standards with or without reservations were used in the subsequent meta-analysis. Standards coded to evaluate the research design were type of study design, systematic manipulation of the independent variable, repeated measurement of the dependent variable, interobserver agreement (IOA) reported for more than 20% of data points in each condition, IOA higher than 80%, three attempts to demonstrate a treatment effect, at least three data points per phase, and an overall rating (e.g., meets WWC standards, meets standards with reservations, does not meet standards). Standards coded “yes” or “no” to evaluate evidence of experimental control were stability of baseline, overlapping data points, immediacy of change, consistency of change, evidence of a functional relation, and strength of the functional relation (e.g., no, moderate, or strong evidence).
The first author coded all designs (24 intact SCDs from 12 studies) that met the inclusion criteria. A doctoral student in special education was trained as a reliability coder on the aforementioned WWC standards in accordance with the definitions provided by the WWC. Five of the 12 studies were randomly selected for reliability coding. Reliability was calculated by dividing the total number of agreements by the number of agreements plus disagreements and multiplying by 100. Average agreement calculated across all quality indicators and all studies was 93% (range = 74%–100%). Both coders reviewed all disagreements and reached consensus.
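The agreement calculation described above (agreements divided by agreements plus disagreements, multiplied by 100) can be sketched as follows. The function name and the example counts are hypothetical and are not taken from the review's coding data.

```python
def percent_agreement(agreements: int, disagreements: int) -> float:
    """Point-by-point agreement: agreements / (agreements + disagreements) * 100."""
    total = agreements + disagreements
    if total == 0:
        raise ValueError("At least one coded item is required.")
    return 100.0 * agreements / total
```

With hypothetical counts of 93 agreements and 7 disagreements, the function returns 93.0.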
To calculate effect sizes for the included designs, we used WebPlotDigitizer (Rohatgi, 2018) to extract outcome data from digitized versions of the single-case graphs presented in each article, a process that can yield highly reliable data (Moeyaert, Maggin, & Verkuilen, 2016). Extracted data were organized in an Excel spreadsheet.
Effect Size Calculations
The second author independently calculated parametric effect sizes for each case within each intact SCD, using data from phases that contrasted baseline conditions with a social stories intervention condition in its initial format. Data on the effects of modifications to the intervention were available only for a small subset of participants. As such, we excluded phases that involved modifications and provided a narrative review of the modifications. We also excluded one pair of phases from Burke, Kuhn, and Peterson (2004) because the return-to-baseline phase consisted of a single data point.
For effect size calculations, we used the LRR-increasing form of the LRR (Pustejovsky, 2018b), so that positive values of the effect size correspond to improvements in behavior (i.e., reductions in disruptive behavior and improvements in prosocial behavior). For a behavior where improvement is desirable, the LRR-increasing is defined as
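In its basic form, and setting aside any small-sample bias correction that may be applied in practice, the LRR-increasing can be sketched as the natural logarithm of a ratio of phase means, oriented so that positive values indicate improvement. The symbols below are notation assumed here for illustration: \(\bar{y}_A\) denotes the mean level of the outcome during baseline and \(\bar{y}_B\) the mean level during the social stories intervention phase.

```latex
\[
\mathrm{LRR}_i =
\begin{cases}
\ln\!\left(\bar{y}_B / \bar{y}_A\right) & \text{if an increase in the behavior is desirable (e.g., prosocial behavior),}\\[4pt]
\ln\!\left(\bar{y}_A / \bar{y}_B\right) & \text{if a decrease in the behavior is desirable (e.g., challenging behavior).}
\end{cases}
\]
```

Under this orientation, a value of zero indicates no change in mean level between phases, and larger positive values indicate greater proportional improvement.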
Meta-Analysis
We conducted a meta-analysis of effect sizes from the cases in study designs that met WWC standards with or without reservations. To summarize the distribution of effect sizes across included cases and studies, we used a multilevel meta-analysis model (Pustejovsky, 2018b) that included random effects for studies and for cases nested within studies. We chose not to include random effects for intact designs because only a few studies included multiple intact designs. This model provides estimates of three key quantities: an overall average effect size, a case-level standard deviation (SD), and a study-level SD. The overall average effect size describes the average magnitude of behavior change due to intervention. However, if the effects of social stories vary from case to case, the average effect describes only part of the picture. Estimates of the case-level and study-level SD provide information to fill out that picture, by describing the extent to which the effects vary from case to case (within a study) and across studies. Larger SD estimates indicate that effect sizes are more variable and less consistent.
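A model of this kind can be sketched in a conventional three-level form. The notation is assumed here for illustration and is not taken from the original analysis scripts: \(i\) indexes cases, \(j\) indexes studies, \(\mu\) is the overall average effect size, \(\tau\) is the study-level SD, \(\omega\) is the case-level SD, and \(s_{ij}\) is the estimated standard error of the effect size for case \(i\) in study \(j\).

```latex
\[
\mathrm{LRR}_{ij} = \mu + u_j + v_{ij} + e_{ij},
\qquad
u_j \sim N\!\left(0, \tau^2\right),\quad
v_{ij} \sim N\!\left(0, \omega^2\right),\quad
e_{ij} \sim N\!\left(0, s_{ij}^2\right).
\]
```

In this sketch, \(\mu\), \(\omega\), and \(\tau\) correspond to the three key quantities named above: the overall average effect size, the case-level SD, and the study-level SD.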
We used cluster-robust variance estimation (CRVE) methods (Hedges, Tipton, & Johnson, 2010) with small-sample adjustments (Tipton & Pustejovsky, 2015) to calculate standard errors and confidence intervals (CIs) for overall average effect size estimates, with clustering at the study level. This method is robust to misestimation of the standard errors of individual LRRi estimates, as could occur if there is autocorrelation or a trend in the data series. We report several aids for interpreting the meta-analysis results. First, we translate overall average LRR effect sizes into percentage change terms, using the relationship $\%\,\text{change} = 100\% \times [\exp(\mathrm{LRR}) - 1]$.
Finally, in addition to estimates of overall average effect sizes, we conducted meta-regression analyses to explore whether participant characteristics or study design features explain variation in the magnitude of effect sizes. We examined four potential moderators: participant age, participant diagnosis, whether the interventionist was also the primary data collector, and overall WWC design rating. We report separate meta-regressions for each moderator, pooling across challenging behavior and prosocial behavior due to the small number of studies that include dependent variables of each type. For purposes of examining WWC design ratings as a moderator, we included studies that did not meet WWC standards.
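As a minimal sketch of this translation, assuming the standard relationship between a log response ratio and percentage change, 100 × (exp(LRR) − 1), the conversion can be written as below. The function names are hypothetical. Because LRRi is oriented so that positive values mean improvement, its sign is flipped for outcomes where a decrease is desirable before converting.

```python
import math


def lrr_to_percent_change(lrr: float) -> float:
    """Convert a log response ratio to percentage change from baseline."""
    return 100.0 * (math.exp(lrr) - 1.0)


def lrri_to_percent_change(lrri: float, improvement_is_increase: bool) -> float:
    """Convert an LRR-increasing value to percentage change in the raw behavior.

    For outcomes where improvement means a decrease (e.g., challenging
    behavior), the sign is flipped first so that the result is a (negative)
    percentage change in the behavior itself.
    """
    raw_lrr = lrri if improvement_is_increase else -lrri
    return lrr_to_percent_change(raw_lrr)


if __name__ == "__main__":
    # Challenging behavior: LRRi = 1.22 gives roughly a 70% reduction.
    print(round(lrri_to_percent_change(1.22, improvement_is_increase=False), 1))
    # Prosocial behavior: LRRi = 0.94 gives roughly a 156% improvement.
    print(round(lrri_to_percent_change(0.94, improvement_is_increase=True), 1))
```

Applied to the rounded average estimates reported in the Results, this gives about a 70% reduction for challenging behavior and about a 156% improvement for prosocial behavior; the small difference from the reported 155% presumably reflects rounding of the reported LRRi value.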
All of the analyses were conducted in the R statistical computing environment (version 3.5.1; R Core Team, 2018), using the SingleCaseES package for effect size calculations (Pustejovsky & Swan, 2018), the metafor package for meta-analysis (Viechtbauer, 2010), and the clubSandwich package for robust variance estimation (Pustejovsky, 2017). Raw data and R scripts for replicating the analyses are available at https://bit.ly/2B0BEPp.
Results
Participants
Database searches led to the identification of 257 studies from 1995 to 2018. Two hundred forty-three studies were excluded following title and abstract review. Two additional studies were identified by searching the references from other studies and literature reviews. Twelve studies were examined at the full text level and met the inclusion criteria. The 12 identified studies included 24 intact SCDs and 30 participants. Figure 1 displays a PRISMA diagram (Moher, Liberati, Tetzlaff, & Altman, 2009) that summarizes the screening process.

Figure 1. PRISMA Flow Diagram: Included & Excluded Studies for Literature Review.
Table 1 provides a summary of the 12 studies that met all the inclusion criteria and were assessed using WWC standards. Across the studies, participating children ranged in age from 2:6 to 10:3 years with a mean age of 5:3 years. Twenty-five children were between 3 and 5 years old. Two children were female and 23 were male. Of the participating children, 22 children were identified as having special needs, which included ASD, developmental delay (DD), and specific language impairment (SLI). Seven children were not identified as having special needs. Specifically, two studies included typically developing children (Benish & Bramlett, 2011; Burke et al., 2004), and one study included a participant who exhibited hyperlexia, an advanced reading ability (Soenksen & Alper, 2006). Six research teams reported the ethnicity of their participants, which included Hispanic, Chinese, Caucasian, and African American (Burke et al., 2004; Chan & O’Reilly, 2008; Hsu, Hammond, & Ingalls, 2012; Ivey, Heflin, & Alberto, 2004; Kuoch & Mirenda, 2003; Schneider & Goldstein, 2010).
Table 1. Study Summary, Overall Design Rating, and Strength of Evidence of Experimental Control.
Note. DV = dependent variable; IV = independent variable; YO = years old; M = male; F = female; MBP = multiple baseline across participants; ASD = autism spectrum disorder; SLI = specific language impairment; DD = developmental delay; MBB = multiple baseline across behaviors; MBA = multiple baseline across activities.
Settings
The included studies were conducted in a variety of settings. Seven research teams conducted social story interventions in classrooms (Benish & Bramlett, 2011; Chan & O’Reilly, 2008; Crozier & Tincani, 2007; Hsu et al., 2012; Schneider & Goldstein, 2010; Soenksen & Alper, 2006; Wright & McCathren, 2012). One study was conducted in both home and classroom environments (Kuoch & Mirenda, 2003), whereas another was conducted in both the participants’ home environment and a university research room (Leaf, Oppenheim-Leaf, Call, Sheldon, & Sherman, 2012). Two studies were conducted only in home environments (Burke et al., 2004; Lorimer et al., 2002); one study was conducted in a clinic (Ivey et al., 2004).
Target Skills
Across all studies, the goal was to decrease challenging behaviors (e.g., avoidance, physical aggression, name calling, tantrums, destruction of property, crying, yelling, making negative comments, disruptive bedtime behaviors) and/or increase prosocial behaviors (e.g., raising hand, saying a peer’s name, looking at peer’s face, sitting appropriately during circle time, and following directions).
Maintenance and Generalization
Only one research team reported both generalization and maintenance data (Leaf et al., 2012). Seven studies reported maintenance data.
Multicomponent Interventions
Three studies examined the implementation of social stories as a packaged intervention (Burke et al., 2004; Chan & O’Reilly, 2008; Crozier & Tincani, 2007). For these studies, additional components included verbal prompts, role-play, and positive rewards. Leaf et al. (2012) compared social stories with a teaching interaction procedure and data on these two interventions were reported in separate SCD graphs. For the purpose of this review, only data on the social story intervention were analyzed.
Implementation of Procedural Fidelity
Although 10 research teams discussed procedural fidelity, studies varied in reported procedures and outcomes. Two studies did not report procedural fidelity data (Lorimer et al., 2002; Soenksen & Alper, 2006). Across the 10 studies that reported outcomes, the average procedural fidelity was 99% (range = 96%–100%).
Strength of Research Design
Study design
Eight studies included multiple baselines across participants or behaviors, whereas six included at least one reversal design. Seven studies contained multiple designs. Table 1 reports overall assessments of the strength of research design and evidence of experimental control for each design. Tables S1 and S2 in the online supplementary materials report ratings for the specific criteria that inform these assessments.
Independent and dependent variables
Researchers implemented the independent variable, or intervention, before participants entered the target setting where behavior was observed. For example, an interventionist read the social story to the participants prior to them entering the target setting where challenging behavior typically occurred (Lorimer et al., 2002; Soenksen & Alper, 2006). Researchers also selected a variety of measurement systems to assess dependent variables. For instance, Lorimer et al. (2002) and Soenksen and Alper (2006) used event recording to measure the frequency of occurrence of target behaviors during baseline and intervention phases.
Interobserver Agreement
IOA data were measured for at least 20% of the data points in each condition and were reported to be greater than 80% on average for 11 of the 12 studies. Across the 11 studies, the average IOA was 94.24% (range = 81%–100%). In one study, the authors indicated that researchers were trained on IOA; however, IOA data were not reported (Benish & Bramlett, 2011).
Potential demonstrations of effect
Twenty-three intact designs from 11 studies provided three attempts to assess potential treatment effects.
Data points per phase
All 12 studies included at least one design that had three or more data points per phase, although two studies included at least one design that had fewer than three data points per phase (Burke et al., 2004; Crozier & Tincani, 2007). Neither study highlighted the number of data points per phase as a limitation. Only 4 of the 12 studies included one or more designs that had at least five data points per phase (Crozier & Tincani, 2007; Kuoch & Mirenda, 2003; Schneider & Goldstein, 2010; Wright & McCathren, 2012).
Overall rating
Designs were scored as not meeting minimal standards if they did not adhere to the minimal criteria described earlier. Three studies contained at least one design that met the WWC standards (Crozier & Tincani, 2007; Schneider & Goldstein, 2010; Wright & McCathren, 2012), whereas seven studies contained at least one design that met the standards with reservations. In contrast, five studies contained at least one design that did not meet the WWC standards.
Evidence of Experimental Control
Each study design was evaluated to determine whether a functional relation between the independent variable and an outcome variable was demonstrated. Each design was scored as (a) no evidence if it did not provide at least three demonstrations of an effect, (b) moderate evidence if it provided at least three demonstrations of an effect but at least one demonstration of a noneffect was also evident, or (c) strong evidence if it provided three or more demonstrations of an effect and no evidence of noneffects.
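The three-level scoring rule above can be expressed as a short decision function. This is an illustrative sketch only: the function name and inputs are hypothetical, and in the review the counts of effects and noneffects came from visual analysis rather than from an automated procedure.

```python
def rate_experimental_control(demonstrations_of_effect: int,
                              demonstrations_of_noneffect: int) -> str:
    """Apply the no / moderate / strong evidence rule described in the text."""
    if demonstrations_of_effect < 0 or demonstrations_of_noneffect < 0:
        raise ValueError("Counts cannot be negative.")
    # (a) Fewer than three demonstrations of an effect: no evidence.
    if demonstrations_of_effect < 3:
        return "no evidence"
    # (b) Three or more effects, but at least one noneffect: moderate evidence.
    if demonstrations_of_noneffect >= 1:
        return "moderate evidence"
    # (c) Three or more effects and no noneffects: strong evidence.
    return "strong evidence"
```

For example, a hypothetical design with three demonstrations of an effect and one noneffect would be rated as moderate evidence.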
Stable baseline
Four studies contained at least one design that demonstrated a stable baseline (Crozier & Tincani, 2007; Leaf et al., 2012; Soenksen & Alper, 2006; Wright & McCathren, 2012), with three data points demonstrating minimal variability and a consistent level throughout baseline. Eight studies demonstrated unstable baselines due to variability and inconsistency in the trend of the data points within each phase and across phases. For example, in Chan and O’Reilly (2008), baseline data were highly variable. In addition, prior to introducing the intervention phase, a downward trend in baseline data was evident.
Overlapping data points
All 12 studies contained overlapping data points across at least one phase of the design. In Lorimer et al. (2002), two baseline data points overlapped with two intervention data points. When the baseline phase was repeated, an overlap in data points was still evident. This pattern was also observed in the Crozier and Tincani (2007) study. In fact, for their second participant, data points from baseline to the first intervention phase overlapped significantly, which resulted in no changes in behavior across those phases for this participant.
Immediacy and consistency of change
To determine immediacy of an effect, a change in level between the last three data points of one phase and the first three data points of the next phase should be evident. Seven studies contained at least one design that demonstrated an immediate effect of intervention on target behaviors (Benish & Bramlett, 2011; Burke et al., 2004; Chan & O’Reilly, 2008; Crozier & Tincani, 2007; Kuoch & Mirenda, 2003; Lorimer et al., 2002; Schneider & Goldstein, 2010). For example, Chan and O’Reilly (2008) demonstrated an immediate change in level between the last three baseline data points and the first three intervention data points. However, this change in level was consistent only for the first two tiers of the design. Crozier and Tincani (2007) demonstrated similar results for two participants. An immediate and consistent effect was evident for a third participant following the implementation of a second intervention, verbal prompts. Three of the seven studies failed to demonstrate consistency of change across phases and conditions (Benish & Bramlett, 2011; Kuoch & Mirenda, 2003; Lorimer et al., 2002).
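The immediacy comparison described above (the last three data points of one phase against the first three of the next) can be sketched numerically. The function name and the example series are hypothetical, and using the difference in means is an assumption made here for illustration; in the review, immediacy was judged through visual analysis.

```python
from statistics import mean


def immediacy_of_change(previous_phase, next_phase, k=3):
    """Difference in mean level between the first k points of the next phase
    and the last k points of the previous phase (illustrative sketch)."""
    if len(previous_phase) < k or len(next_phase) < k:
        raise ValueError(f"Each phase needs at least {k} data points.")
    # Compare only the data points adjacent to the phase change.
    return mean(next_phase[:k]) - mean(previous_phase[-k:])
```

With a hypothetical baseline of 8, 9, 10, 9, 8 and a hypothetical intervention phase of 3, 2, 2, 1, the function returns about −6.67, indicating an immediate drop in level when the intervention was introduced.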
Evidence of functional relation
Six studies contained at least one design that demonstrated a functional relation between the independent and dependent variables (Benish & Bramlett, 2011; Burke et al., 2004; Crozier & Tincani, 2007; Lorimer et al., 2002; Schneider & Goldstein, 2010; Soenksen & Alper, 2006), whereas 10 studies did not demonstrate a functional relation for at least one of their designs (Benish & Bramlett, 2011; Burke et al., 2004; Chan & O’Reilly, 2008; Crozier & Tincani, 2007; Hsu et al., 2012; Ivey et al., 2004; Kuoch & Mirenda, 2003; Leaf et al., 2012; Lorimer et al., 2002; Wright & McCathren, 2012); these studies did not meet this criterion because of unstable baselines, overlapping data points across baseline and intervention, and no intervention phase. In addition, an immediate change in the dependent variable was not evident when the independent variable was introduced or changes in the dependent variable were not consistent across repeated phases (i.e., baseline, intervention).
Strength of relation
Six studies contained at least one design that demonstrated moderate evidence of experimental control (Benish & Bramlett, 2011; Burke et al., 2004; Crozier & Tincani, 2007; Lorimer et al., 2002; Schneider & Goldstein, 2010; Soenksen & Alper, 2006), whereas nine studies contained at least one design that did not demonstrate experimental control (Benish & Bramlett, 2011; Burke et al., 2004; Chan & O’Reilly, 2008; Crozier & Tincani, 2007; Hsu et al., 2012; Ivey et al., 2004; Kuoch & Mirenda, 2003; Leaf et al., 2012; Wright & McCathren, 2012). All cases demonstrated moderate evidence for three studies (Lorimer et al., 2002; Schneider & Goldstein, 2010; Soenksen & Alper, 2006) and no evidence for six studies (Chan & O’Reilly, 2008; Hsu et al., 2012; Ivey et al., 2004; Kuoch & Mirenda, 2003; Leaf et al., 2012; Wright & McCathren, 2012). No studies demonstrated strong evidence of experimental control.
Meta-Analysis and Meta-Regression
Table 2 reports the results of the overall meta-analysis of LRRi effect sizes, including estimates of overall average effect sizes, study-level variation, and case-level variation, based on all included studies and cases. For challenging behavior, the average LRRi estimate was 1.22, which corresponds to a reduction of 70% from baseline levels, 95% CI = [–93%, 22%], which was not statistically distinguishable from a null average effect. For prosocial behavior, the average LRRi estimate of 0.94 corresponds to a 155% improvement, 95% CI = [56%, 318%], which was statistically distinguishable from null. However, there was substantial variation in the effects across cases for both types of outcomes, as well as across studies for challenging behavior outcomes. Accounting for case- and study-level variation, a 67% PI for challenging behavior ranges from a 15% to 90% reduction. A 67% PI for prosocial behavior ranges from a 16% reduction (i.e., iatrogenic effect) to a 677% improvement from baseline.
Table 2. Meta-Analysis and Moderator Analysis of LRRi Effect Size Estimates.
Note. LRRi = log response ratio increasing; CI = confidence interval; WWC = What Works Clearinghouse.
We conducted separate meta-regression analyses of four potential moderators, including WWC study design rating, participant age, participant diagnosis, and whether the interventionist was also the primary data collector (see Table 2). None of the four moderators explained a statistically significant degree of variation in the effect size estimates. Although the differences are not statistically distinguishable, it is worth noting that the average effect size estimates were smaller (i.e., less beneficial effects) for (a) participants with diagnosed disabilities, (b) studies that met WWC design standards without reservations, and (c) studies where the interventionist was not also the primary data collector. After controlling for all four moderators using a joint meta-regression, the average LRRi effect sizes for challenging behavior and prosocial behavior were reduced and imprecisely estimated. See Table S3 of the supplementary materials.
Multicomponent Interventions
Similar to studies that implemented social stories in isolation, data on the four studies that implemented social stories as a package were variable in terms of rigor and effectiveness. Although all four studies demonstrated low to medium rigor for at least one of their designs (Burke et al., 2004; Chan & O’Reilly, 2008; Crozier & Tincani, 2007; Schneider & Goldstein, 2010), only two demonstrated high rigor (Crozier & Tincani, 2007; Schneider & Goldstein, 2010).
Discussion
The efficacy of social stories in decreasing challenging behavior and increasing prosocial skills was evaluated by assessing the quality of available evidence using the WWC indicators and synthesizing the findings using a parametric effect size and meta-analysis methods. We also explored trends in participant response to treatment based on participant diagnoses, participant age, WWC design rating, and primary data collector. Overall, results indicate variability in rigor and effectiveness for the use of social stories as an isolated intervention and in combination with other intervention approaches.
For the studies that met the minimal WWC standards, we found that social story interventions for preschoolers had variable effects on challenging behavior and prosocial behavior. Several studies contained at least one design that did not meet minimal WWC standards and did not demonstrate experimental control. It is important to note that although studies may provide evidence of a strong design, they may still fail to demonstrate evidence of a causal relation between independent and dependent variables. Only six studies contained at least one design that met the standards or met them with reservations while also demonstrating experimental control. It is important to interpret these findings with caution, as data were highly variable for five of these six studies (Chan & O’Reilly, 2008; Crozier & Tincani, 2007; Lorimer et al., 2002; Schneider & Goldstein, 2010; Soenksen & Alper, 2006). Hence, the effectiveness of social story interventions is uncertain because several studies did not adhere to the WWC standards, and variability in the data made it difficult to reach a definitive conclusion based on visual analysis.
Parametric effect size calculations estimated an average reduction in challenging behavior and improvement in prosocial skills for participants from baseline levels, although substantial variation was observed across cases for both outcomes. Surprisingly, little of this variation was explained by the four moderator variables examined (i.e., participant diagnosis, participant age, WWC design rating, primary data collector). It is also notable that the average effect size estimates were smaller in magnitude when controlling for these moderators, although none of the individual moderators were statistically significant. That is, studies conducted with high rigor tended to exhibit lower effect size magnitude than studies with low rigor. In addition, average effect size estimates tended to be smaller for participants with disabilities and studies conducted where the interventionist was not also the primary data collector. These trends are worrisome because they suggest potential biases that might impact the integrity of study outcomes. Thus, additional studies conducted with high rigor are needed to get reliable estimates of effect sizes and to understand factors that explain variation in treatment response.
A concern worth highlighting is the presence of multiple-component interventions without an attempt to evaluate the efficacy of social stories in isolation. Whereas eight research teams implemented social stories as a singular intervention, four teams combined social stories with other interventions or instructional methods, such as verbal prompts, rewards, visual schedules, or role-play (Burke et al., 2004; Chan & O’Reilly, 2008; Crozier & Tincani, 2007; Schneider & Goldstein, 2010). Two multicomponent designs met the WWC standards with reservations and provided moderate evidence for experimental control (Burke et al., 2004; Chan & O’Reilly, 2008). Although two additional multicomponent designs did not meet minimal standards due to the strength of their design, they demonstrated moderate evidence of experimental control (Crozier & Tincani, 2007; Schneider & Goldstein, 2010). As such, multicomponent interventions warrant further investigation to consider the additive effect of intervention components on participant behavior. Researchers should attend to the contributions of each intervention component to determine whether behavioral changes are the result of social stories, the impact of other interventions, or a combination of both.
Limitations
This review is one step toward evaluating the efficacy of social stories for young children who engage in challenging behaviors. One limitation of our review is that we focused only on studies in which challenging behavior and prosocial skills were the target behaviors. It is important to examine the effectiveness of social stories when implemented for other behaviors such as adaptive skills and oral communication (Kassardjian et al., 2014; Laprime & Dittrich, 2014; Raver, Bobzien, Richels, Hester, & Anthony, 2013).
Another limitation is that our search strategy did not capture studies conducted outside of the United States or studies that were not peer reviewed (i.e., dissertations). It is possible that findings from such studies systematically differ from published studies, particularly because studies with large, visually apparent effects may be easier to publish than studies with smaller effects or more ambiguous data (Gage, Cook, & Reichow, 2017). If these trends hold in the literature, our findings might overstate the effectiveness of social stories. Future research should invest in searching for and reviewing unpublished studies to mitigate the risk of publication bias.
Future Research
Findings from this review highlight the need for powerful behavioral interventions to impact persistent challenging behaviors. Thus, researchers should examine the use of more intensive and multicomponent interventions with young children in addition to the impact of the number of intervention sessions, or dosage, on child outcomes (Fey, Yoder, Warren, & Bredin-Oja, 2013). Although the method of determining adequate dosage varies across the literature (i.e., number of times or frequency of dose; Parker-McGowan et al., 2014), it is widely understood that interventions are less efficacious when participants do not receive an adequate number of intervention sessions.
It is also important to evaluate how social story interventions conducted in early childhood settings align with contemporary standards for SCDs, as well as to evaluate their effectiveness using robust effect size estimates. Although reviews have been conducted on social story interventions with older children (Karkhaneh et al., 2010; Rhodes, 2014; Sansosti et al., 2004), only three research teams (Qi et al., 2018; Wong et al., 2014; Zimmerman & Ledford, 2017) systematically reviewed how these studies adhere to the WWC standards for SCD. A few reviews reported nonoverlap measures to quantify the effectiveness of social stories as a primary intervention (Kokina & Kern, 2010; Qi et al., 2018; Reynhout & Carter, 2006, 2011; Test et al., 2011), but none used parametric effect sizes or meta-analysis methods to synthesize findings. In future research, we encourage researchers to adopt parametric effect size measures, such as LRRs, that are both interpretable and suitable for the dependent variables used in the literature. Similarly, future research should use meta-analytic models, such as multilevel random-effects models, that not only summarize average effects but also describe variability in effectiveness and examine factors that may explain this variation. In addition, future research should examine the consistency and accuracy with which a social story intervention is implemented for young children. This would involve evaluating modifications made to social story interventions and how such modifications impact challenging behavior. Finally, as discussed earlier, it is important to note that social stories were developed to support children with ASD. Future research should investigate which groups of children and which behaviors are best suited for social story interventions.
Implications for Practice
Many questions remain unanswered as to the efficacy of social stories. Since 1993, social stories have been widely used in a variety of settings to support the social and emotional development of young children, yet only moderate empirical evidence exists to support this intervention. Educators, families, and related service personnel have provided anecdotal evidence for the effectiveness of social stories with young children, but the analyses presented in this article demonstrate that the effectiveness of social stories is variable. Given that studies demonstrated variability in rigor and effectiveness when social stories were implemented in isolation or as a packaged intervention, this practice should not be considered an evidence-based strategy for young children.
Social stories can be implemented in combination with other developmentally appropriate strategies, for researchers have shown that many preventive strategies can be implemented within a tiered model of support and result in decreased levels of challenging behavior (cf. Covington-Smith, Lewis, & Stormont, 2011; Hemmeter, Snyder, Fox, & Algina, 2016). The implementation of promotion and prevention strategies highlights the need for professional development that focuses on helping teachers and support staff acquire additional skills to address challenging behavior. Employing these practices can support young children’s social emotional development, as well as their overall learning and development.
Conclusion
The purpose of this systematic review was to critically evaluate the impact of social stories on young children with challenging behaviors. Results indicate variability in rigor and effectiveness of the use of social stories as an isolated intervention, as well as in combination with other intervention approaches. Thus, social stories cannot be considered an evidence-based practice for young children. Although several reviews recommend that practitioners not use social stories as an intervention (Qi et al., 2018; Zimmerman & Ledford, 2017) and other reviews indicate that social story interventions are mildly effective (Reynhout & Carter, 2011; Rhodes, 2014; Test et al., 2011), the findings from this review highlight the need for additional research to improve our understanding of the efficacy of social story interventions in isolation and in combination with other interventions.
Supplemental Material
Supplemental material, Supplementary_Materials_Revised_4-25 for Examining the Effects of Social Stories™ on Challenging Behavior and Prosocial Skills in Young Children: A Systematic Review and Meta-Analysis by Charis L. Wahman, James E. Pustejovsky, Michaelene M. Ostrosky and Rosa Milagros Santos in Topics in Early Childhood Special Education
Footnotes
Acknowledgements
The authors would like to acknowledge Moon Chung and Danielle Pizzella for their assistance with reliability assessment and data collection.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
This research was supported in part by a leadership grant from the Office of Special Education Programs (Project FOCAL, H325D070061) and by Grant R305D160002 from the Institute of Education Sciences, U.S. Department of Education. The opinions expressed are those of the authors and do not represent the views of the Institute or the U.S. Department of Education.
Supplemental Material
Supplemental material for this article is available on the Topics in Early Childhood Special Education website with the online version of this article.
