Abstract
The purpose of this article is to demonstrate how hierarchical linear modeling (HLM) can be used to enhance visual analysis of single-case research (SCR) designs. First, the authors demonstrated the use of growth modeling via HLM to augment visual analysis of a sophisticated single-case study. Data were used from a delayed multiple baseline design, across groups of participants, with an embedded changing criterion design in a single-case literacy project for students with moderate intellectual disabilities (MoID). Visual analysis revealed a functional relation between instruction and sight-word acquisition for all students. Growth HLM quantified relations at the group level and revealed additional information that included statistically significant variability among students at initial-baseline probe and also among growth trajectories within treatment subphases. Growth HLM showed that receptive vocabulary was a significant predictor of initial knowledge of sight words, and print knowledge significantly predicted growth rates in both treatment subphases. Next, to show the benefits of combining these methodologies to examine a different behavioral topography within a more commonly used SCR design, the authors used repeated-measures HLM and visual analysis to examine simulated data within an ABAB design. Visual analysis revealed a functional relation between a hypothetical intervention (e.g., token reinforcement) and a hypothetical dependent variable (e.g., performance of a target response). HLM supported the existence of a functional relation through tests of statistical significance and detected significant variance among participants’ response to the intervention that would be impossible to identify visually. This study highlights the relevance of these procedures to the identification of evidence-based interventions.
This article is a demonstration of how hierarchical linear modeling (HLM) can be used to augment visual analysis of single-case research (SCR) designs. Our intent was to draw on the strengths of the two complementary research methods (Nugent, 1996) to demonstrate the benefits of using HLM as an additional, inferential-statistical analysis allowing researchers to gain as much information as possible about the effects of two different interventions within special education research. The logic and rationale for our first demonstration developed after we began examining data from an existing and complex SCR design. This study was part of a larger Institute of Education Sciences (IES) research project to develop a comprehensive and integrated literacy curriculum (ILC) for students with moderate to severe disabilities (Alberto & Fredrick, 2007; Grant R324A070144). The primary research question that drove the single-case design was “What effect does Integrated Literacy reading instruction have on sight-word acquisition for students with moderate to severe intellectual disabilities (MSID)?” The unique aspect of this SCR study that warranted additional statistical analyses was the population of students under investigation. Literacy for students with MSID is an area of special education research that is in its nascent stages and blooming with new findings. Researchers are finding that students with MSID are capable of learning literacy skills at a higher level than previously thought, that new instructional procedures are available to effectively teach literacy skills to students with MSID, and that newly developing literacy curricula that incorporate the most effective teaching procedures allow students with MSID to learn to their maximum potential (Alberto, Waugh, & Fredrick, 2010; Browder, Ahlgrim-Delzell, Courtade, Gibbs, & Flowers, 2008; Cohen, Heller, Alberto, & Fredrick, 2008; Davis, 2011; Fredrick, Davis, Alberto, & Waugh, 2011). However, we found that as new findings emerge at a quickening rate, so do new research questions that require additional analyses.
Visual analysis of our single-case study revealed important new findings relevant to literacy for students with MSID and also generated new research questions that needed to be addressed with a different type of research method. A delayed, multiple baseline with an embedded, changing criterion was an ideal SCR design to investigate the effectiveness of the reading intervention for 11 students with MSID. Visual analysis of the single-case design revealed a functional relation between the independent variable (IV) and dependent variable (DV) at the individual level that supported Integrated Literacy as an effective instruction for each of the students. However, the investigators had the following questions that could not be answered through visual analysis of Baseline and Intervention phases alone, but could be examined through additional analyses conducted at the group level. Was the intervention equally effective for all the students in the study and, if not, what individual skills might have accounted for the differences in learning? Did all the students acquire the exact same amount of knowledge? Were there differences in students’ knowledge of sight words at the onset of the study, and if so, was there an individual skill that explained variance among the students’ initial knowledge of sight words? Were the observed changes in learning statistically significant? We implemented growth HLM to examine these questions. Our primary goal was to provide a demonstration of how we combined research methods to answer specific research questions that were generated from an applied study within special education research. Furthermore, our demonstration supports a broader movement in educational research: evidence-based interventions (EBIs).
In an attempt to provide EBIs, an educational policy movement has facilitated heightened attention to combining statistical procedures with educational SCR to produce standardized interpretations for comparing intervention effectiveness (Jenson, Clark, Kircher, & Kristjansson, 2007; Odom et al., 2005; Parker, Vannest, & Davis, 2011; Van den Noortgate & Onghena, 2003, 2008). In an effort to improve the quality of educational research, the Department of Education’s IES founded the What Works Clearinghouse (WWC) to establish standards for the highest quality research methods and to evaluate educational research studies that meet those standards (WWC, 2008). The WWC uses causation as the primary determinant of an effective intervention. The standards for measuring the effectiveness of experimental group studies are based on the degree of adherence to randomized controlled trials (RCTs). The WWC looks for the demonstration of causation in SCR studies also, but because SCR is often conducted with low-incidence populations, random assignment to groups in which the participants must be equivalent in characteristics is highly unlikely. Instead, causation is demonstrated when an individual participant, or small group of participants, acts as his or her own control and a functional relation is observed (Kazdin, 1982). Thus, group and single-case designs can demonstrate causation and meet rigorous standards even though the research methods and associated participants are fundamentally different. Even so, researchers continue to explore the use of statistical analyses in SCR because of the compelling benefits of producing standardized interpretations of results across special education to identify EBIs (Kratochwill & Stoiber, 2000; D. M. White, Rusch, Kazdin, & Hartmann, 1989).
Another reason researchers continue to promote the supplementation of visual analysis with statistical procedures is to provide more sensitive measures for the data. Visual analysis has been shown to be an effective method of examining the graphed data of individuals for patterns indicative of the presence or absence of change in behavior/learning as a result of individualized treatments (Kazdin, 1982). Traditional visual analysis involves comparisons of means, medians, and percentages within and across study phases. In addition, visual analysis of a graphic presentation provides a detailed view of participant performance across instructional or treatment sessions within a sequence of phases, demonstrating a relationship between an intervention and its effect on an individual’s behavior or skill acquisition (Tawney & Gast, 1984).
Some researchers, however, argue that visual analysis detects only the strongest treatment effects, while more subtle, yet important, treatment effects may go unidentified (Brossart, Parker, Olson, & Mahadevan, 2006; DeProspero & Cohen, 1979; Harbst, Ottenbacher, & Harris, 1991; Kromrey & Foster-Johnson, 1996; Ottenbacher, 1990; Park, Marascuilo, & Gaylord-Ross, 1990). Other researchers have asserted that visual analysis can be imprecise (Brossart et al., 2006). Studies of interrater agreement across multiple graphs have found reliabilities of .40 to .60, which is in the low-to-moderate range (Harbst et al., 1991; Ottenbacher, 1990; Park et al., 1990). In addition, neither the inclusion of graphic features such as trend lines (Hojem & Ottenbacher, 1988) nor extensive training in visual analysis (Harbst et al., 1991; Wampold & Furlong, 1981) has improved the reliability of visual-analysis judgments.
However, determining which inferential or group-statistical analyses could supplement visual analysis or fit more generally in SCR is not easy. A true challenge arises in the development of statistical analyses that can incorporate the erudition and sensitivity involved in many single-case designs (D. M. White et al., 1989). White and colleagues maintained that single-case designs can get elegantly complicated to reflect accurately intricate human behavior/learning, thereby rendering the use of statistical analyses challenging, if not problematic. Applied special education research has been referred to as the most difficult type of scientific research to conduct (Berliner, 2002). Odom et al. (2005) discussed the issues that contribute to the SCR complexities, such as the diversity of the special education population, and the contexts in which studies are conducted. The Individuals With Disabilities Education Act (1997) described 12 categories in which students may be eligible for special education services. Each category, or type of disability, can exist in different forms and manifest to different degrees. Special education services are provided in a variety of settings, depending on the individual needs of the student. Students may receive special education services in the general education, special education, or small-group setting for all or portions of the school day. Students and young adults with disabilities may receive services in community, home, vocational, other transitional settings, or in a combination of these settings.
Parker and Brossart (2006) reported that single-case designs have become increasingly complex over time to accommodate the growing knowledge base for such a diverse subsection of education, and researchers do not have appropriate tools with which to apply statistical analyses to these designs. The complex nature of single-case data violates underlying assumptions that must be met for many inferential analyses to be valid: normal distribution with constant variance, independence of observations, and random selection of participants (Kratochwill et al., 2010). In addition, even if the assumptions were met, many inferential analyses such as ANOVAs and t tests would not have adequate power to detect significant differences with the relatively small number of participants typically found in a single-case design. The use of randomization tests (Edgington, 1980; Todman & Dugard, 2001) has been suggested as a means of addressing the issue of small sample size, but the use of randomization tests in SCR is fundamentally problematic because random assignment of individuals to test conditions can be unethical and/or impossible with low-incidence populations (Haardoerfer & Gagné, 2010; Kazdin, 1980; Van den Noortgate & Onghena, 2003). Another dilemma is that frequent measurements of the same participants in a relatively short span of time result in a violation of the assumption of independence of observations. Most statistics used for the purpose of analyzing repeated-measures data, including regression analysis, are not designed for the sophistication of single-case designs, wherein results pertinent to multiple growth rates within phases and multiple shifts between phases are of interest to the researcher.
Serial dependency can be viewed as an inherent property of an individual’s data and is another factor that limits the eligibility of single-case data for most statistical analyses. Jones, Weinrott, and Vaught (1978) examined the effects of serial dependency on agreement between visual-analysis data inferences and statistical time-series inferences. Serial dependency, also known as autocorrelation, is demonstrated when temporally adjacent scores are related to one another, such that performance in Data Point 1 predicts performance in Data Point 2 and so on throughout baseline and treatment phases. Jones et al. found agreement between the two methods of inference to be an inverse function of autocorrelation levels. Low levels of autocorrelation were associated with higher agreement, and higher levels of autocorrelation were associated with lower agreement. In sum, they found that autocorrelation affected statistical results, visual-analysis results, and agreement between the two.
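To make the idea of serial dependency concrete, the following minimal sketch computes a lag-1 autocorrelation for one participant's series of scores. The function name and the two example series are hypothetical and chosen only for illustration; they are not data from Jones et al. (1978) or from the present study, and the estimator shown is one common textbook form rather than the one used in any cited work.

```python
def lag1_autocorrelation(scores):
    """Lag-1 autocorrelation: how strongly each score predicts the next one."""
    n = len(scores)
    if n < 3:
        raise ValueError("need at least three data points")
    mean = sum(scores) / n
    total_ss = sum((y - mean) ** 2 for y in scores)
    if total_ss == 0:
        # Convention chosen for this sketch: a flat series has no variance to correlate.
        return 0.0
    # Sum of cross-products between temporally adjacent deviations.
    adjacent = sum((scores[t] - mean) * (scores[t + 1] - mean) for t in range(n - 1))
    return adjacent / total_ss


# Hypothetical series: a steady upward trend and a strictly alternating pattern.
trending = [1, 2, 3, 4, 5, 6, 7, 8]
alternating = [1, 8, 1, 8, 1, 8, 1, 8]

print(lag1_autocorrelation(trending))     # positive: adjacent scores resemble each other
print(lag1_autocorrelation(alternating))  # negative: adjacent scores move in opposite directions
```

A value near zero would indicate that adjacent data points carry little information about one another, which is the condition most inferential tests assume.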
Nonparametric models have been developed to examine single-case data for effect sizes and do not require distributional assumptions to be met. Within visual analysis, data dispersion, or variability among data, is examined typically by calculating the percentage of overlapping data points between phases. Often, effect sizes are determined by calculating the percentage of nonoverlapping data (PND) points between baseline and treatment phases (Scruggs, Mastropieri, & Casto, 1987). The PND, however, has been criticized for several reasons, including its lack of a sampling distribution, which prevents confidence intervals and p values from being calculated, and its reliance on only one baseline data point to the exclusion of all others. The latter criticism is the most common one because the one baseline data point under consideration, which is the most extreme data point, can also be the most unreliable one.
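The PND calculation, and its sensitivity to a single extreme baseline point, can be sketched as follows. This is an illustrative implementation under the usual definition (the percentage of treatment-phase points that exceed the most extreme baseline point); the function name, the `increase_expected` parameter, and the data are hypothetical.

```python
def pnd(baseline, treatment, increase_expected=True):
    """Percentage of treatment points that do not overlap the most extreme baseline point."""
    if not baseline or not treatment:
        raise ValueError("both phases need at least one data point")
    if increase_expected:
        extreme = max(baseline)  # only this single baseline point enters the calculation
        non_overlapping = sum(1 for y in treatment if y > extreme)
    else:
        extreme = min(baseline)
        non_overlapping = sum(1 for y in treatment if y < extreme)
    return 100.0 * non_overlapping / len(treatment)


# Hypothetical data for a behavior expected to increase during treatment.
treatment = [2, 3, 5, 7, 9, 10, 12, 12]
print(pnd([1, 2, 1, 2], treatment))  # 87.5
print(pnd([1, 2, 1, 6], treatment))  # 62.5: one aberrant baseline point lowers the index
```

Changing one baseline value from 2 to 6 lowers the index from 87.5 to 62.5 even though the treatment data are unchanged, which illustrates the criticism described above.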
Parker, Hagan-Burke, and Vannest (2007) developed the percentage of all nonoverlapping data points (PAND) as a potential new approach to determining effect sizes in single-case studies that show positive treatment effects. PAND differs from PND in that it incorporates all data points in phases, thereby circumventing reliance on a singular data point and demonstrating a more frugal use of data. A PAND coefficient can be converted to the commonly used Pearson’s Phi, which has a known sampling distribution leading to the calculation of p values and confidence intervals. A criticism of PAND and PND is that they do not take into account prior baseline trends that, through visual analysis, are taken into account before a functional relation is declared between a treatment and a behavior/learning change. PAND and PND simply measure mean level shifts. O. White (1987) suggested integrating a baseline slope in PND calculations, to which Salzberg, Strain, and Baer (1987) replied prophetically that single-case data might be too complicated to analyze statistically.
Some attempts have been made to apply other statistical techniques to SCR designs. Solanas, Manolov, and Onghena (2010) examined a simple AB design to estimate slope and level change between two adjacent phases when n = 1. Parker and Brossart (2006) used hypothetical data to demonstrate phase-contrast techniques to produce effect sizes for simple mean shifts, for ABA, ABAB, and multiple baseline designs, yet paid no attention to growth within phases. All these researchers experienced and discussed difficulties with autocorrelation and with violations of basic statistical assumptions, such as independence of observations, that resulted in distorted inference testing.
HLM is another advanced, inferential technique that has been applied to SCR data. An advantage of HLM is that it is flexible and can accommodate data obtained from heterogeneous populations and settings, as well as data from complicated single-case designs. Nugent (1996) suggested the use of HLM to investigate change over time within a phase, yet HLM has been used most commonly in a meta-analytic framework (Jenson et al., 2007; Van den Noortgate & Onghena, 2003, 2008) to compare phase means as a measure of treatment effect sizes. Some researchers have made effect-size comparisons across AB designs (Van den Noortgate & Onghena, 2003, 2007), and Ferron, Bell, Hess, Rendina-Gobioff, and Hibbard (2009) have used simulated data to demonstrate effect-size comparisons across multiple phases of a multiple baseline design. The use of HLM to conduct meta-analyses that provide average treatment effects across phases can be considered useful; however, it fails to consider growth patterns within phases as well as variability across and within phases. It does not take advantage of a strong feature of HLM: explaining variability with predictors.
A primary focus of IES is on the thorough investigation of intervention effectiveness (Whitehurst, 2003). Researchers can no longer report intervention effectiveness only, but should include more detailed information such as for whom and in what context an intervention is effective (Guralnick, 1999). Furthermore, the use of effect sizes to generalize a successful intervention to other students within the special education population is fundamentally problematic due to the underlying logic that assumes the original participants under investigation were selected from a random sample, and as discussed earlier, that is rarely the case. The use of any inferential statistic implies an effort to draw an inference about or to generalize to a population beyond the participants in or the circumstances of the study. It is beyond the scope of this article to take a position on the role of generalizability or on any of the other themes in the debate about whether to incorporate statistical analyses into SCR. That said, single-case researchers who want to apply inferential statistics to their data do have suitable options for doing so, and it is the purpose of this article to illustrate the use of one such option, repeated-measures HLM.
Kratochwill et al. (2010) state that multilevel models, which include HLM, are the least understood analytic procedures in terms of implementation and interpretation of results. If the procedures are not well known, then their use is not likely to become prevalent. Therefore, we will provide a brief overview of HLM and growth modeling. Then, we will provide two demonstrations of how HLM can be used to analyze single-case data from two different SCR designs. In both scenarios, the combined approach to examining single-case data will yield more information than the use of visual analysis alone. In the first demonstration, we will use growth HLM to accommodate the complex single-case design discussed earlier in this article. The large, single-case study incorporated a multiple baseline design with an embedded changing criterion. The combined use of growth HLM and visual analysis will provide meaningful information about the effect of a literacy intervention on individual student learning as well as provide tests of the average effects. By augmenting visual analysis with growth HLM, we will be able to quantify and to test statistically student growth trajectories within and across treatment phase changes, to quantify and to test statistically variability at initial-baseline probe and among growth trajectories, and to explore individual student characteristics as explanatory variables of student-performance variability at initial-baseline probe and among growth trajectories. In recent years, a mini-reversal has been included in studies to demonstrate the strength of a changing criterion design as an observable behavior increases and decreases in response to application and suspension of an intervention (Kazdin, 1982). However, in studies such as ours, withdrawal of the intervention is not appropriate during the examination of acquisition of an academic skill.
Next, we will provide simulated data to show the usefulness of repeated-measures HLM analyses within a hypothetical, single-case ABAB design that is also referred to in special education research as a reversal design. This demonstration will show how HLM can be beneficial when subtle, yet important changes in student performance may elude the eye or be impossible to determine visually for a collective group.
Brief Overview of HLM and Growth Modeling
A researcher investigating the extent to which one set of variables can predict an outcome variable could take a sample of participants, collect data from them, and apply multiple regression to analyze that prediction model. Suppose such an analysis occurred within each of 40 schools in a state. That would yield 40 regression equations, one for each school, and those equations would likely differ to some degree in the y-intercept and in the slope coefficients. In such a design, the students are said to be nested within schools, that is, students in one school have potentially relevant characteristics in common with each other that they do not have in common with students in the other schools (e.g., size of the school, average experience of the teachers, experience of the principal, affluence of the surrounding neighborhoods). Suppose further that theory supported a claim that the differences among the intercepts and/or the differences among the slopes could be explained, at least in part, by attributes of the school. In such a scenario, the researcher can use HLM to quantify the variance among schools in the intercepts and in the slopes, to include school-level variables in the explanation of variance of the student-level outcome or in the variance among the schools’ prediction equations, and to obtain statistical significance tests for those quantities.
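In one common notation, the school example can be written as a two-level model for student i in school j. The symbols below are assumed for illustration (a single student-level predictor X and a single school-level attribute W) and are not taken from the analyses reported in this article.

```latex
\begin{align*}
\text{Level 1 (students):}\quad Y_{ij} &= \beta_{0j} + \beta_{1j} X_{ij} + r_{ij} \\
\text{Level 2 (schools):}\quad \beta_{0j} &= \gamma_{00} + \gamma_{01} W_{j} + u_{0j} \\
\beta_{1j} &= \gamma_{10} + \gamma_{11} W_{j} + u_{1j}
\end{align*}
```

In this sketch, the variances of u_{0j} and u_{1j} express how much the intercepts and slopes still differ among schools, and the coefficients \gamma_{01} and \gamma_{11} express how much of those differences is associated with the school attribute W.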
Conceived in this manner, HLM requires a substantial number of participants to function properly. If 100 students are sampled from each school, and 40 schools are in the study, then the researcher would be collecting data from 4,000 students. Even in the largest studies, single-case researchers generally do not collect data from more than 20 to 30 students, let alone thousands of students from multiple schools, and many single-case studies have fewer than 10 participants total. It would thus seem that HLM has no place in SCR.
The relevance of HLM to SCR begins with a different nested relationship and with the longitudinal nature of SCR. Because most longitudinal studies involve measuring the same participants on multiple occasions, the data have some degree of serial dependence, which violates one of the central assumptions required for most inferential-statistical analyses. The analysis of longitudinal data, however, is common in group-design research via growth modeling and is one class of analyses used for repeated measurements of the same people. Repeated-measures HLM can be used for modeling growth because each person’s multiple measurements are nested within that person. Each person therefore has an individual growth equation, with an intercept and a slope (i.e., the growth parameter), and differences between these equations can be modeled as a function of attributes of the people (e.g., initial anxiety level, initial reading score, presence/absence of a developmental disorder).
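The nesting of measurement occasions within persons can be illustrated with a simple two-step sketch: fit a separate least-squares line to each person's repeated measures, then summarize how the resulting intercepts and slopes vary across people. This is not the estimation procedure HLM uses (HLM estimates both levels jointly), and the student labels and scores are hypothetical; the sketch only shows what "an individual growth equation for each person" means.

```python
from statistics import mean, variance


def fit_line(times, scores):
    """Ordinary least-squares intercept and slope for one person's repeated measures."""
    t_bar, y_bar = mean(times), mean(scores)
    sxx = sum((t - t_bar) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("need at least two distinct measurement occasions")
    slope = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, scores)) / sxx
    return y_bar - slope * t_bar, slope


# Hypothetical data: five sessions (coded 0-4) for each of three students.
sessions = [0, 1, 2, 3, 4]
students = {
    "A": [2, 4, 6, 8, 10],
    "B": [1, 2, 3, 4, 5],
    "C": [0, 3, 6, 9, 12],
}

# Step 1: one growth equation (intercept, slope) per student.
growth = {name: fit_line(sessions, scores) for name, scores in students.items()}
intercepts = [b0 for b0, _ in growth.values()]
slopes = [b1 for _, b1 in growth.values()]

# Step 2: describe the average trajectory and the between-student variability.
print("mean intercept:", mean(intercepts), "mean slope:", mean(slopes))
print("variance of intercepts:", variance(intercepts), "variance of slopes:", variance(slopes))
```

Person-level attributes, such as an initial reading score, would enter at the second step as predictors of the intercepts and slopes.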
In group designs, the large number of participants usually leads to relatively few data-collection occasions: as few as 3 in some studies and rarely more than 10. Although such a low number of measurement occasions is not optimal for HLM, having a large enough number of participants can compensate (Hox, 2002; Raudenbush & Bryk, 2002; Singer & Willett, 2003). This compensatory relation between the number of measurement occasions and the number of participants measured also works the other way: A sufficiently high number of measurement occasions can offset the estimation problems that would otherwise result from having relatively few participants. In other words, the typically small n found in SCR is not a problem for repeated-measures HLM, as long as data are collected from each participant many times, which is often the situation in a single-case design. Haardoerfer (2010) provides a thorough discussion of number of participants needed for repeated-measures HLM.
In addition to being able to accommodate data that have serial dependence, HLM is flexible enough to handle the very characteristics that make the quantification of special education data so complex. It can integrate unbalanced data that are collected at different measurement occasions that span amounts of time that vary across participants. HLM can accommodate data from heterogeneous groups that lack random assignment due to availability constraints and due to the assignment of students to certain groups that best match the goals of students’ individualized education plan (IEP). HLM considers the diversity of participants and contexts in that it accounts for sources of variability contributed by individual traits, skills, and/or characteristics. This is an approach that is consistent with a major goal of special education research, which is to identify those for whom a treatment is effective and those for whom that treatment is not effective.
Unlike other statistical analyses and approaches reviewed in this article, repeated-measures HLM can take into account all baseline activity and the examination of each phase shift as it relates to baseline phase(s) and/or treatment phases. By involving all data points in calculations, HLM can indicate whether changes in baseline activity systematically increase or decrease, thereby signaling potential trends in the data. Repeated-measures HLM therefore quantifies what visual analysis already does well, while providing the statistical significance tests major agencies are calling for in SCR.
Another contribution of HLM to visual analysis of single-case data is that it can reveal whether, before treatment begins, there is a statistically significant amount of variance in initial-baseline scores. HLM also offers a subsequent method of exploring variables that might explain such differences. It would be very interesting to know whether any significant variability among the students at initial-baseline status can be explained by student characteristics (e.g., IQ or socioeconomic status [SES]). This is valuable information to have in single-case studies because differences at the beginning of a study can sometimes explain differences in learning or behavioral outcomes.
For the treatment phases of single-case studies, average incremental-growth trajectories across all participants can be produced with growth modeling via HLM that are impossible to identify visually, especially with the largest single-case designs that incorporate many repeated measurements of behavior/learning and many phase changes. If, through growth HLM, enough variance is detected in growth trajectories of students within phases, the option is available to explore particular student characteristics and/or social phenomena that might account for the large variance.
Demonstration 1
In this section, we provide a sample of the data from an actual single-case design to demonstrate how HLM can augment visual analysis in an applied, complex SCR study. The data presented here are part of a larger literacy project designed to create an ILC for students with MoID (IES Grant R324A070144). Note that the purpose here is not to introduce the reading curriculum but to illustrate an ideal scenario in which the results from growth HLM can be a useful supplement to the results provided by visual analysis of single-case data. The data are from a delayed multiple baseline, across three instructional groups, with an embedded, changing criterion design. Three phases of instruction constitute an ABC design on each multiple baseline tier. Visual analysis is used to identify functional relations between instruction and learning performance for all students. Growth HLM is used to augment visual analysis by quantifying relations at the group level (e.g., reporting average growth per instructional session), statistically testing variability at initial-baseline probe and among growth trajectories within all phases, and examining student prerequisite skills as possible predictors of variability among growth trajectories within phases.
Summary of Method
The participants were 11 students diagnosed with moderate to severe intellectual disabilities and were a subset of participants from the larger IES grant. Students were selected who had reached mastery of the first two subphases of instruction at the time of analysis. They were instructed in three groups, across two classes, in two different schools from two school districts in the Southeast.
The Print Knowledge (PK) subtest of the Test of Preschool Early Literacy (TOPEL) was administered to students prior to instruction (Lonigan, Wagner, & Torgesen, 2007) to measure students’ emergent-literacy skills and phonological-awareness skills. The Receptive One Word Picture Vocabulary Test (ROWPVT) was used to measure receptive vocabulary (Brownell, 2000). All pretests were administered individually by researchers in a private testing area of the students’ schools. Preschool tests were used because most of the students’ developmental ages were in the preschool range.
A delayed, multiple baseline design, across the three instructional groups, with an embedded, changing criterion design was implemented. The DV was the number of times the student read the word correctly when the word was presented to the student on a stimulus card. The multiple baseline design was selected to demonstrate replication across groups. Embedding the changing criterion design accommodated the systematic increase in performance required across subphases as students mastered increasing numbers of sight words (Kazdin, 1982). Functional relations were determined by replication across tiers of the multiple baselines and across subphases of the changing criterion.
Provided here are data through one phase and two subphases of the first word set of the program that occurred in the same sequence for each reading group, represented on each tier of the multiple baseline. The first phase was the baseline phase; the first subphase of the reading intervention was acquisition of nouns followed by acquisition of adjectives in the second subphase. The consecutive three phases constituted an ABC design on each multiple baseline tier (Alberto & Troutman, 2009). Each reading group reached mastery for a phase collectively before beginning a subsequent phase. The mastery criterion for phases was a group average of 80% correct for two out of the last three sessions, with each student mastering at least 80% of the sight words presented.
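The mastery rule can be expressed as a short check. The sketch below adopts one possible reading of the criterion, stated here as an assumption: the group total must reach 80% in at least two of the last three sessions, and each student must individually reach 80% in at least two of those same three sessions. The function name, parameters, and data are hypothetical and are not the project's actual decision procedure.

```python
def group_has_mastered(sessions, trials, threshold=0.80, needed=2, window=3):
    """sessions: list of per-session lists holding each student's number correct."""
    recent = sessions[-window:]
    if len(recent) < window:
        return False  # not enough sessions yet to apply the rule
    n_students = len(recent[0])
    # Sessions in which the group as a whole reached the threshold.
    group_hits = sum(
        sum(session) >= threshold * trials * n_students for session in recent
    )
    # For each student, sessions in which that student reached the threshold.
    student_hits = [
        sum(session[i] >= threshold * trials for session in recent)
        for i in range(n_students)
    ]
    return group_hits >= needed and all(hits >= needed for hits in student_hits)


# Hypothetical three-student group with 12 trials per session.
on_track = [[9, 10, 12], [10, 11, 12], [10, 9, 12]]
one_lagging = [[6, 12, 12], [7, 12, 12], [8, 12, 12]]

print(group_has_mastered(on_track, trials=12))     # True
print(group_has_mastered(one_lagging, trials=12))  # False: group average is met, one student is not
```

The second example shows why the individual requirement matters: a group average above 80% can mask a student who has not yet reached the criterion.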
All reading-performance data were recorded by teachers while they implemented sight-word probe sessions prior to teaching sessions. Teachers were provided daily data-collection sheets on which they recorded correct and incorrect student responses. The researchers monitored this process by observing teachers recording data, providing ongoing feedback, and answering teacher questions for a minimum of one daily sequence per week. Baseline stimuli included four nouns and four adjectives. Baseline data were collected for each student until the data were stable with no data point varying more than 50% above or below the mean. For the Nouns Subphase, four nouns were presented three times each for a total of 12 trials per probe session. During the Adjectives Subphase, four adjectives were presented three times each. In addition, each of the previously taught four nouns was presented once per session for a total of 16 sight words presented per probe session during the Adjectives Subphase.
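The trial counts and the baseline-stability rule described above can be checked with a few lines of arithmetic. In this sketch the constant and function names are hypothetical, and the four-session series of ones and twos is an assumed illustration, because the exact session-by-session sequence is not reported here.

```python
# Trials per probe session, following the description in the text.
NOUN_TRIALS = 4 * 3               # four nouns, three presentations each
ADJECTIVE_TRIALS = 4 * 3 + 4 * 1  # four adjectives three times each, plus four review nouns once each


def baseline_is_stable(points, tolerance=0.50):
    """True when no data point lies more than `tolerance` (as a proportion) above or below the mean."""
    if not points:
        raise ValueError("need at least one baseline data point")
    mean = sum(points) / len(points)
    lower, upper = mean * (1 - tolerance), mean * (1 + tolerance)
    return all(lower <= y <= upper for y in points)


print(NOUN_TRIALS, ADJECTIVE_TRIALS)     # 12 16
print(baseline_is_stable([0, 0, 0]))     # True: a flat baseline of zero correct
print(baseline_is_stable([1, 2, 1, 2]))  # True: all points within 0.75 to 2.25
print(baseline_is_stable([1, 4, 1, 2]))  # False: 4 exceeds the upper bound of 3
```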
Results
Visual analysis
In the interest of space, data are detailed for only three students (Shane, Tate, and Meg), one from each of the three groups; the full sample was used for the visual analysis and for the HLM analysis. As seen in Figure 1, the students’ individual reading data are depicted in a three-tier, multiple baseline design, thus providing a view of each student’s daily-reading performance. Daily-performance data are disaggregated such that one member from each group represents his or her respective group. The daily-performance data for each representative group member are displayed on Tiers 1 through 3 of the delayed, multiple baseline graph. Dashed lines across each phase indicate interim criterion for that phase, and the number in parentheses is the actual number of words needed correct for 80% mastery.

Figure 1. A delayed, multiple baseline design with an embedded changing criterion design depicting number of words read correctly.
All three students demonstrated mastery of each phase. Baseline data were zero correct sight words across three sessions for Shane and Meg, and one or two words correct across four sessions for Tate. During some intervention subphases, students remained at or above the mastery criterion for longer than the required two out of the last three sessions. This occurred because the students were receiving group instruction and the entire group had to reach mastery collectively before proceeding to the next subphase. As a result of this group-instruction decision rule, students who reached mastery early in an intervention subphase did not move on until the other group members reached mastery.
Shane
During the Nouns Subphase, Shane had a range of 3 to 12 correct responses with a mean of 7.9 and reached mastery after 13 sessions. During the Adjectives Subphase, his correct responses ranged from 5 to 16 with a mean of 11, and he reached mastery after 13 sessions.
Tate
During the Nouns Subphase, Tate had a range of 7 to 12 correct responses with a mean of 10.9. He reached mastery after five sessions. During the Adjectives Subphase, his correct responses ranged from 6 to 16 with a mean of 14.9, and he reached mastery after four sessions.
Meg
During the Nouns Subphase, Meg had a range of 0 to 12 correct responses with a mean of 4. She reached mastery of nouns after 21 sessions. During the Adjectives Subphase, her correct responses ranged from 4 to 13 with a mean of 8.6, and she reached mastery after 22 sessions.
Hierarchical linear growth modeling
The results from the visual analysis provide a necessary, detailed inspection of the data for each individual. In contrast, the first hierarchical linear growth model, which included all 11 participants, quantifies and statistically tests aggregate measures of within-phase growth, reading values at the beginning of the study, phase shifts in reading score, and the variance across participants on each of these measures. A second hierarchical model was used to quantify and to test statistically the extent to which characteristics of the participants explain differences in their initial-baseline scores and differences in their growth rates during the intervention subphases. The single-case data were analyzed using HLM 6: Hierarchical Linear and Nonlinear Modeling software (Raudenbush, Bryk, Cheong, Congdon, & du Toit, 2004).
Model 1
The first hierarchical linear model contained three time variables (time during Baseline, time during Nouns, and time during Adjectives) and two phase-shift variables (between Baseline and Nouns and between Nouns and Adjectives) as predictors of the outcome variable, reading. In this model, the slope for each time variable represents the growth during that variable’s phase. The coefficients for the phase-shift variables act more like dummy-variable coefficients, in that each is the average difference between the end of one phase and the beginning of the next.
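One way to code predictors so that the coefficients carry these interpretations is sketched below in Python. The exact coding used in the HLM 6 analysis is not reported here, so the scheme (each time variable starts at 0 in its own phase and is held at its final value afterward), the variable names, and the session counts are assumptions. The intercept and baseline slope are set to 0 purely as placeholders; the remaining coefficients are the fixed effects reported in the text.

```python
def code_predictors(n_baseline, n_nouns, n_adjectives):
    """Return one dict of predictor values per session, in session order.

    Assumes at least one session in every phase.
    """
    rows = []
    for k in range(n_baseline):
        rows.append({"time_baseline": k, "time_nouns": 0, "time_adjectives": 0,
                     "shift_nouns": 0, "shift_adjectives": 0})
    for k in range(n_nouns):
        # Baseline time is frozen at its last value; the Nouns clock starts at 0.
        rows.append({"time_baseline": n_baseline - 1, "time_nouns": k,
                     "time_adjectives": 0, "shift_nouns": 1, "shift_adjectives": 0})
    for k in range(n_adjectives):
        # Both earlier clocks are frozen; the Adjectives clock starts at 0.
        rows.append({"time_baseline": n_baseline - 1, "time_nouns": n_nouns - 1,
                     "time_adjectives": k, "shift_nouns": 1, "shift_adjectives": 1})
    return rows


def predict(row, coef):
    """Predicted score for one session from the fixed effects only."""
    return coef["intercept"] + sum(coef[name] * value for name, value in row.items())


FIXED_EFFECTS = {
    "intercept": 0.0,          # placeholder, not a reported estimate
    "time_baseline": 0.0,      # placeholder, not a reported estimate
    "time_nouns": 0.707,       # reported growth per session, Nouns
    "time_adjectives": 0.600,  # reported growth per session, Adjectives
    "shift_nouns": 2.046,      # reported shift, Baseline to Nouns
    "shift_adjectives": -2.495,  # reported shift, Nouns to Adjectives
}

rows = code_predictors(n_baseline=3, n_nouns=13, n_adjectives=13)
# Difference between the first Nouns session and the last Baseline session.
print(round(predict(rows[3], FIXED_EFFECTS) - predict(rows[2], FIXED_EFFECTS), 3))
```

Under this coding, the predicted difference across a phase boundary equals the corresponding phase-shift coefficient, which is the interpretation given above.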
The average reading score at the beginning of the Baseline Phase was not statistically significant, t(10) = 1.640, p > .05, and the change in reading score during the Baseline Phase also was not statistically significant, t(10) = 0.810, p > .05. These results indicate that at the beginning of the study, the children did not know the words already, and their knowledge during the Baseline Phase did not increase. Of note is that there was significant variance among the children in their initial scores, τ00 = 12.088, χ2(10) = 48.265, p < .001, meaning that although the average score was not significantly different from 0, there was significant variability among the students’ scores at the beginning of the study. The change in score during Baseline Phase, however, did not vary significantly across the participants, χ2(10) = 4.122, p > .05, so the finding of no change, on average, during the Baseline Phase is consistent across the participants.
From the final baseline measurement to the first measurement in the Nouns Subphase, the change in reading scores was not statistically significant, b40 = 2.046, t(10) = 1.792, p > .05. During the Nouns Subphase, reading scores significantly improved, t(10) = 4.640, p < .01, increasing, on average, by 0.707 words per session, and the variance of the individual growth rates was statistically significant, χ2(10) = 442.648. These results indicate that although there was no change in reading scores immediately after treatment began, there were, on average, significant gains with each session. There was significant variability also across students in the average amount each student’s score changed per session.
Between the final session of the Nouns Subphase and the first session of the Adjectives Subphase, there was a statistically significant drop in reading scores, t(10) = −2.535, p < .05, with scores decreasing by an average of 2.495 words. A drop from the Nouns Subphase to the Adjectives Subphase was anticipated given the introduction of new words. Growth during the Adjectives Subphase was statistically significant, t(10) = 4.345, p < .01, with reading scores increasing, on average, by 0.600 words per session, meaning that the intervention during the Adjectives Subphase led to an average increase with each session. As seen during the Nouns Subphase, there was statistically significant variance in the growth rate, χ2(10) = 103.5936, p < .001, during the Adjectives Subphase.
Model 2
For the second model, the two student-level pretest scores, PK subtest of the TOPEL and ROWPVT, were included. ROWPVT was introduced to the model to attempt to explain differences between students’ scores at the beginning of the Baseline Phase. PK was introduced as a potential way to account for differences in the students’ growth rates during the Nouns Subphase and during the Adjectives Subphase.
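The structure of the second model can be written as a two-level system. The notation below is a sketch rather than the authors' own specification: the subscript ordering is assumed (only the Baseline-to-Nouns shift, reported as b40, and the intercept variance, τ00, are indexed in the text), and the symbols T and S for the time and phase-shift variables are introduced here for illustration.

```latex
% Level 1: session t within student i
Y_{ti} = \pi_{0i}
       + \pi_{1i}\, T^{\mathrm{Base}}_{ti}
       + \pi_{2i}\, T^{\mathrm{Nouns}}_{ti}
       + \pi_{3i}\, T^{\mathrm{Adj}}_{ti}
       + \pi_{4i}\, S^{\mathrm{Nouns}}_{ti}
       + \pi_{5i}\, S^{\mathrm{Adj}}_{ti}
       + e_{ti}

% Level 2: student i
\pi_{0i} = \beta_{00} + \beta_{01}\,\mathrm{ROWPVT}_i + r_{0i}
\pi_{2i} = \beta_{20} + \beta_{21}\,\mathrm{PK}_i + r_{2i}
\pi_{3i} = \beta_{30} + \beta_{31}\,\mathrm{PK}_i + r_{3i}
\pi_{qi} = \beta_{q0} + r_{qi}, \qquad q \in \{1, 4, 5\}
```

Setting β01, β21, and β31 to zero recovers the form of Model 1, in which every Level-1 coefficient has only an average and a student-specific deviation.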
The Level-2 predictors were statistically significant in all three equations. Holding the other predictors constant, a unit increase in a student’s score on the ROWPVT predicted, on average, an increase of 0.185 in the initial reading score, t(9) = 5.274, p < .001. Given the range of scores on the ROWPVT, the 0.185 increase per ROWPVT point is sizable. PK accounted significantly for differences among the growth rates during the Nouns Subphase, t(9) = 2.498, p < .05, with a unit increase in PK associated, on average, with an increase of 0.0282 in the rate of word acquisition per session. PK was also a significant predictor of the growth rate during the Adjectives Subphase, t(9) = 6.480, p < .001, with the growth rate increasing by 0.0588 per unit increase on the PK. In other words, each point higher on the PK at the beginning of the study predicted, on average, 0.0282 more words learned per session during the Nouns Subphase and 0.0588 more words learned per session during the Adjectives Subphase. Despite the ROWPVT and the PK being significant predictors of the intercept and of the treatment-subphase slopes, respectively, there remained statistically significant variance (p < .001) in all of the Level-1 coefficients (except for the baseline slope, the variance of which was not significant in either model).
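To make the size of these Level-2 coefficients concrete, the short Python sketch below scales them by a hypothetical difference between two students. The 10-point score gap is an illustrative assumption, not a value from the study; only the coefficients are taken from the results above.

```python
ROWPVT_ON_INITIAL_SCORE = 0.185   # words per ROWPVT point (reported)
PK_ON_NOUNS_GROWTH = 0.0282       # words per session per PK point (reported)
PK_ON_ADJECTIVES_GROWTH = 0.0588  # words per session per PK point (reported)


def expected_difference(coefficient, score_gap):
    """Model-implied average difference for two students `score_gap` points apart."""
    return coefficient * score_gap


HYPOTHETICAL_GAP = 10  # illustrative only

print(round(expected_difference(ROWPVT_ON_INITIAL_SCORE, HYPOTHETICAL_GAP), 3))
print(round(expected_difference(PK_ON_NOUNS_GROWTH, HYPOTHETICAL_GAP), 3))
print(round(expected_difference(PK_ON_ADJECTIVES_GROWTH, HYPOTHETICAL_GAP), 3))
```

The first value is a difference in initial words known; the other two are differences in words gained per session, so they accumulate over the sessions of a subphase.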
Discussion
A first step in discussing the combination of visual analysis and HLM is to consider the degree of consistency between the two techniques when they were both used to evaluate the same quantity. In our single-case sight-word study, the facets examined by both approaches were whether knowledge of sight words existed before the intervention, the presence or absence of baseline trends, increases or decreases in reading as a function of reading instruction during sessions, and increases or decreases in reading levels at transitions between phases.
Both visual analysis and HLM support the conclusion that at the beginning of the study, most of the children did not know any of the words on which they were being tested. HLM then provided statistical evidence to support the already well-established guidelines visual analysts use to define the Baseline Phase: the participants’ knowledge of the words did not change to a statistically significant extent during the Baseline Phase, and the trend was similarly flat across all the participants. Visual analysis readily showed that there was essentially no change in reading scores from the end of the Baseline Phase to the beginning of the Nouns Subphase, and HLM found that this shift was not significantly different from 0.
For the Nouns Subphase, visual analysis revealed a functional relation between reading and instruction during sessions. HLM also indicated statistically significant growth during the Nouns Subphase. Visual analysis revealed a clear drop in scores when the assessment shifted from Nouns to Adjectives; HLM also detected this drop, flagging it as statistically significant. Visual analysis again indicated a functional relation between reading and instruction during the Adjectives Subphase, and HLM again concurred, deeming the growth statistically significant.
That visual analysis and HLM agreed in all points of analytical overlap, although interesting from a validity standpoint, does not necessarily establish a place for HLM in SCR. The real utility of HLM is the information it adds to the mix. Although visual analysis revealed a functional relation in both instructional subphases, HLM provided specific values for the average rates of word acquisition per instructional session, 0.707 and 0.600 for the Nouns and Adjectives Subphases, respectively. HLM revealed also that there was significant variance in these growth rates during each subphase, indicating that the magnitude of the change varied among the students, while on average, a functional relation was present in both subphases.
Beyond quantifying specifically and testing statistically the trends and fluctuations that were observed via visual analysis, a unique contribution of HLM to SCR comes from the results of the second hierarchical model tested. Visual analysis alone does not provide a method for explaining variability in the DV or in rates of change in the DV over time. In the second hierarchical model, variables were introduced as predictors to try to explain the differences among students at the beginning of the study and to try to explain the large degree of variability observed in the growth rates. In this study, it was shown not only that the initial scores were statistically different from each other but HLM revealed also that these differences were, to a statistically significant extent, a function of pretest scores on ROWPVT.
Pretest levels of PK significantly accounted for the variance in growth rates during the Nouns Subphase and during the Adjectives Subphase, with higher scores on PK predicting higher growth rates during both subphases. In other words, student success in learning to read was significantly and positively linked to levels of preexisting phonological skills. This is interesting not only from a methodological and instructional standpoint but also from the standpoint of reading theory. According to Ehri (2005), based on research findings with other populations of readers, phonological skills must be well developed to provide a necessary foundation for learning sight words. The phonological skills referred to by Ehri are higher level skills such as decoding and orthographic-processing skills. In our study, the phonological skills measured by the PK subtest prior to reading instruction were the most basic prerequisite-reading skills (e.g., emergent-literacy and phonological-awareness skills such as segmenting). Some of the students had developed these basic skills prior to treatment and learned faster; nevertheless, all the students mastered nouns and adjectives including the students who had no prerequisite-reading skills prior to treatment. Due to the previously held belief that students with MoID do not have the potential to learn to read phonetically, phonological skills (e.g., phonological-awareness, word-analysis skills, and spelling) are rarely taught or measured. The results of the present study suggest that students with MoID without well-developed phonological skills can acquire sight words; however, faster acquisition will occur if they are taught even the most basic prerequisite-reading skills that have been shown to facilitate reading for students who are developing typically.
Demonstration 2
In this section, we provide simulated data to show how repeated-measures HLM analyses can augment visual analysis within a single-case ABAB design. An ABAB design is a commonly used SCR design through which intervention effectiveness can be determined if a target response changes in frequency or duration when an intervention is introduced and then reverts to preintervention baseline levels on removal of the intervention. Target behaviors that are appropriate for an ABAB design are ones that can change in a desired direction in the presence of a discriminative stimulus and then temporarily revert to baseline levels when not under stimulus control. This design is not appropriate for academic behaviors because once academic skills are learned, they cannot be “unlearned” when the intervention is suspended. ABAB designs have been used successfully during functional communication training (FCT) research for the purpose of teaching communication skills to students who are nonvocal in academic and therapeutic settings (Davis, Fredrick, Alberto, & Gama, 2012). We provide a simulated data set that could represent a scenario in which an ABAB design is implemented to examine the effectiveness of an intervention (e.g., delivery of token reinforcement) on student performance of a target response (e.g., use of a new communication device) under conditions that differ in the availability of the reinforcement. Through the combined use of visual analysis and repeated-measures HLM, we are able to ascertain visually the impact on the number of target responses when the intervention is applied and removed across phases, to test whether the average number of target responses changes to a statistically significant extent across phases, and to examine the data for differences among the target responses of the students within Baseline and Intervention Phases.
We simulated data for 15 hypothetical participants. A total of 5 measurements are provided for each participant in the A1 Phase, 10 measurements for each participant in the B1 Phase, and 7 and 10 measurements for the remaining two phases, respectively. The data were simulated and analyzed in SAS 9.3. The first A Phase (A1) is an initial-baseline condition with no treatment intervention in effect, and the first B phase (B1) is the condition under which the intervention is first introduced. The second A phase (A2) is a return-to-baseline condition, and the intervention is then reintroduced at the beginning of the second B phase (B2). A functional relation is determined if the DV changes in a desired direction only when the IV is present, and returns to, or near, baseline levels when the IV is not present.
Results
Visual analysis
In the interest of space, data will be depicted for only one hypothetical participant. The full set of simulated data will be used for the HLM analyses. As seen in Figure 2, the student’s individual data are depicted in an ABAB design, thus providing a view of the student’s daily performance.

Figure 2. An ABAB design depicting number of student responses for one hypothetical participant.
A functional relation between the IV and DV was revealed each time the mean level of performance was observed to increase in response to the introduction of the intervention. During the first Baseline Phase (A1), the number of target responses ranged from 2 to 3 across 5 sessions with a mean of 2.75. During the first Intervention Phase (B1), target responses ranged from 3 to 7, with a mean of 5.7 across 10 sessions. The number of student target responses dropped during the Return-to-Baseline Phase (A2) to a range of 2 to 3, with a mean of 2.9, across 7 sessions. Student target responses increased again during the second Intervention Phase (B2) to a range of 4 to 7, with a mean of 5.8, across 10 sessions.
Repeated-measures HLM
The simulated data set contains measurements from 15 hypothetical participants who emitted the target response an average of 2 times across 5 measurements in Phase A1, with random error having a mean of 0 and a standard deviation of 1. During the B1 Phase, responses increased to an average of 5 across 10 measurements with the same error variance, but with additional variance across participants, indicating that the intervention affected students’ responses differently. Target responses then dropped back down to an average of 2.5 in Phase A2 across 7 measurements and rose back to 5 in Phase B2 across 10 measurements. Hence, the simulated data represent a study in which the average number of target responses differs between the two A phases but is equal across the two B phases. In addition, participants did not vary from each other in the A phases, but did so in the B phases. The generated data had phase means of 1.82 (SD = 1.03), 4.98 (SD = 1.38), 2.57 (SD = 0.91), and 4.74 (SD = 1.42) for Phases A1, B1, A2, and B2, respectively.
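A data-generating process of this kind can be sketched as follows. The original data were simulated in SAS 9.3; this Python version is a substitute for illustration and will not reproduce the reported values. The between-participant standard deviation of 1 in the B phases, the independence of the B1 and B2 participant effects, the normal distributions, and the seed are all assumptions, because the text does not report them.

```python
import random
from statistics import mean

# (phase label, measurements per participant, phase mean, between-participant SD)
PHASES = [
    ("A1", 5, 2.0, 0.0),
    ("B1", 10, 5.0, 1.0),   # between-participant SD is an assumed value
    ("A2", 7, 2.5, 0.0),
    ("B2", 10, 5.0, 1.0),   # between-participant SD is an assumed value
]
RESIDUAL_SD = 1.0  # error with mean 0 and SD 1, as described in the text


def simulate_abab(n_participants=15, seed=0):
    """Return a list of (participant, phase, occasion, response) records."""
    rng = random.Random(seed)
    records = []
    for participant in range(n_participants):
        for phase, n_obs, phase_mean, between_sd in PHASES:
            # Participant-specific deviation, present only in the B phases.
            person_effect = rng.gauss(0.0, between_sd) if between_sd > 0 else 0.0
            for occasion in range(n_obs):
                response = phase_mean + person_effect + rng.gauss(0.0, RESIDUAL_SD)
                records.append((participant, phase, occasion, response))
    return records


def phase_means(records):
    """Average response in each phase, pooled over participants."""
    labels = [label for label, _, _, _ in PHASES]
    return {
        label: mean(r[3] for r in records if r[1] == label) for label in labels
    }


data = simulate_abab()
print(len(data))
print({phase: round(value, 2) for phase, value in phase_means(data).items()})
```

Each participant contributes 32 measurements, so the full data set has 480 records. The sketch generates continuous responses; rounding to counts would be a further assumption.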
The HLM analyses found all effects simulated in the data. The means estimated in each phase matched, to two decimal places, those found in the generated data. Furthermore, the results indicate that there is a statistically significant difference of 3.16 between A1 and B1 mean levels, t(19) = 10.64, p < .0001, as well as a significant difference of 2.17 between A2 and B2 mean levels, t(17.1) = 7.13, p < .0001. This means that target responses increased to a significant extent in conditions during which the intervention was in place. Also revealed by HLM was a statistically significant difference of 0.75 between the mean level of the first Baseline Phase and the mean level of the second Baseline Phase, t(26.9) = 4.93, p < .0001, meaning that the two baseline levels were not equal: responding during the return to baseline remained somewhat above the initial-baseline level. No statistically significant difference was found between the mean levels in the two Intervention Phases, t(27.9) = −0.61, p = .5493, meaning that the intervention impact on target responses was of similar magnitude during both phases. However, the HLM analysis found statistically significant variance among participants in both Intervention Phases (τ11 = 1.0226, Z = 2.43, p = .0076 and τ33 = 1.1572, Z = 2.45, p = .0071). This indicates that even though the group, on average, responded similarly to each introduction of the intervention, participants within the group differed from each other in how they responded to the intervention. For all significant differences, the p values were below .01.
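The reported mean differences follow directly from the phase means of the generated data. The Python sketch below recomputes the four contrasts from those reported means; it is simple arithmetic and does not reproduce the HLM tests. The dictionary and contrast names are illustrative.

```python
# Phase means of the generated data, as reported in the text.
PHASE_MEANS = {"A1": 1.82, "B1": 4.98, "A2": 2.57, "B2": 4.74}

# Each contrast is (later or intervention phase) minus (earlier or baseline phase).
CONTRASTS = {
    "B1 - A1": ("B1", "A1"),  # first introduction of the intervention
    "B2 - A2": ("B2", "A2"),  # reintroduction of the intervention
    "A2 - A1": ("A2", "A1"),  # return to baseline vs. initial baseline
    "B2 - B1": ("B2", "B1"),  # second vs. first Intervention Phase
}


def contrast_values(means, contrasts):
    """Return each contrast rounded to two decimal places."""
    return {
        name: round(means[high] - means[low], 2)
        for name, (high, low) in contrasts.items()
    }


for name, value in contrast_values(PHASE_MEANS, CONTRASTS).items():
    print(f"{name}: {value:+.2f}")
```

The small negative B2 minus B1 contrast is consistent with the negative, nonsignificant t value reported for the comparison of the two Intervention Phases.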
Discussion
A first step in discussing the combination of visual analysis and HLM is to consider the degree of consistency between the two techniques when they were both used to evaluate the same quantity. In our simulated study, the findings corroborated by both approaches were that few or no performances of the target response occurred before the intervention, that the number of target responses increased when token reinforcement was available during Intervention Phases, and that the number of target responses decreased to near initial-baseline levels when the reinforcement was no longer available.
In addition to quantifying and testing statistically all visual-analysis findings, HLM enhanced visual analysis in other ways. The change in the number of target responses observed while the intervention was in effect does not appear to be dramatic. However, HLM revealed that changes in mean levels, on average, across all phases occurred to a statistically significant extent. If visual analysis alone were conducted, the magnitude of the treatment impact could be dismissed. In our hypothetical study, statistically significant increases of 3.16 and 2.17 communication responses can have important practical significance in academic settings for students who are nonvocal and have a very limited ability to communicate.
Repeated-measures HLM revealed further information. Even though the increase in responding appeared to be low for all students, the intervention had a variable impact on students. Some students increased responding more than others, and this variability was statistically significant. In an actual study, researchers would then have the option to explore individual characteristics and/or phenomena that might account for the variability in responding. Possible explanatory variables could include students’ receptive or expressive language ability, level of difficulty of the communication response, or whether the academic task was novel.
Finally, visual inspection of 15 single-case graphs can determine intervention effectiveness at the individual level. However, comparing 15 graphs and extracting generalizations about the group is impossible through visual analysis alone. Repeated-measures HLM allows researchers to identify important patterns of responding that occur at the group level and then to conduct finer-grained analyses that produce even more knowledge about the effects of an intervention.
General Discussion
The primary purpose of this article is to describe and to demonstrate several ways that HLM can be used to augment visual analysis in SCR. Because visual analysis has received some criticism from within SCR circles and there is a push to incorporate statistical analyses in SCR to facilitate identification of EBIs, there have been attempts to apply statistics that are not well suited to single-case data. Treating the multiple measurement occasions in single-case designs as nested in the people from whom they are collected, repeated-measures HLM fits well with single-case methodology and complements visual analysis.
Although visual analysis provides information about the behavior over time of each individual studied, HLM can provide aggregate estimates of treatment impact and estimates of the influence of person-level variables on treatment impact, along with inferential, statistical tests of these estimates. Such a combination of results is of particular utility to single-case researchers who implement visual analysis of their graphed data as the primary means by which to assess potential functional relations and who wish also to employ analysis techniques for the collective set of data. In this way, researchers can stay true to the analytic principles of SCR while answering the concerns raised by critics of visual analysis and answering the call of agencies to incorporate inferential statistics. HLM contributes to the effect-size conversation in that researchers have argued that quantitative procedures would be beneficial to SCR because their use would increase the ease with which meta-analyses could be conducted, thereby facilitating generalized assertions about the utility of interventions across a series of studies (D. M. White et al., 1989).
For the identification of EBIs, it would be helpful to include statistical significance with SCR studies. For consumers of evidence-based instructional curricula and practices, it would be helpful to include the practical significance (e.g., students with MoID can learn an average of X sight words per teaching session). It also would be valuable for researchers to be able to inform consumers that the number of sight words students learn during each lesson, on average, increases or decreases by a given amount, dependent on certain prerequisite-skill levels. Populations of learners with MoID acquire academic skills very slowly, and that can equate to frustration and exhaustion for teachers who work with them on a daily basis. If the specific amount of stimuli learned per teaching session is quantified and available to teachers, then it might enhance teachers’ motivation and serve as encouragement. The HLM coefficients could affect a curriculum-selection process. Sometimes educators tend to focus on what a particular student is not learning or on the students who are learning the slowest. HLM is a way of providing actual numbers to make the accomplishments more concrete and visible to the educator or caregiver.
The introduction of HLM has reconceptualized the measurement of change in learning. In the facile, yet poignant, words of a well-known contemporary artist, “The Times They Are a-Changin’” (Dylan, 1964). Researchers now have the ability not only to determine whether a behavior change has occurred due to an intervention but also to express the extent to which the behavior changed, and to explore possible environmental phenomena and idiosyncratic characteristics that influence behavior change in response to a particular intervention.
Research could benefit from further examinations of the combined use of HLM and visual analysis with additional SCR designs that are appropriate for other types of observable behaviors. In addition, we encourage the further use of real data from applied research that is relevant to current issues in educational research. Kazdin (1982) states that within SCR alone, it is difficult to identify interactions between participant characteristics and treatments that contribute to generality of results. This is another way that the use of growth HLM enhances SCR.
Footnotes
Authors’ Note
The opinions expressed are those of the authors and do not represent views of the U.S. Department of Education.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interests with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R324A070144 to Georgia State University.
