Abstract
Replication is a critical aspect of scientific inquiry that presents a variety of challenges to researchers, even under the best of conditions. We conducted a review of replication rates in special education journals similar to the review conducted by Makel et al. in this issue. Unknowingly conducting independent reviews allowed for an unexpected opportunity to examine how two teams of researchers attempted to replicate a previously published study and explore similarities and differences between the outcomes. In our review, we identified 70 replication studies published between 1997 and 2013, indicating that 0.41% of published articles in special education journals are replication studies. Similar to findings reported by Makel et al., our review indicates that most replications are successful and that successful replications are more likely when author overlap occurs. Although there are similar patterns in the two data sets, an examination of exact agreement on article inclusion revealed an agreement rate of 15.2%. Possible explanations for the discrepancy and implications for future directions are provided.
Although science is an imperfect way of knowing and understanding the world, it is likely our best hope for enhancing educational outcomes for children and adolescents with disabilities. Progress is made over time as new experiments are conducted to replicate or falsify prior findings and to examine the generalizability of findings by systematically altering a key aspect of a previous study (e.g., participants, setting, component of the independent variable). Scientific findings are never presumed to be final. Instead, science pursues “an infinite, yet attainable aim: that of ever discovering new, deeper, and more general problems, and of subjecting our ever tentative answers to ever renewed and more rigorous tests” (Popper, 1959, p. 281).
Considering the importance of replication in scientific inquiry, Makel and Plucker (2014) conducted a review of the top 100 education journals to determine how frequently replication studies were published. Their data indicated that replication studies accounted for 0.13% of education articles published in the reviewed journals. Berliner and Glass (2015) described this rate of replication as “frighteningly low” (p. 13) and suggested that replication of effects by researchers other than those involved in the original study or program design, not the randomized control trial, should be considered the real gold standard of research.
Echoing concerns in the broader social sciences, Cook (2014) argued that a lack of replication undermines the ability of special education researchers to guide policy and practice with scientific research. In response to this call for an examination of replication in the special education research literature, we conducted a conceptual replication of Makel and Plucker’s (2014) earlier review of education literature. That is, we used procedures similar to theirs (i.e., reviewed all studies that included the term replicat* in the searchable record) to estimate the prevalence of replication studies in special education research. Our aim was to determine whether the low rate of published replication studies found in the general education literature was also evident in the special education literature. Moreover, as multiple studies are necessary to deem an intervention an evidence-based practice (Gersten et al., 2005; Horner et al., 2005), we believed that this exploration could highlight areas in which additional replication studies are needed. We also examined questions specific to special education, including the prevalence or success of replications by research methodology, type of interventions, and disability category of participants. We hypothesized that the rate of replication would be somewhat higher than observed in the general education literature due to the emphasis on replication in a commonly used special education research methodology, single-case design (Gast & Ledford, 2014).
In the interest of conducting an independent replication, our team used Makel and Plucker’s (2014) manuscript as a guide without consulting the authors. Perhaps naively, we thought replicating a review from a published manuscript would pose minimal challenge. As we attempted to replicate procedures, we met several roadblocks and were quickly reminded that “no plan survives first contact with implementation” (Weir, 2014, p. 49). Databases available at our university differed from those used in the original study. Access to indexed article counts was limited. Substantially more time than initially estimated was required to refine article search procedures, operationalize definitions, and establish interrater agreement. However, after 7 months of work, our data analysis was complete. It was precisely at this time that we discovered that Makel and colleagues were in the process of finalizing their own review of the special education research literature.
Our initial reaction—frustration—is instructive. Despite the many pages appearing in this and other journals attesting to the value of replication as a pillar of scientific inquiry, academics typically regard the duplication of work as less important (Duncan, Engel, Claessens, & Dowsett, 2014; Ioannidis, 2012). Indeed, the work of Makel and others (Hubbard & Armstrong, 1994; Makel & Plucker, 2014; Makel, Plucker, & Hegarty, 2012), which suggests that replication studies are infrequently published, gave us cause to wonder whether we had just wasted 7 months preparing a manuscript that was now potentially unpublishable. We initiated a conversation with Makel to explore whether he and his colleagues would be open to discussing whether, and if so how, findings from our review might have value.
Makel agreed to a dialogue, informed us that his manuscript was going to be included in this special issue, and offered to approach the co-editors to assess their interest in our article. After some discussions between authors and co-editors, the group decided that our co-replications of Makel and Plucker (2014) would add “a delicious meta” component to the special issue (M. C. Makel, personal communication, June 17, 2015). This occurrence presented us with an unexpected opportunity to compare findings from two similar replication projects conducted by different teams of researchers as a way of contextualizing challenges associated with conducting replication studies.
The purpose of this manuscript is to report findings from our review on the rate of replication studies published in special education journals, to further describe these studies, and to compare results with Makel and Plucker’s (2014) original review of the general education literature. As our study was designed as a conceptual replication, our initial exploration was driven by the same research aims as Makel and Plucker. Specifically, we examined the number of replications published over time, the types of replications being conducted (i.e., direct, conceptual), the extent to which studies were published by the same authors in the same journals, and the reported success rate of replication studies.
We extended our analysis to also examine several features of replication studies not explored in the original review. We examined whether the number or success rate of replications varied by research methodology (e.g., group experimental, single case), type of intervention (e.g., academics, behavior), or disability category of participants. Finally, given the opportunity to compare our findings with those of Makel et al. (2016, in this issue), we assessed the similarities and differences across the two reviews of the special education research literature and explored possible explanations for any differences. We conclude with thoughts on what can be learned from conducting replication studies.
Method
We designed our study to replicate the procedures outlined by Makel and Plucker (2014) for identifying published replication studies. In the next sections, we describe our attempts to closely follow Makel and Plucker’s methods, and we highlight instances in which departures from their procedures were required. We explain our process for identifying and coding articles, provide definitions used during coding, and describe our procedures for comparing our findings with those of Makel et al. (2016).
Article Identification
In November 2014, we identified all special education journals that were included in the most recent Thomson Reuters Journal Citation Reports (JCR; Social Sciences Citation Index, Thomson Reuters, 2013). For each journal, we used JCR to record total articles (i.e., total citable items excluding reviews) published for each year between 1997 and 2013. This departure from Makel and Plucker’s (2014) search of the full publication history was required because our university’s JCR subscription included only these years (i.e., studies published prior to 1997 were not included in the subscription, and studies published in 2014 had not been indexed at the time of our search). In cases where JCR did not report a total article count for an included journal for a year within the 1997–2013 time frame, we performed a hand search to calculate the number of articles published during that year. To match JCR’s article counting method, book reviews, tables of contents, and miscellaneous items (e.g., errata) were not included in the count.
After consulting with an information specialist from our university library, we used four electronic databases for our search (i.e., PsycInfo, PsycARTICLES, ERIC, and ProQuest Central Next). This departure from Makel and Plucker’s (2014) use of Web of Science was required because (a) our university’s subscription, though nominally identical, offers less comprehensive access than the database featured in the original study (e.g., search years and full-text searching were limited) and (b) our target journals were better indexed in the selected databases (e.g., more journals were indexed by full text). Next, we used the selected databases to identify studies within each journal that included the term replicat* in the article record (e.g., abstract, title) or in full text if full-text searching was possible. Data were combined for journals whose publication title changed over the time span of the search. The search for each journal was exported and printed. Note that JCR lists 37 journal titles; however, two listings refer to the same publication, whose title changed (British Journal of Developmental Disabilities [1997–2011], International Journal of Developmental Disabilities [2012–2013]).
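As an illustration only, the truncated search term replicat* behaves like a prefix match on the stem "replicat." The minimal Python sketch below assumes that a regular expression can stand in for a database wildcard operator; the function name `mentions_replication` and the sample records are hypothetical, and actual database search engines may tokenize and match text differently.

```python
import re

# Hypothetical stand-in for the database wildcard "replicat*":
# match any word that begins with the stem "replicat", ignoring case.
REPLICAT_PATTERN = re.compile(r"\breplicat\w*", re.IGNORECASE)


def mentions_replication(record_text: str) -> bool:
    """Return True if the searchable text contains a word starting with 'replicat'."""
    return REPLICAT_PATTERN.search(record_text) is not None


# Invented sample records, for illustration only.
records = [
    "This study replicates an earlier intervention trial.",
    "A replication and extension of prior findings.",
    "Results should be confirmed in future research.",
]
flagged = [text for text in records if mentions_replication(text)]
print(len(flagged))  # 2 of the 3 sample records contain the stem
```

A match of this kind only flags a record for human review; it says nothing about whether the article is itself a replication study.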
To qualify as a replication of a previously published study, the article had to meet the following criteria: (a) Authors explicitly used a variation of the term replicat* to indicate that the study was a replication of a previously published, specifically cited, identifiable study; and (b) the previous study had to be published in a journal, chapter, dissertation, or other publicly accessible publication. Articles were included regardless of methodology. Articles were excluded if (a) it was unclear which previously published study was being replicated (e.g., reporting replicating previous research with no citation); (b) it was clear that the authors had not replicated a previous study, but had only used a similar method or portion of a curriculum; (c) the term replicat* was used nonspecifically (e.g., indicated that findings replicate previous findings or that current results should be replicated in future research); (d) the study being replicated was not publicly accessible (e.g., conference presentation, in press, under review); or (e) the replication occurred within the context of the manuscript (e.g., two experiments in one manuscript, multiple cases reported in a single-case design).
Inclusion and exclusion criteria used by our research team were operationalized to closely follow those used by Makel and Plucker (2014). However, for our research team to establish sufficient interrater agreement, we further defined our criteria. Our only intentional departure from Makel and Plucker was the exclusion of within-manuscript replications. We decided to exclude within-manuscript replications from our review because single-case design studies often include this element (e.g., multiple baseline across behaviors for multiple subjects). As Makel and Plucker did not include single-case design studies, we believed excluding these types of replications in our review would facilitate the clearest comparison of our results with theirs.
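The inclusion and exclusion criteria can be summarized as a single decision rule. The sketch below is one possible encoding, not the coding protocol itself: the class `CodedRecord`, its field names, and the function `qualifies_as_replication` are hypothetical labels for judgments that, in the review, were made by trained coders reading each record.

```python
from dataclasses import dataclass


@dataclass
class CodedRecord:
    # All field names are hypothetical labels for a coder's judgments.
    claims_replication: bool            # authors use a variation of "replicat*" for their own study
    cites_identifiable_original: bool   # a specific prior study is cited
    original_publicly_accessible: bool  # journal, chapter, dissertation, or similar
    only_similar_method: bool           # borrowed a method or part of a curriculum only
    nonspecific_use_of_term: bool       # e.g., "findings replicate previous research"
    within_manuscript_only: bool        # replication occurs inside a single manuscript


def qualifies_as_replication(record: CodedRecord) -> bool:
    """Apply the inclusion criteria, then the exclusion criteria."""
    meets_inclusion = (
        record.claims_replication
        and record.cites_identifiable_original
        and record.original_publicly_accessible
    )
    is_excluded = (
        record.only_similar_method
        or record.nonspecific_use_of_term
        or record.within_manuscript_only
    )
    return meets_inclusion and not is_excluded


example = CodedRecord(True, True, True, False, False, False)
print(qualifies_as_replication(example))  # True
```

Exclusion reasons (a) and (d) are covered here by the negation of the corresponding inclusion fields, so they do not need separate flags.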
Before search records were coded for article inclusion, the research team received training on identifying replication studies within a subset of search records (i.e., abstracts for approximately 250 articles with access to full text as needed) until interrater agreement (total agreements − total disagreements / total possible × 100) with the first author was 95% or greater. Then, one team member independently reviewed each printed search record, referencing full text when necessary, to determine whether an article qualified as a replication of a previously published study. All records were double scored by a second member of the research team. Interrater agreement was 92.7%. Disagreements in coding were resolved through article review, discussion, and consensus at a meeting of the research team. The formula for calculating interrater agreement and the process for resolving disagreements remain consistent throughout the remainder of this manuscript unless otherwise indicated.
Article Coding
The research team received training on a coding protocol used to extract relevant data from each article. Team members recorded identifying information (e.g., year, title, volume) for each article that met inclusionary criteria. Similar to Makel and Plucker (2014), team members recorded the type of replication, authorship and journal overlap, and replication success for each study. Departing from Makel and Plucker’s coding scheme, team members also recorded research methodology, intervention type, and disability status of participants for each study. Team members practiced coding articles using the coding protocol until each established 95% or greater reliability with the first author on two separate studies. These practice articles were then assigned to another team member for the actual data extraction. A random sample of 40% (n = 28) of articles was selected for blinded double scoring to evaluate interrater agreement. A member of the research team other than the initial coder extracted data from this sample of studies. Mean interrater agreement was 95.5% across variables coded (SD = 5.6). Data from the coding protocols were then independently entered into two redundant databases that were compared for data entry errors. Errors were corrected prior to data analysis.
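Two of the steps above, drawing a 40% sample for double scoring and comparing two redundant data entries, can be sketched as follows. This is a minimal illustration under stated assumptions: the article IDs, the seed, the field names, and the function `entry_mismatches` are hypothetical, and the review does not report what software was used for either step.

```python
import random

# Hypothetical IDs for the 70 included articles.
article_ids = list(range(1, 71))

# Draw a 40% random sample for blinded double scoring (seed chosen arbitrarily).
rng = random.Random(2016)
sample_size = round(0.40 * len(article_ids))
double_scored = sorted(rng.sample(article_ids, k=sample_size))
print(sample_size)  # 28


def entry_mismatches(db_a: dict, db_b: dict) -> list:
    """Return (article_id, field) pairs whose values differ between two data entries."""
    mismatches = []
    for article_id in sorted(set(db_a) | set(db_b)):
        row_a = db_a.get(article_id, {})
        row_b = db_b.get(article_id, {})
        for field in sorted(set(row_a) | set(row_b)):
            if row_a.get(field) != row_b.get(field):
                mismatches.append((article_id, field))
    return mismatches


# Invented entries, for illustration only.
first_entry = {1: {"success": "Yes", "type": "conceptual"}}
second_entry = {1: {"success": "Mixed", "type": "conceptual"}}
print(entry_mismatches(first_entry, second_entry))  # [(1, 'success')]
```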
Type of replication
To further operationalize distinctions between direct and conceptual replications, we reviewed the literature and selected definitions provided by Hubbard and Armstrong (1994) to identify replication studies as being either direct or conceptual. This departure from Makel and Plucker (2014) was necessary because the authors did not include definitions of replication types in their manuscript.
Direct replications were defined as follows:
A duplication of a previously published empirical study that is concerned with assessing whether similar findings can be obtained upon repeating the study. This definition covers what are variously referred to as “exact,” “straight,” or “direct” replications. Such works duplicate as closely as possible the research design used in the original study by employing the same variable definitions, settings, measurement instruments, analytical techniques, and so on. An example would be repeating the study with another sample drawn from the same population. (Makel & Plucker, 2014, p. 4)
Conceptual replications were defined as follows:
A duplication of a previously published empirical research project that serves to investigate the generalizability of earlier research findings. The extension does not alter the conceptual relationships involved in the original study, but instead tests them by making changes in some aspects of the design. (Makel & Plucker, 2014, p. 5)
Team members could also indicate that the type of replication could not be determined from the article. In other words, authors did not report the type of replication conducted and team members could not determine the type of replication from the information provided within the article.
Author and journal overlap
Author overlap was coded “Yes” if one or more of the authors of the replication study were included as an author on the original study. Journal overlap was coded “Yes” if the replication study was published in the same journal as the original study, including journals whose name changed during the time frame of the review.
Replication success
This item was coded “No” if the authors reported that the findings did not replicate those of the original study and “Yes” if the authors reported that findings did replicate those of the original study. Findings of decreased magnitude were considered a success as long as statistical significance or a functional relation was maintained. In addition, we allowed authors to note differences in successful replications as long as the differences did not directly conflict with the findings in the original study. If authors reported that some findings did replicate and that some did not, we coded the replication success as “Mixed.” Finally, team members coded this item as “not indicated” if it was unclear from the manuscript whether the replication attempt was successful. We did not review original studies to confirm authors’ report of replication success.
Research methodology
We identified the research methodology for each included article. Following guidance from Shadish, Cook, and Campbell (2002), “Quasi-experimental or group experimental designs” included randomized control trials, regression discontinuity designs, time-series designs, designs with matched control groups, designs that used a control group with or without a pretest, and group designs without a control group (e.g., one-group pre–post design). Following guidance from Gast and Ledford (2014), “Single-case designs” included studies that used single-case logic to experimentally evaluate the functional relation between independent and dependent variables. Designs included multiple baseline, ABAB, alternating treatment, and changing criterion. “Qualitative designs” used the approach described by Brantlinger and colleagues (Brantlinger, Jimenez, Klingner, Pugach, & Richardson, 2005). These nonexperimental designs relied on researcher summaries of observations, interactions, discussions, focus groups, and artifacts. “Correlational and descriptive designs” provided quantitative data but were not experimental. Studies in this category either explored relations between variables or provided a descriptive summary of collected data. If a study included elements of both experimental and nonexperimental designs, the experimental design code was used.
Type of intervention
For experimental studies, the type of intervention being evaluated was also coded. This variable was allowed to be nonmutually exclusive; thus, an article could be included under numerous categories. If no intervention was provided (e.g., comparison of alternate methods of stimulus preference assessment), this variable was coded “Not applicable.” “Reading” interventions targeted skills including phonological awareness, word reading, letter sound knowledge, decoding, fluency, vocabulary, and comprehension. “Mathematics” interventions targeted skills including counting, one-to-one correspondence, number identification, calculation, telling time, money, and measurement. “Writing” interventions targeted skills including writing letters, words, sentences, or longer pieces of connected text. “Content area” interventions targeted any academic subject not included under reading, mathematics, or writing (i.e., science, social studies, language arts [e.g., grammar], foreign language, or music). “Nonacademic behavior” interventions targeted behaviors that did not fall under the other academic areas (e.g., on-task behavior, attention, engagement, decreasing problem behavior, functional skills, social skills, verbal behavior).
Disability category
We coded the Individuals With Disabilities Education Improvement Act (IDEA; 2004) disability category of children or adolescents included in studies when applicable. This variable was nonmutually exclusive. Ten categories of disability were used: autism spectrum disorders, sensory impairment (e.g., deaf-blindness, deafness, visual impairment, blindness, hearing impairment), emotional disturbance, intellectual disability, multiple disabilities, orthopedic impairment, other health impairment, specific learning disability, speech or language impairment, and traumatic brain injury. We also included codes for individuals without disabilities who were not identified as being at risk of disability or school failure, individuals without disabilities who were identified as being at risk of disability or school failure, individuals without disabilities whose at-risk status was not reported, and individuals with disabilities whose disability category was not specified. In addition, a code was used when disability status was unclear or not reported.
Comparison With Makel et al.’s Review
After we had completed the previously described analyses, we obtained Makel et al.’s (2016) analysis and their data set, which identified reviewed articles and the rationale for inclusion or exclusion for the same time period as our review. We reviewed similarities and differences in our analyses, and we calculated exact article agreement between our reviews (i.e., number of articles included in both reviews / total number of articles reviewed by one or both teams × 100). In addition, we examined whether each article had been reviewed by one or both teams. For articles that were included in Makel et al.’s review yet reviewed and excluded by our team, we evaluated our rationale for the exclusion using previously described procedures. Coding was conducted by the first author and checked for accuracy by members of the research team.
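The exact article agreement formula can be expressed as a ratio of set sizes. The sketch below assumes that the denominator is the set of articles included by at least one team, which is how the counts reported in the comparison results appear to be formed; the function name `exact_agreement` and the article labels are hypothetical.

```python
def exact_agreement(ours: set, theirs: set) -> float:
    """Percentage of articles included by both teams out of all articles included by either."""
    included_by_both = ours & theirs
    included_by_either = ours | theirs
    if not included_by_either:
        raise ValueError("No articles to compare.")
    return 100 * len(included_by_both) / len(included_by_either)


# Invented article labels, for illustration only.
our_included = {"A", "B", "C"}
their_included = {"B", "C", "D", "E"}
print(round(exact_agreement(our_included, their_included), 1))  # 40.0 (2 shared of 5)
```

Because articles included by only one team count against agreement, this index can be low even when the two teams report similar overall replication rates.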
Results
We first present results for our review of the special education research literature. We follow this with a presentation of findings from the comparison of our results with those of Makel et al. (2016).
Review Results
We identified 2,231 out of 17,111 articles that included replicat* in the searchable record. Of these, 70 were deemed to be replication studies according to the criteria described above, resulting in an overall replication rate for special education journals of 0.409%. The search term replicat* occurred within the abstract for 32 (45.7%) of the included studies. A regression of annual replication rate on year revealed no association (p = .969), indicating that there was no change in the publication rate of replication studies between 1997 and 2013. The restricted date range of our review may have influenced this outcome.
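The headline percentages follow directly from the counts reported above. The short sketch below reproduces that arithmetic; the variable names are ours and are used for illustration only.

```python
# Counts reported in the review results.
articles_total = 17_111   # articles published in the reviewed journals, 1997-2013
replications = 70         # articles judged to be replication studies
term_in_abstract = 32     # included studies with replicat* in the abstract

replication_rate = 100 * replications / articles_total
abstract_share = 100 * term_in_abstract / replications

print(f"{replication_rate:.3f}%")  # 0.409%
print(f"{abstract_share:.1f}%")    # 45.7%
```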
Regarding the types of replications being conducted, 63 (90.0%) studies were conceptual and four (5.7%) were direct (see Table 1). Three (4.3%) studies included insufficient information to determine the type of replication conducted. Author overlap with the original study (i.e., one or more shared authors) was found for 58.6% of studies. Most studies (68.6%) were not published in the same journal as the original study. Authors of 65.7% of studies interpreted their findings as successfully replicating the original study, 1.4% indicated a replication failure, 27.1% reported mixed outcomes, and 5.7% of studies included insufficient information to evaluate authors’ interpretation of success. There was a trend for articles with author overlap to report a higher proportion of successful replications (29 of 41; 70.7%) compared with articles with no author overlap (17 of 29; 58.6%).
Table 1. Replication Type, Authorship, Journal, and Research Methodology by Replication Outcome.
Note. ᵃIncludes quasi-experimental and group experimental designs.
We also examined the research methodology, types of intervention, and disability category of participants in identified replication studies. Studies using correlational or descriptive methodology accounted for 51.4% of identified articles, single-case design studies for 35.7%, quasi-experimental or group experimental studies for 10.0%, and qualitative methods for 2.9% of studies (see Table 1). Authors who used experimental methodologies (i.e., group, single-case design) most commonly identified replications as being successful (i.e., 6 of 7 quasi-experimental or group experimental studies; 20 of 25 single-case design studies). Authors of correlational or descriptive studies also reported a high number of successful replications (20 of 36 studies); however, these authors also reported having mixed outcomes (13 of 36) more commonly than authors using experimental methods.
For studies evaluating the efficacy or effectiveness of an intervention in an experimental design (n = 31), we coded the type of intervention (see Table 2). Interventions focused on improving nonacademic behaviors (e.g., inattention, problem behavior) accounted for 61.3% of intervention studies. Writing interventions accounted for 22.6%, reading interventions for 12.9%, and content area interventions for 3.2%. No replication studies evaluated a math intervention. Interestingly, one affiliated team of researchers (S. Graham, K. Harris, and researchers they had trained) produced five of the seven writing studies (e.g., Saddler, Behforooz, & Asaro, 2008; Sexton, Harris, & Graham, 1998). The 12 academically focused interventions involved a balance of experimental methods. Interventions targeting nonacademic behavior were predominantly evaluated with single-case designs (17 of 19 studies).
Table 2. Intervention Type by Research Methodology.
Note. ᵃIncludes quasi-experimental and group experimental designs.
The disability category of participants in studies is presented in Table 3. Most frequently, replication studies included participants with intellectual disability (27.1%), specific learning disability (24.3%), or autism spectrum disorders (17.1%). Participants with traumatic brain injury (1.4%), emotional disturbance (4.3%), and orthopedic impairment (5.7%) were less frequently represented. Participants without disabilities of varying risk status were included in 17 studies (24.3%).
Table 3. Disability Category of Participants Included in Replication Studies.
Comparison Results
After obtaining Makel et al.’s (2016) data set, we conducted an analysis to compare our results with theirs. Between the years of 1997 and 2013, we identified 70 replication studies and a replication rate of 0.409%, compared with 109 replication studies and a replication rate of 0.522% identified by Makel et al. during this same time frame. Overall, our team was more likely than Makel et al. to classify replication studies as conceptual (90.0% vs. 51.5%) and to consider the replication a mixed success (27.1% vs. 12.7%). Both teams identified a trend for replication studies published by the same authors as the original study to be more likely to be considered a success.
Overall, our exact agreement for article inclusion was 15.2%. The two research teams agreed on inclusion for 24 out of 158 studies, including three that Makel et al. (2016) identified in a separate category for single-case design replications. For all 24 studies on which both teams agreed, the term replicat* was located in the article’s abstract. Note that we excluded studies in which study replication occurred within the manuscript, which excluded a majority of the studies in Makel and colleagues’ single-case design replication category.
In our review, we included 46 studies that were not included by Makel et al. (2016); these studies were not identified in their literature search and therefore were not evaluated by them for inclusion. The most likely explanation for Makel et al.’s failure to identify these studies is variation in how the different databases that were searched index the targeted journals. For example, different databases vary in whether specific journals are indexed as full text or abstract only. Of Makel et al.’s 109 included studies, 88 were not included in our review. Eleven of these were not identified in our literature search; thus, we did not evaluate them for inclusion. Our failure to identify these studies also likely reflects variations in database indexing. In our review process, we did evaluate the remaining 77 articles for inclusion. Of these, we excluded seven studies for being within-manuscript replications and 70 for not meeting our operational definitions outlined in the method section. Most frequently, we excluded studies for using the term replicat* in what we considered a general reference to a replication of previous findings, patterns of response, or research methods instead of an explicit statement that a previously conducted, cited study had been replicated. For example, Rapley, Ridgway, and Beyer (1998, p. 37) indicated in the abstract that their “study replicated the pattern of staff:client concordance and staff overestimation of the independence and autonomy of clients reported by S. Reiter and D. Bendov (1996).” In another example, Mills, Cole, Jenkins, and Dale (1998) indicated “results replicate previous findings” (p. 79); we excluded this study because it was not explicitly stated which prior study was replicated.
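The counts reported in this comparison can be cross-checked with simple arithmetic. The sketch below uses only figures stated in the text; the variable names are ours, and the breakdown is a bookkeeping check rather than part of the original analysis.

```python
# Counts reported in the comparison results.
included_by_both = 24
included_by_us_only = 46
included_by_makel_only = 88

# All articles included by at least one team.
included_by_either = included_by_both + included_by_us_only + included_by_makel_only
print(included_by_either)  # 158

# Exact agreement on article inclusion.
agreement = 100 * included_by_both / included_by_either
print(f"{agreement:.1f}%")  # 15.2%

# Our own total: articles both teams included plus those only we included.
print(included_by_both + included_by_us_only)  # 70

# Breakdown of the 88 articles included only by Makel et al.
not_found_in_our_search = 11
evaluated_and_excluded = included_by_makel_only - not_found_in_our_search
print(evaluated_and_excluded)  # 77

within_manuscript = 7
did_not_meet_definitions = 70
print(within_manuscript + did_not_meet_definitions == evaluated_and_excluded)  # True
```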
For the 24 articles that were included in both reviews, we also examined agreement on coding of type of replication, success of replication, and author overlap. Agreement on type of replication was 58.3%. Of the 10 articles on which agreement was not obtained, our team coded seven as conceptual replications instead of direct replications. Agreement on the success of the replication was 66.7%. Makel et al. deemed all eight articles on which agreement was not obtained to be successful replications, whereas our team coded three as unclear, four as mixed, and one as unsuccessful. Author overlap agreement was 95.8%. For the one article on which we disagreed (Lee, Cramond, & Lee, 2004), our team indicated author overlap. The authors of the study reported replicating two previous studies. The first study did not have overlap, as indicated by Makel et al.; the second study did have author overlap, as reflected by our coding.
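The three agreement percentages above are consistent with the disagreement counts given in the text. The sketch below recomputes them as agreements divided by the 24 shared articles; the dictionary keys are hypothetical labels.

```python
shared_articles = 24

# Number of shared articles on which the two teams' codes differed, as reported.
disagreements = {
    "type_of_replication": 10,
    "replication_success": 8,
    "author_overlap": 1,
}

agreement_pct = {
    variable: round(100 * (shared_articles - count) / shared_articles, 1)
    for variable, count in disagreements.items()
}
print(agreement_pct)
# {'type_of_replication': 58.3, 'replication_success': 66.7, 'author_overlap': 95.8}
```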
Discussion
The purpose of our review was to determine the rate at which replication studies were published in special education research journals and examine the prevalence of specific interventions, methodologies, and participants within replication studies. We identified 70 replication articles, resulting in a replication rate of 0.409%. This suggests that special education journals publish a higher percentage of replication studies compared with education journals (0.13%; Makel & Plucker, 2014). As hypothesized, single-case design replication studies accounted for a large portion of included studies (35.7%), one possible reason for the higher replication rate in special education.
We find it troubling that experimental designs (i.e., group and quasi-experiments, single-case designs) accounted for only 45.7% of identified replication studies. Furthermore, we were able to identify only 31 experimental intervention studies, and a majority of these (19) focused on nonacademic behavior. These results provide support for Cook’s (2014) suggestion regarding a nonreplication crisis in special education, particularly when it comes to conducting experimental replications to evaluate the efficacy of interventions for individuals with disabilities. In addition, although participants from each of the IDEA disability categories were represented in at least one study, representation of children and adolescents in several disability categories (e.g., emotional disturbance, traumatic brain injury) was limited. Interestingly, individuals with autism spectrum disorder or intellectual disability were represented in 44.2% of replication studies although they represent only 15.7% of students between the ages of 6 and 21 years served under IDEA (U.S. Department of Education, 2013). Overall, the low replication rate raises a concern about the field’s ability to meet current standards for identifying evidence-based practices (Gersten et al., 2005; Horner et al., 2005).
A Different Journey
Our estimated replication rate of 0.409% is relatively close to the replication rate of 0.522% indicated by Makel et al. (2016). However, the low rate of agreement (15.2%) between our two teams on article inclusion is likely of more interest, and perhaps concern. An initial response may be to consider whether one or both research teams missed the mark in duplicating Makel and Plucker’s (2014) procedures for identifying replication studies. However, after a review of both data sets, it is apparent that the low rate of agreement was driven by two primary causes.
First, the research teams used different databases to identify articles that included the term replicat*. In other words, we used different sampling procedures. Our team used PsycInfo, PsycARTICLES, ERIC, and ProQuest Central; Makel et al. (2016) used Web of Science. In studying the differences, we learned that the inclusion of full-text journal indices varies across databases (e.g., American Journal on Intellectual and Developmental Disabilities is indexed full text in ProQuest, but not in Web of Science). This means that if a database does not index full text for a specific journal, an article in which the term replicat* appeared within the text (but not in the abstract) would not be identified for review. The fact that we identified 2,231 articles that included replicat* in the record, compared with the 357 articles identified by Makel and colleagues, supports this explanation. Thus, our initial searches using different databases likely contributed to the discrepancies between our reviews.
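The indexing effect described above can be illustrated with a small sketch. It assumes a toy set of three records and a simple regular expression standing in for the wildcard term replicat*. The records, field names, and the helper `hits` are all hypothetical and are not drawn from either review or from any database's actual search syntax.

```python
import re

# Hypothetical toy records; not drawn from either review.
RECORDS = [
    {"id": "A",
     "abstract": "This study replicates a prior reading intervention.",
     "full_text": "Procedures followed the original study."},
    {"id": "B",
     "abstract": "We examined a self-monitoring intervention.",
     "full_text": "The design was a direct replication of earlier work."},
    {"id": "C",
     "abstract": "A survey of teacher attitudes.",
     "full_text": "Teachers completed a questionnaire."},
]

# Stand-in for the wildcard search term replicat* (assumed matching rule).
PATTERN = re.compile(r"\breplicat\w*", re.IGNORECASE)

def hits(records, fields):
    """Return ids of records in which any searched field matches the term."""
    return [r["id"] for r in records
            if any(PATTERN.search(r[f]) for f in fields)]

# A database that indexes only abstracts searches one field;
# a full-text index searches both.
abstract_only = hits(RECORDS, ["abstract"])
full_text = hits(RECORDS, ["abstract", "full_text"])

print("abstract-only index:", abstract_only)
print("full-text index:", full_text)
```

In this toy case, record B is retrieved only when the full text is searchable, which mirrors how the same search term can return different candidate pools across databases.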
Second, our research team further operationalized Makel and Plucker’s (2014) inclusion and exclusion criteria and their coding definitions. As our team worked on establishing reliability across coders, there were many nuances in decision making that could not be determined from the original manuscript. For example, in the original article, Makel and Plucker indicated that their analysis assessed “whether the term [replicat*] was used in the context of a new replication being conducted” (p. 307). Using only this description, we were unsure whether Makel and Plucker would have included many of the studies we reviewed. For example, it was not clear whether a study wherein the authors stated it was a replication of prior research but did not explicitly provide a citation would merit inclusion. Likewise, the inclusion of studies in which findings were replicated—as opposed to methods—or studies replicating assessments or surveys could not be determined from the text.
As our team refined our coding scheme, we chose a conservative approach—requiring authors to explicitly state they were replicating a study (not a procedure, method, or finding) for which a citation was clearly identifiable. This appears to have been a more stringent criterion for inclusion than that applied by Makel et al. (2016) based on the 77 articles included in their current review that we located in our literature search but excluded. We encountered similar challenges in operationally defining the type of replication (e.g., direct, conceptual) and whether the replication was a success or not.
What Can We Learn From Replication?
Some may be disconcerted by the discrepant outcomes across the two reviews. In fact, this was our initial sentiment. We had started our review with the same “roadmap” (the initial Makel and Plucker, 2014, manuscript) as the team who conducted the other review. However, Makel et al. (2016) had additional information because two of the authors had conducted the original review. In a sense, they also had a “travel guide” and “travel experience.” Our findings may have been more similar to those of Makel et al. (2016) had our team initially consulted with their research team and used the coding protocols they used in their initial review. However, this would have come at the cost of our independence, and recent opinion regarding replication suggests that replication may be most valuable when conducted independently (Berliner & Glass, 2015). Independence is also often necessary when one considers that the authors of an original work may not always be available or accommodating.
Without the benefit of consultation with the original authors, our team made subtle departures from Makel and Plucker’s (2014) protocol (i.e., databases, operational definitions) that altered our findings. Although scientists may conceptually understand the impact of choices made when conducting a study, the typically limited transparency in reporting the myriad decisions made along the way suggests a lack of appreciation for their potential impact on study outcomes. Consider a recent study (Silberzahn et al., 2015) in which 29 research teams analyzed the same data set to evaluate whether soccer referees are more likely to give players with a dark skin tone a red card compared with players with light skin. Discrepancies in analytic decisions led to substantial variation in outcomes with 69% of teams observing a significant relation between skin tone and referee behavior and 31% observing no relation. The authors noted that the variation was striking given the simple question and the use of the same data set. In our case, we were also surprised by discrepancies in two similar reviews because a body of research literature is stable and well defined. Considering the complexities of conducting intervention work with children and adolescents in school settings, we hypothesize that the increased number of critical components would magnify the challenges of conducting a replication and enhance the impact of subtle departures from original research protocols.
An alternative perspective on discrepant replications comes from Nosek and colleagues (reproducibility project; Open Science Collaboration, 2015). In their independent replications of 100 studies published in three psychology journals, Nosek et al. observed considerable reductions in statistically significant effects (i.e., 35 compared with 97), substantial decreases in effect sizes (i.e., 82.8% of replication studies yielded smaller effects), and a limited number of successful replications (39, based on subjective assessment). Reflecting on these outcomes, Nosek stated, “We think of these findings as two data points, not in terms of true or false” (Carey, 2015, August 28). The authors concluded they had established zero true effects and zero false effects—a reality of doing science. Reflecting further, the authors reminded readers that, “Scientific progress is a cumulative process of uncertainty reduction that can only succeed if science itself remains the greatest skeptic of its explanatory claims” (Open Science Collaboration, 2015, pp. aac4716-4717). Thus, discrepant outcomes should be anticipated. Aside from revealing cases of outright fraud or inexcusably poor quality research, discrepant findings from replications can actually be useful to deepen our understanding of important variables that impact our outcomes. By considering our review and Makel and colleagues’ (2016) review as two data points, we were able to identify the differences in methods (i.e., databases, definitions) that led to our discrepant findings.
Limitations
Several limitations to the present study should be considered. As previously described, our university’s access to databases and JCR reports differed from that of Makel et al. (2016). The use of different databases appears to have been a key factor in many of the discrepant findings between the two studies. We also did not include studies published prior to 1997 or after 2013 due to our JCR access. Had we been able to evaluate the rate of replication for these years, our results may have been different. We highlight, however, that our comparison with Makel et al.’s similar review (2016) was restricted to the same years. Moreover, we excluded studies if the replication was presented in the same manuscript as the original study. Our rationale for this exclusion was that including within-manuscript replications would bias the sample in favor of single-case design studies. In addition, as noted by Makel and Plucker (2014), within-article replications do not adequately control for experimenter bias, error, or fraud. Had we decided to include within-article replications, our rate of replication would have been higher. Our estimation of replication success is also limited in that we did not review original studies to verify the success of replication studies. It is also worth noting that we may have located additional replication studies had we used a different search procedure or expanded our search to additional research journals. For instance, our search identified no replication studies involving math interventions, although these may be published in journals not included in our review. Finally, while searching for the term replicat* is an efficient procedure for identifying studies, studies may be missed if authors do not use a variation of replicat* in their article or if the term is not in a searchable part of the manuscript (e.g., within text for an article only indexed by abstract).
As demonstrated by Cook, Collins, Cook, and Cook (2016, this issue) and Therrien, Mathews, Hirsch, and Solis (2016, this issue), using alternate methods for identifying replication studies yields a higher estimate of the rate of replication in special education research.
Conclusion
Our findings demonstrate that replication studies are published at a slightly higher rate in the special education research literature (i.e., 0.41%) compared with the general education research literature (i.e., 0.13%; Makel & Plucker, 2014). We also confirmed that several trends identified by Makel and Plucker for the general education literature (i.e., a high proportion of replications with author overlap; a higher proportion of successful replications in studies with author overlap) also exist in the special education literature. However, we believe the limited number of replication studies we identified (n = 70) is problematic for a field that purports to value replication as a tenet of scientific inquiry. Although our search method (i.e., identifying articles in which the term replicat* was present) likely provides an underestimate of the actual rate of replication, our findings highlight a potential deficiency in the use of replication to validate prior claims and an unfortunate disdain for explicitly describing one’s work as a replication study. Special education researchers should reexamine both the use of replication as a tool for inquiry and the explicitness with which aspects of replication are identified in published studies (see Coyne, Cook, & Therrien, 2016).
Beyond this, our exploration of possible reasons for the different outcomes of two reviews seeking to replicate the same previous review highlighted several important lessons. First, we think that academics would benefit from a deeper understanding of how specific electronic databases work. The economics of scholarly publishing mean that universities are being forced to limit subscriptions and, thus, faculty at different universities will likely have access to databases that are not equivalent. Discussions with our information specialist highlighted concerns with the costs associated with providing access to published literature and underscore the potential consequences of not consulting with an information specialist when conducting reviews (Maggio, Tannery, & Kanter, 2011). Although beyond the scope of this article, we encourage readers to learn more about what Larivière, Haustein, and Mongeon (2015) described as “the oligopoly of academic publishing” (p. 1), the impact of corporate control on published research and database accessibility, and arguments for increasing open access to published research (Schmitt, 2014, December 23). Limiting access to published research will inherently hinder scientific progress. Second, replicating a study from a published manuscript poses a substantial challenge. The sheer number of critical decisions made in any study simply cannot be reported within a typical manuscript. Any advances in greater sharing of expanded research protocols through online supplements to print journals or through open access movements will increase our ability to conduct higher quality independent replication studies. In advance of this change, researchers may benefit from consulting with the authors of an original study prior to attempting replication. Third, our review highlights the challenge with locating studies that are replications. 
Journal editors should consider emphasizing the need for authors to identify replication studies more explicitly within manuscript abstracts. Coyne et al. (2016) provide additional guidance on conducting and reporting replication studies in special education research.
Our comparison of reviews provides evidence that unanticipated variations in sample, context, or procedures may help explain differences in outcomes (Klein et al., 2014; Open Science Collaboration, 2015). Regardless of how closely researchers attempt to follow a previously published study, context matters (e.g., see Coyne et al., 2013). In educational intervention research, the complexities of all of the interacting variables challenge our abilities to predict when and where an intervention may work. Better understanding of unanticipated variations may hold the key to ensuring that experimental findings generalize beyond the lab into classroom settings (Berliner & Glass, 2015).
Footnotes
Acknowledgements
We appreciate the collegiality of Matthew Makel and his colleagues. This manuscript would not have been possible without their support. We also thank Lee Ann Lannom for technical assistance with library resources.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
