Abstract
Current practices in study design and data analysis have led to low reproducibility and replicability of findings in fields such as psychology, medicine, biology, and economics. Because gifted education research relies on the same underlying statistical and sociological paradigms, it is likely that it too suffers from these problems. This article discusses the origin of the poor replicability and introduces a set of open science practices that can increase the rigor and trustworthiness of gifted education’s scientific findings: preregistration, open data and open materials, registered reports, and preprints. Readers are directed to Internet resources for facilitating open science. To model these practices, a pre-peer-review preprint of this article is available at https://psyarxiv.com/nhuv3/.
The social sciences are experiencing a crisis of confidence (Pashler & Wagenmakers, 2012) affecting the very foundations of scientific research—that is, reproducibility, generalizability, and the accuracy of findings. Though the terms reproducibility and replicability are sometimes used interchangeably, reproducibility strictly refers to the ability to analyze the same data and get the same result, whereas replicability involves getting the same result from a new, identically designed study (Condon, Graham, & Mroczek, 2017). Simmons (2016) wrote, “At some fundamental level a scientist’s #1 job is to differentiate what is true/replicable from what is not” (para. 4). He continued, this “means that replicability is not merely a consideration, but the most important consideration” (para. 5). The ability to be successfully replicated is the most basic requirement that a scientific finding must fulfill to have any value. Neither valid science nor evidence-based policy can be constructed from flukes, coincidences, or one-off findings. Research results must fulfill many requirements to be genuinely useful, but the lowest bar they must clear is that they be observable again. As Karl Popper (1959) wrote, “Non-reproducible single occurrences are of no significance to science” (p. 66).
Emphasis on replication is not new. R. A. Fisher (1935), one of the progenitors of statistics, wrote,
In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure . . . we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (p. 14)
The centrality of replication research dates to the earliest days of statistical inference but has unfortunately been largely abandoned in the social sciences (Makel & Plucker, 2014; Makel, Plucker, & Hegarty, 2012).
Although the degree of replicability of gifted education research is currently unknown, there are good reasons to be concerned. This is because the field relies on the same methods that are responsible for the replication crisis in other social sciences. The origin and causes of the replication crisis are longstanding and complex, but they are important to understand if gifted education practices are to be improved. If best practices are determined by the extent to which research supports their effectiveness, knowing the reproducibility, generalizability, and accuracy of research data is of paramount importance. This is particularly important in relatively small fields such as gifted education, where there is little replication and relatively (compared with many other social science fields) few active researchers.
Problems of Current Practice
We devote the first section to introducing readers to many of the problems inherent in contemporary research practice. The second section introduces some of the principles that we believe will help gifted education researchers produce research with more reproducible, generalizable, and accurate results.
Replications Reveal Inconsistencies
We begin with an example from social psychology, where numerous previously well-established phenomena have not been reproduced in replication studies. For example, in the famous “pen study,” Strack, Martin, and Stepper (1988) asked participants to hold a pen either between their teeth (activating the same muscles used while smiling) or between their lips (activating muscles used while frowning) while rating the humor of a set of The Far Side comics. In the original study, comics were rated significantly funnier (d = 0.39, one-sided p = .03) 1 when participants held the pen between their teeth. This finding was used to support the facial feedback theory of emotion, which posits that facial movements cause rather than respond to emotional states. At the time of this writing (November 2017), the Strack paper had been cited 1,695 times (as counted by Google Scholar) and discussed in numerous introductory psychology textbooks. As far as the field of psychology was concerned, this finding was factual. However, in August 2016 the result of a large-scale, preregistered, multilab collaborative replication of the original study was published in Perspectives on Psychological Science as part of their Registered Replication Report series (Wagenmakers et al., 2016). In this project, 17 independent research teams performed exact replications of Strack et al.’s (1988) original experiment. The project leaders then combined all 17 results via meta-analysis to produce a single aggregated finding. The overall result showed no evidence of a facial feedback effect, with an estimated effect of 0.03 ± 0.13. Bayesian data analysis revealed that all 17 labs produced evidence supporting the null hypothesis of zero effect, with the vast majority providing reasonably strong evidence in favor of the null. 2 This is just one of several recent examples where seemingly established findings have not been reproduced (e.g., Open Science Collaboration, 2015).
Failed replications are a problem extending beyond social psychology and into areas directly relevant to educational psychology and giftedness research. For example, some of the foundational assumptions of growth mind-set theory have failed to replicate (e.g., Bahnik & Vranka, 2017; Li & Bates, 2017). Recent evidence suggests that the phenomenon of stereotype threat may be either nonexistent or at least much weaker than previously thought (Finnigan & Corker, 2016). Other phenomena such as the Pygmalion effect, which appeared large originally, have appeared much smaller on replication (e.g., Jussim & Harber, 2005).
Replication Attempts Have Been Infrequent
In gifted education, Makel and Plucker (2014) found that only 1 in 200 articles were attempted replications of prior research. They found similarly low rates in psychology journals (Makel et al., 2012) and in education journals more generally (Makel & Plucker, 2014). The low rate of attempted replications means false findings in giftedness research, which are an inevitable by-product of any scientific endeavor, are rarely discovered, challenged, or corrected. In most cases, the underlying scientific questions are considered to be settled once initial information is obtained. In one of the few replication studies published in gifted education, Peters and Pereira (2017) analyzed the factor structures of the Gifted Rating Scales, Scales for Identifying Gifted Students, and the HOPE Scale, three of the most commonly used teacher rating scales for gifted behaviors (Callahan, Moon, & Oh, 2013). Their analysis failed to replicate the intended factor structure of each instrument they examined, thus raising questions about the construct validity of the scores produced by these instruments.
There may (or may not) be many other research claims in the field of gifted education that are on less solid footing than many might assume, but there is no way to know because replication studies are exceedingly rare. Researchers in psychology and biomedical fields have responded to such issues by substantially increasing their replication efforts. In this manuscript, we introduce possible paths toward such a response within gifted education, with the goal of increasing the trustworthiness and credibility of gifted education research findings. Doing so would place gifted education at the forefront of education sciences and help ensure that actual evidence underlies the evidence-based practices that the field identifies and promotes to practitioners.
The Origin of Incorrect Research Results
When effects are nonexistent, how is it that researchers find evidence for those effects in the first place? Several mechanisms can contribute to the problem.
Statistical False Positives
All statistical hypothesis tests have the potential for a false-positive (Type-I) error, in which the researcher concludes that there is statistically significant evidence of a nonexistent effect. In fact, the primary virtue of the null hypothesis significance testing paradigm is its control of Type-I (false-positive) errors. The risk of making such an error is controlled by two factors—the choice of the alpha level (by tradition, set at the .05 level), and, more important, the process used to obtain the statistical evidence (Goodman, 1999; Kruschke, 2013; McBee & Field, 2017; see also Rouder, Morey, Verhagen, Province, & Wagenmakers, 2016). In contexts in which a strictly true null hypothesis is inappropriate (e.g., comparison of educational outcomes between states or districts), there can be no such thing as a Type-I error, but statistical estimates can still go in the wrong direction (“Type-S” [sign] errors) or can wildly and systematically overestimate the true effect size (“Type-M” [magnitude] errors; Gelman, Hill, & Yajima, 2012; Gelman & Carlin, 2014). Common analytic practices described in the next section dramatically increase the risk of Type-I and Type-S errors, and inflate Type-M errors, all of which decrease the quality and replicability of the literature.
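Type-S and Type-M errors can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the cited papers: the true effect (0.1), the standard error (0.5), the normal sampling model, and the function name are all assumptions chosen to represent a small effect studied with a noisy design. It keeps only the estimates that reach two-sided p < .05 and asks how often those have the wrong sign and how much they exaggerate the true effect.

```python
import random
import statistics


def simulate_sign_and_magnitude_errors(true_effect, std_error, n_sims=100_000, seed=1):
    """Simulate noisy estimates and summarize those reaching two-sided p < .05.

    Assumes each estimate is drawn from Normal(true_effect, std_error); these
    values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    critical_z = 1.96  # two-sided .05 threshold for a normal test statistic
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate) / std_error > critical_z:
            significant.append(estimate)

    wrong_sign = sum(1 for e in significant if e * true_effect < 0)
    mean_abs_significant = statistics.mean([abs(e) for e in significant])
    return {
        # share of all simulated studies that reach significance
        "power": len(significant) / n_sims,
        # share of significant estimates pointing in the wrong direction (Type-S)
        "type_s_rate": wrong_sign / len(significant),
        # average overestimation among significant estimates (Type-M)
        "exaggeration_ratio": mean_abs_significant / abs(true_effect),
    }


result = simulate_sign_and_magnitude_errors(true_effect=0.1, std_error=0.5)
print(result)
```

Under these assumed values, every estimate that reaches significance must be at least 0.98 in absolute size, so a significant result overstates the assumed true effect of 0.1 roughly tenfold or more, and a nontrivial share of significant results have the wrong sign.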
Multiple Comparisons
When researchers test a set of hypotheses, the risk of making at least one false-positive decision increases dramatically with the number of tests. 3 This is the reason why multiple comparisons must be corrected (e.g., by the Bonferroni adjustment). This problem becomes especially severe when many tests are performed but not all are reported. It follows that statistical significance is easily achieved simply by conducting a sufficiently large number of (unadjusted) statistical tests. Given how easy computers have made it to quickly run hundreds if not thousands of statistical tests, this problem is widespread. When a published paper fails to describe the multiple comparisons that led to a specific result, instead presenting in isolation and emphasizing the tests that “achieved significance,” readers are given a highly misleading impression regarding the strength of the evidence supporting the study’s claims.
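The growth in risk can be illustrated with a short calculation. The sketch below assumes that the tests are independent and that every null hypothesis is true, in which case the chance of at least one false positive across k tests is 1 − (1 − α)^k. The function names are ours and purely illustrative.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent true-null tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test alpha level under the Bonferroni adjustment."""
    return alpha / n_tests


for n_tests in (1, 5, 10, 20, 100):
    unadjusted = familywise_error_rate(0.05, n_tests)
    adjusted = familywise_error_rate(bonferroni_alpha(0.05, n_tests), n_tests)
    print(f"{n_tests:>3} tests: unadjusted {unadjusted:.3f}, Bonferroni {adjusted:.3f}")
```

With 20 unadjusted tests at the .05 level, the chance of at least one spurious "significant" result is about 64% under these assumptions, whereas the Bonferroni adjustment holds it just under 5%.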
Sociological Context of Research
The practice of research is embedded in a sociological context that strongly incentivizes particular actions. Deep problems arise when these incentives encourage counterproductive behaviors. For example, most research in education is performed by university faculty and their graduate students who are in the employ of research-intensive universities. These individuals are evaluated annually on their research productivity, as measured by the number of publications produced, the prestige of the journals in which these are published, and the grant funding obtained to support these activities. Individuals must demonstrate strong research productivity to be competitive for future academic positions and additional grant funding. Assistant professors need publications to get tenure and be promoted, and associate professors need publications and often grant funding to be promoted to full professor. Much of the financial model at research-focused universities rests on funds that are generated directly through overhead charges on grant revenue, and this connects research productivity closely to the financial health of the institution. Under this incentive model, the quantity of production is highly prized (Budd, 2017). More is better. Fewer (even if better quality) publications typically are ascribed a lower value in such systems, and lower publication numbers may even be viewed as a sign of inferior or inadequate scholarship (Feist, 1997). For example, Nobel prize–winning physicist Peter Higgs (of Higgs boson fame) argued that he would be unable to land an academic job in today’s market due to an insufficient number of publications, as he has published only 10 papers since his groundbreaking work in 1964 (Aitkenhead, 2013; Benderly, 2013).
Academic journals have a limited amount of space in which to put content—their “page budget”—as dictated by printing and binding limitations. 4 This creates strong competition for the available space. Journals are unable to publish all the scientifically rigorous submissions they receive, leading them to select articles based on the novelty of the findings or the compelling nature of the narrative presented in the paper. Papers presenting novel findings or positive claims are viewed as more interesting than replication reports or negative findings. As a result, novel or positive findings dominate the published record.
The relative prestige of different journals is related to their circulation and readership, and this often is measured by the number of times the articles they publish are cited by other scholars in the same field (Hicks, Wouters, Waltman, de Rijcke, & Rafols, 2015). This is measured primarily by the impact factor, as calculated by Clarivate Analytics. Journals are incentivized to publish research that will garner the most citations and thereby increase the journal’s prestige and impact factor. As a result, journals seek to publish work that is exciting and new; that tells a compelling, straightforward story; and above all, that finds statistically significant evidence supporting the article’s central claims.
Despite the central value of null findings in a falsificationist model of science (Dienes, 2016), nonsignificant findings often are viewed as uninteresting and therefore are far less likely to be published. Perhaps most commonly, researchers simply do not write up or submit the results of “failed”/ nonsignificant studies because they anticipate rejection and choose to allocate their efforts in a more productive fashion (Dickersin & Min, 1993; Franco, Malhotra, & Simonovits, 2014). As a result, null results often end up in the researcher’s file drawer (Sterling, 1959) where they are inaccessible to the research community. When studies produce a mixture of positive and negative results, researchers commonly expunge the negative results from the published version of the study to simplify the message or increase its chance of getting published (O’Boyle, Banks, & Gonzalez-Mulé, 2017; Pigott, Williams, & Valentine, 2017). The result is that the published scholarly work on a topic represents a biased sample of the larger universe of research that was actually performed. 5 When null findings are published, these papers typically get cited less often (Duyx, Urlings, Swaen, Bouter, & Zeegers, 2017). A related problem, and one that is particularly salient within gifted education, is the overlapping relationship between advocacy and research: Because of this overlap, researchers may be reluctant to publish findings that show statistical significance but that have implications that do not foster advocacy for children with gifts and talents. This last issue, while clearly relevant, does not have a straightforward solution.
As Schimmack (2012) pointed out, publication bias on its own is sufficient in the long run to destroy the error control properties of p values within the null hypothesis significance testing framework. In other words, when only statistically significant findings are published, the rate of false positives among the published literature is much higher than the nominal 5% that researchers assume to be the case. For example, if nonexistent phenomenon X is studied 20 times, on average one of these 20 tests should yield a false-positive Type-I error. But if this study is the only one published on phenomenon X, the evidence base in the literature will seem to support its existence because the other 19 studies have not been recorded in the literature. Fanelli (2010) found that 91% of papers in psychology and psychiatry reported evidence supporting their central hypotheses, the highest rate of any field he examined. This suggests either that psychologists have such penetrating theoretical insight into human behavior that their hypotheses are almost never wrong or that publication in psychology is strongly biased toward statistically significant findings.
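The "20 studies of a nonexistent phenomenon" example can be sketched as a simulation. The code below is a hypothetical illustration, not a reanalysis of any cited study. It assumes that every phenomenon has a true effect of exactly zero, that each study yields a standard normal test statistic, and that only results with two-sided p < .05 leave the file drawer.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_file_drawer(n_phenomena=5_000, studies_each=20, alpha=0.05, seed=2):
    """Count 'published' (significant) studies per nonexistent phenomenon."""
    rng = random.Random(seed)
    published_counts = []
    for _ in range(n_phenomena):
        # The true effect is zero, so every z statistic is pure noise.
        n_published = sum(
            1 for _study in range(studies_each)
            if two_sided_p(rng.gauss(0, 1)) < alpha
        )
        published_counts.append(n_published)
    return published_counts


counts = simulate_file_drawer()
mean_published = sum(counts) / len(counts)
share_with_support = sum(1 for c in counts if c > 0) / len(counts)
print(f"Mean significant studies per phenomenon: {mean_published:.2f}")
print(f"Share of phenomena with at least one 'supporting' study: {share_with_support:.2f}")
```

In this sketch about one study in 20 reaches significance, as the nominal error rate promises, yet every one of those published results is a false positive, and well over half of the nonexistent phenomena end up with at least one supporting study and no visible contrary evidence.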
P-Hacking
Strategies for converting nonsignificant findings into statistically significant ones (which then are likely to represent Type-I errors) are collectively known as p-hacking, and these appear to be among the most widespread of the various types of questionable research practices (QRPs; Simmons, Nelson, & Simonsohn, 2011). Although p-hacking has seemingly emerged as the consensus descriptor of these practices, Yarkoni and Westfall’s (2017) label of procedural overfitting probably describes more accurately what the process entails. A common set of such strategies was discussed in Simmons et al.’s (2011) landmark piece, “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.” The four strategies included (1) measuring additional outcome variables, (2) adding 10 additional participants per condition and rerunning the hypothesis tests, (3) controlling for a covariate after peeking at the data, and (4) dropping (or including) experimental conditions.
The application of these four strategies in combination resulted in a 60.7% false-positive rate. The researchers then applied these strategies to real data, resulting in statistically significant evidence that participants who were randomly assigned to listen to the Beatles song “When I’m Sixty-Four” actually became 1.4 years younger than those assigned to listen to “Kalimba.” This absurd result illustrates that p-hacking techniques can produce statistical evidence for almost any conclusion a researcher wishes to reach, even without malicious or fraudulent intent. To quote Simmons, Nelson, and Simonsohn’s (2018) retrospective on their false-positive psychology paper, “Everyone knew it [p-hacking] was wrong, but they thought it was wrong the way it’s wrong to jaywalk. We decided to write ‘False-Positive Psychology’ when simulations revealed it was wrong the way it’s wrong to rob a bank” (para. 5).
John, Loewenstein, and Prelec (2012) surveyed a group of 2,000 psychologists about their use of these (and other) QRPs such as rounding p values below the .05 threshold or falsifying data. Responses indicated that many of these practices are in common use. For example, 66.5% of the respondents indicated that they had dropped nonsignificant outcomes from their papers, and 58% reported engaging in “optional stopping” (Lakens, 2014), by deciding to collect more data after seeing a nonsignificant interim statistical test. Error rates are controlled much more strictly when researchers perform a confirmatory hypothesis test after reaching a preplanned sample size rather than examining dozens or hundreds of exploratory p values calculated at different sample sizes in search of something interesting to report. The true process that generated a finding must be understood to interpret properly the result of a statistical test, yet this information is frequently incomplete or even absent from published accounts.
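The cost of optional stopping can be sketched with a simulation. The code below is a simplified illustration and not a reproduction of any cited analysis. It assumes two groups with no true difference, normally distributed scores with a known standard deviation of 1 (so a z test stands in for the usual t test), and hypothetical choices of a starting sample of 20 per group with 10 more added after each nonsignificant look, up to 50 per group.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def study_is_significant(rng, start_n, step, max_n, alpha=0.05):
    """Run one two-group study with no true effect, testing after each batch."""
    sum_a = sum_b = 0.0
    n = 0
    target = start_n
    while target <= max_n:
        while n < target:  # collect data up to the next planned look
            sum_a += rng.gauss(0, 1)
            sum_b += rng.gauss(0, 1)
            n += 1
        z = (sum_a / n - sum_b / n) / math.sqrt(2 / n)
        if two_sided_p(z) < alpha:
            return True  # stop and "report" as soon as p < alpha
        target += step
    return False


def false_positive_rate(start_n, step, max_n, n_sims=5_000, seed=3):
    """Share of simulated null studies that end with a significant result."""
    rng = random.Random(seed)
    hits = sum(study_is_significant(rng, start_n, step, max_n) for _ in range(n_sims))
    return hits / n_sims


fixed = false_positive_rate(start_n=50, step=10, max_n=50)    # one look at n = 50
peeking = false_positive_rate(start_n=20, step=10, max_n=50)  # looks at 20, 30, 40, 50
print(f"Preplanned sample size: {fixed:.3f}")
print(f"Optional stopping:      {peeking:.3f}")
```

Under these assumptions the preplanned design holds the false-positive rate near the nominal 5%, whereas testing after every batch and stopping at the first significant result pushes it noticeably higher, even though each individual test uses the conventional .05 threshold.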
Hypothesizing After Results Are Known
Another questionable research practice that bears mention is hypothesizing after the results are known (HARKing; Kerr, 1998). HARKing often occurs in response to a “failed” confirmatory study—one in which the primary hypotheses were unsupported by the data. Having obtained an answer that they do not like, researchers search for some alternative research question to which the answer is “yes” (Makel, 2014, p. 4). HARKing becomes especially problematic when researchers take the extra step of rewriting the whole paper, including the literature review and hypotheses, around these revised post hoc hypotheses. In this case, an exploratory study is presented as though it had been confirmatory all along. These two modes of research yield markedly different levels of evidence, and it is misleading for researchers to present the results of exploratory analysis as though it had been confirmatory. Unfortunately, current norms do not prohibit such practices, and current incentives encourage them. Few researchers have the luxury of “wasting” a study by not publishing it when it could be salvaged using techniques that are not prohibited and in some cases are even recommended by journal editors or peer reviewers. It is difficult to insist that researchers behave properly when, as Simmons, Nelson, and Simonsohn (2012) wrote, “. . . there is no shared understanding of what ‘properly’ is” (pp. 4-5).
For example, researchers who failed to observe a hypothesized main effect of curriculum effectiveness in a Javits project might then examine unplanned post hoc subgroup analyses. If an overall effect wasn’t found, perhaps a treatment effect will be observed in males only or in African American students, or in Title I schools but not others. Given the reality of confirmation bias and the ease with which the human mind can concoct ex post facto explanations for almost anything, it is quite easy to produce a plausible theoretical explanation for any post hoc finding. Based on our combined experience reviewing journal and conference submissions in the field of gifted education, we believe that this is one of the most common QRPs in the field, so much so that people do not even think of it as a questionable practice. Perhaps a gifted curriculum showed no main effect on student learning, but when authors conducted several rounds of post hoc subgroup analysis, they found a statistically significant effect for low-income, African American students. The curriculum is then presented as if it is a research-supported intervention for low-income African Americans when, in reality, this finding is likely to be completely spurious and unlikely to replicate.
Many researchers believe that HARKing (and other QRPs, e.g., optional stopping) are fully responsible for the findings reported in Bem’s (2011) infamous precognition paper. This paper was published in the Journal of Personality and Social Psychology, arguably the flagship social psychology journal. The nine studies it described seemed to show that participants’ past behaviors were being altered by future events. For example, in Bem’s Experiment 1, participants predicted with greater-than-chance accuracy (p = .01) on which side of the screen an erotic image—but not any of the other four types of images that were examined—would appear. It strains credulity past the breaking point to believe that a specific hypothesis predicting precognition effects for erotic stimuli but no other type existed in advance of seeing the data, and the pattern of sample sizes across Bem’s nine experiments is suggestive of optional stopping (Yarkoni, 2011). Indeed, the publication of this paper is now considered to be one of the watershed events in psychology’s crisis of confidence, as it provided a clear example of the fallibility of current statistical and research practices (Gelman, 2016). If status quo methods could produce such implausible findings, then it stands to reason that they could have produced erroneous findings many times before and since—and in ways that were insidiously undetectable to peer reviewers or journal editors, the quality control agents of academe.
What About Meta-Analysis?
Meta-analysis, although a useful research tool (e.g., Pigott & Moon, 2016), is not a viable alternative to replication or an effective way to correct a flawed scientific record. The quality of a meta-analytic study is limited by the quality of the individual studies on which it is based. Put bluntly, “garbage in, garbage out” is the rule. Moreover, given the file drawer problem and other forms of publication bias, meta-analyses of published results may lead to overconfidence in a phenomenon by overestimating effect sizes, because they aggregate over findings affected by Type-M (magnitude) errors (e.g., Gelman & Carlin, 2014; Simonsohn, 2017). Meta-analysis can be a useful technique, but it is one that is best applied to preprints or registered reports (see below) in which publication decisions are decoupled from the study’s findings.
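A small simulation can illustrate why pooling a biased literature does not remove the bias. The sketch below is hypothetical throughout: it assumes a true standardized effect of 0.1, per-group sample sizes of 20, 40, or 80, a known standard deviation of 1, a fixed-effect (inverse-variance) pooled estimate, and a publication filter that admits only significant results in the predicted direction.

```python
import math
import random


def pooled_estimate(studies):
    """Fixed-effect (inverse-variance weighted) pooled estimate.

    Each study is an (estimate, standard_error) pair.
    """
    weights = [1 / se ** 2 for _est, se in studies]
    weighted_sum = sum(w * est for w, (est, _se) in zip(weights, studies))
    return weighted_sum / sum(weights)


def simulate_studies(true_effect, n_studies, seed=4):
    """Simulate two-group studies of varying size with a known SD of 1."""
    rng = random.Random(seed)
    studies = []
    for _ in range(n_studies):
        n_per_group = rng.choice([20, 40, 80])
        se = math.sqrt(2 / n_per_group)  # SE of a mean difference when SD = 1
        studies.append((rng.gauss(true_effect, se), se))
    return studies


all_studies = simulate_studies(true_effect=0.1, n_studies=2_000)
# Assumed publication filter: significant and in the predicted direction.
published = [(est, se) for est, se in all_studies if est / se > 1.96]

pooled_all = pooled_estimate(all_studies)
pooled_published = pooled_estimate(published)
print(f"Pooled estimate, all studies:      {pooled_all:.3f}")
print(f"Pooled estimate, published only:   {pooled_published:.3f}")
print(f"Studies published: {len(published)} of {len(all_studies)}")
```

When every study is included, the pooled estimate recovers the assumed true effect closely. When only the "published" subset is pooled, the estimate is several times too large, because each surviving study had to overestimate the effect to pass the significance filter.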
Other Cautionary Examples
Although examples from outside gifted education may seem remote to some readers, we believe that they offer instructive cautions that apply to research on giftedness. Below, we present some examples that have been widely publicized in psychology research but perhaps are less familiar to scholars or practitioners working in education.
Given reporting standards that typically do not mandate that authors describe the true process that led to their claims, perverse incentives operating at the level of individual researchers and academic journals, widespread adoption of p-hacking strategies, and lack of a robust culture of replication (meaning that false-positive claims are never detected), one might predict that the overall truth value of published claims in the social sciences and allied fields is very low. Indeed, that is precisely what has been observed.
The Reproducibility Project: Psychology was a collaborative attempt by 270 individual researchers to replicate 100 studies published in a variety of subdisciplines of psychology. Ninety-seven percent of the original studies reported statistically significant effects, with a mean effect size of .403. On replication, only 36% reported statistically significant results, with a mean effect size of .197. The authors deemed that 39% of the original effects were successfully replicated (Open Science Collaboration, 2015). Five of the six large-scale Registered Replication Reports published in Perspectives failed to obtain any evidence in favor of the original claim.
In the field of cancer biology, Begley and Ellis (2012) attempted to replicate a set of 53 “landmark” studies. These papers had been published in journals with impact factors of at least five (indicating 2-year citation rates far larger than any education journal) and had received, on average, several hundred citations each. Only 11% of these studies were replicated. Neither the number of citations nor the impact factor of the publishing journal had any relationship with the likelihood that a study’s results would be supported in replication.
As the preceding examples confirm, QRPs are common across many fields of study, and there is no reason to suspect that giftedness research has been insulated from such practices. In the absence of strong norms about the boundaries of acceptable practices, grey areas, fraud, HARKing, and other QRPs will continue to flourish, thereby preventing the furtherance of science.
Open Science
In the previous section, we described several problems plaguing the current social science literature and its research practices. In response, a growing movement in the social sciences revolves around a range of practices generally referred to as “open science.” Open science practices not only help improve the quality of the research produced; they also benefit the individual researchers who use these techniques (e.g., Markowetz, 2015; McKiernan et al., 2016; Wagenmakers & Dutilh, 2016). Though listing all the benefits of open science practices is beyond the scope of the current manuscript, we refer the reader to Makel and Plucker (2017) for a more complete review. Table 1 summarizes some open science practices and the problems they help address and recommends some resources to assist in their implementation. Below, we describe in greater detail four specific open science practices (preregistration, open data, open materials, and preprints), explain what each contributes to the research endeavor, consider potential challenges, and suggest how individual researchers can begin incorporating these practices.
Table 1. Resources for Implementing Open Science Practices.
We are not making a blanket statement that all giftedness research should use all these practices in every study. Rather, we recommend that giftedness researchers begin using these practices, whenever possible, to help improve the quality of their work and the associated value it provides to practitioners and policymakers. The choice of which open science practices are implemented (and even whether they are used at all) should be determined by the individual researcher and based on the specifics of the particular project.
Preregistration
Preregistration requires researchers to articulate the confirmatory and exploratory components of a study prior to data collection. For confirmatory aims, the hypotheses, design, and analysis plan must be articulated. Several websites, such as the Open Science Framework (OSF: https://osf.io) and aspredicted.org, provide free and easy ways for researchers to preregister their predictions and analysis plans (and the OSF site includes a step-by-step how-to example: https://osf.io/sgrk6/). Some authors express concern that preregistration creates additional work, but we feel it simply shifts work from the writing phase to the planning phase. Researchers have to write these sections at some point anyway (e.g., to receive institutional review board [IRB] approval), so preregistration merely shifts the time frame in which this writing is done. Preregistration provides readers and reviewers with confidence that exploratory and confirmatory analyses are appropriately identified and that HARKing, p-hacking, or other QRPs were not the cause of a finding. This increases the persuasive value of the research (McBee & Field, 2017). Crucially, these plans can be kept completely private until after the research has been completed so that others cannot see what the researchers have preregistered. This privacy prevents anyone from scooping or even plagiarizing others’ good ideas and also prevents any kind of reviewer bias from coming into play as a result of any unmasking that might occur prior to publication. Thus, preregistration offers a very straightforward way to increase the validity and overall trustworthiness of research.
Misconceptions About Preregistration
There are some widely stated concerns about preregistration that can be misleading. First, preregistration does not hinder exploratory research or frown on its use. Rather, preregistration helps differentiate what is exploratory from what is confirmatory. Acknowledging that these two types of research are different is not a burden, nor does it mean that one is “better” than the other; rather, such acknowledgement is a necessary part of the scientific endeavor. Moreover, after preregistering hypotheses and analysis plans, researchers still can add subsequent analyses that were developed after data collection—these are then designated as exploratory. Preregistration simply helps make it clear, within the final description of the work, which part is which. Through preregistration, any reader easily can verify which results were confirmatory and which were exploratory.
Benefits of Preregistration
Preregistration is simple to complete, is free, and increases the credibility of the work by reassuring readers that QRPs, HARKing, or selection on significance were not the cause of a finding. It also helps prevent researchers from forgetting their original hypothesis and protects against hindsight bias. The only “downside” is that preregistration prevents any p-hacking to produce the statistical conclusion(s) that the researcher may have hoped to reach. Preregistration increases the credibility and rigor of a manuscript, as well as of the larger body of knowledge to which it contributes. Of course, if what is preregistered is vague or incomplete, it will provide little support. In their checklist for avoiding p-hacking, Wicherts et al. (2016) noted that preregistration provides strength only to the extent that it is specific, precise, and exhaustive. Again, if researchers have no a priori hypothesis in mind, then an exploratory study is perfectly reasonable. But preregistration helps support and enforce the distinction between confirmatory and exploratory work.
In addition to strengthening the quality of the research literature, preregistration also benefits researchers and the research process. Wagenmakers and Dutilh (2016) recently outlined “seven selfish reasons for preregistration.” These included giving clear credit for actual predictions (and providing evidence that they aren’t HARKed), adding protection against postpublication accusations, and helping avoid the de facto requirement that results be statistically significant to be published. Although most published papers in the social sciences present statistically significant findings (Fanelli, 2010, 2012), the fundamental goal of scientific research is not statistical significance or even novelty; it’s the pursuit of ever-greater approximations of underlying reality. Removing the burden of requiring statistical significance from published research will help remove counterproductive incentives for individual researchers to engage in QRPs by rewarding rigorous methodology rather than sensational findings. Moreover, given that research papers can be seen as persuasive essays whose goal is to convince readers to update their understanding of the world (McBee & Field, 2017), it is in the author’s direct personal interest to take actions that increase the credibility. Credibility is a prerequisite for impact, and we believe that preregistration is the single most effective way by which authors can increase the credibility of their claims. The degree to which this increase in credibility could aid in gifted education policy advocacy cannot be overstated.
Preregistration also provides direct benefits to education research at large. Peer review is fallible; it misses mistakes at times, and it can be manipulated even in the absence of any ill intent. Preregistration decreases the degree of blind trust that readers must grant to authors by providing confirmation that a rigorous process was actually followed. For example, when clinical trials started requiring preregistration of hypotheses, statistically significant findings dropped from 57% to 8% of trials (Kaplan & Irvin, 2015). Some researchers may find this disturbing because it means that they can publish fewer “statistically significant” studies, but the larger positive goal is that published studies will be more likely to offer a true representation of the phenomenon under investigation.
Perceived Challenges to Adopting Preregistration in Gifted Education
Many giftedness studies use preexisting, administrative, or archival data that exist prior to the project’s inception. This can complicate the preregistration process. With national data sets, researchers may (or may not) already be familiar with findings, descriptive statistics, or other information derived from the data (e.g., perhaps because these have been published in other studies that have drawn on the same data sources). When working with preexisting district-level data, researchers may have only limited-term access due to data sharing agreements, or may be unable to see details of the data prior to lengthy approval procedures, hindering the ability to make sufficiently specific a priori predictions.
Although researchers would be unable to produce a truly a priori preregistration document under such circumstances, a more limited (though still valuable) form of preregistration can be used. Authors still can and should articulate their research hypotheses, the statistical analysis plan, and housekeeping details such as how missing data will be handled or the weighting strategy, in advance of analyzing the data or even in advance of taking possession of it or gaining any detailed knowledge about it. It is important to understand that preregistration is not a fraud-prevention mechanism but rather a hedge against confirmation bias and post hoc reasoning. The purpose of preregistration under these conditions is simply to enable the distinction, to the extent possible, between exploratory and confirmatory research.
The question of how the preregistration model can be applied to research beyond the original narrowly controlled examples (e.g., lab experiments) is a topic under active discussion by the open science community at the time of this writing. A statement of principles and a proposed preregistration template articulated at the 2017 meeting of the Society for the Improvement of Psychological Science is available at https://osf.io/cgw86/. An example of a preregistration document for a study using archival data (though not in the social sciences) can be found at goo.gl/PTKM5q.
Open Data and Open Materials
Sharing project data and study materials allows readers to verify the correctness of the presented statistical analysis and to test the robustness of results against alternative model specifications, and it facilitates exact replication studies. Study data could consist of de-identified data sets. Study materials include items such as analysis code, project memos, codebooks, questionnaires or instruments, stimulus materials, lab notebooks, data collection schedules, grant proposals, IRB protocols, videos or photographs of the research setting, researcher scripts, or any other archival record generated during the study.
Benefits of Open Data and Open Materials
Openness and transparency are fundamental scientific values. The Royal Society, established in 1660, is perhaps the world’s oldest scientific society. Its motto of Nullius In Verba translates roughly as “Take no one’s word.” Science should not rely on trust, but rather, it should facilitate the open verification of findings and claims. Verification requires that individuals be able to access the materials needed to check every link of the inferential chain for errors or fraud and also that the procedure for reaching a set of conclusions is fully documented. Ideally, any interested party should be able to run the researcher’s analysis code on their publicly posted data and reproduce every value, statistical test, table, and figure in a manuscript. As Lakens (2017) wrote, “When you want to evaluate scientific claims, you need access to the raw data, the code, and the materials” (para. 2). Readers are simply unable to determine the validity of scientific claims without this information. Moreover, many federal agencies and foundations (e.g., NSF, NIH, Wellcome Trust) have begun requiring that data for the studies they fund be open, thus supporting a transition to open data practices. Sharing of data and materials can be accomplished either using an institutional site or via a third-party site (e.g., https://osf.io/).
Open data and open materials have numerous benefits to individual researchers as well as to the larger body of science (Lindsay, 2017). For individual researchers, these repositories can serve as an archive and backup of all data and materials, providing a record that is not tied to a specific computer or employer’s infrastructure, thus minimizing or even preventing the issue of “lost” files or outdated software formats. Open data and open materials also allow other researchers to use previously collected data sets and materials to address their own future research questions (including for research syntheses and meta-analysis), thus putting the data to additional use without putting an extra burden on researchers to respond to requests or find files they may not have used for years. Moreover, rather than such materials only being available on request, making them open implies that they are available automatically; no request must be made and no bureaucracy or fees limit others’ access to the materials. Such sharing becomes a default approach that benefits everyone, both by saving future researchers’ time and by saving the materials themselves from obstacles such as obsolete file formats or even the retirement or death of members of the original research team.
Collecting data and creating materials takes time and resources, so their creators should be credited for this work. For example, although reusing someone else’s analysis code can be a huge time saver, the person who created that code also deserves recognition. Citing prior publications is the most commonly used means of giving credit in current practice. But sharing materials actually increases the number of ways in which creators can be given credit. For example, the OSF site (https://osf.io) can generate digital object identifier (DOI) numbers for code, data sets, or other materials, making their citation easy. This way a researcher is not expected to give up her work without receiving credit for that work. Open materials are helpful resources, but these should not be assumed to be without faults. Mistakes happen that do not always get caught by individual authors or even in the peer review process. Thus, all researchers should thoroughly evaluate all materials, regardless of whether they have been used previously.
Beyond the OSF, there are numerous other data and materials sharing resources whose only cost to users is the time spent in learning how to use and integrate them into their research practices. For example, at a basic level, shared online accounts such as provided by Google Drive and Google Documents can be used to share materials among coauthors and even to make these public. Figshare allows sharing data, figures, and even entire manuscripts. Another similar option, Github, facilitates version control as well as open source and collaborative development of code that can be shared from the onset or on completion. Journal publishers also increasingly offer the opportunity to include ancillary online materials to supplement published studies, ensuring that these materials remain connected to the publication with which they are associated.
Perceived Challenges to Adopting Open Data and Open Materials in Gifted Education
Not every manuscript will be able to share all its data or materials. For example, if the manuscript evaluates a copyrighted curriculum for a gifted program, the authors cannot share the curriculum. However, if the curriculum has a copyright, it is an established and known product. Citing the curriculum so that an interested reader could pursue purchasing the right to use it shows that the authors are following the spirit of open materials, to the extent they are able to in this situation. Some openness is better than none.
Similarly, not all data can be shared openly. For example, a recent longitudinal study of the educational, occupational, and creative accomplishments of talent search participants (Makel, Kell, Lubinski, Putallaz, & Benbow, 2016) could not share data because reporting the specific occupation of an individual could in many cases reveal the person’s identity. Likewise, reporting an individual’s education and employment history (i.e., where they went to college, where they went to graduate school, where they are employed) also has the potential to reveal participant identity. In educational contexts, researchers using restricted administrative data (e.g., from a national data set or school district) may not have permission to share these data. But again, some openness is better than none.
Posting study data sets to public repositories can only be done with IRB permission and participant consent. In education research in which participants are often minors, raw data sharing may be impossible or simply unwise, depending on the circumstances. Protecting participant anonymity and privacy is the overriding concern, and the scientific virtue of open data must always take second priority. However, there are alternatives to full data sharing that are still beneficial. It is well-known that many statistical analyses, such as regression, factor analysis, and structural equation models, can be completely reproduced from sufficient statistics of the data—typically, the variance-covariance matrix, means of each variable, and the sample size. Publishing sufficiently detailed summary statistics provides many of the benefits of full open data, for instance, allowing readers to reproduce the full analysis while avoiding many of the challenges of protecting privacy and working within the confines of IRB requirements. We advocate for sharing as much of the data and as many study materials as possible. Perfect should not be the enemy of the good.
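To illustrate the point about sufficient statistics, the following is a minimal sketch in Python (standard library only) for the simplest case of a regression with one predictor. The data are simulated, and the function and variable names are hypothetical choices made for this illustration rather than part of any study discussed here. Under those assumptions, the sketch shows that the slope, intercept, and slope standard error estimated from raw data can be recovered from only the means, variances, covariance, and sample size.

```python
import math
import random


def summarize(x, y):
    """Return the summary statistics a paper could publish in place of raw data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample variances and covariance use the usual n - 1 denominator.
    var_x = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
    var_y = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    return mean_x, mean_y, var_x, var_y, cov_xy, n


def fit_from_raw(x, y):
    """Least-squares slope, intercept, and slope standard error from raw data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(rss / (n - 2) / sxx)
    return slope, intercept, se_slope


def fit_from_summary(mean_x, mean_y, var_x, var_y, cov_xy, n):
    """The same three quantities, computed from summary statistics only."""
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    # Residual sum of squares expressed through the variance and covariance.
    rss = (n - 1) * (var_y - slope * cov_xy)
    se_slope = math.sqrt(rss / (n - 2) / ((n - 1) * var_x))
    return slope, intercept, se_slope


# Simulated (not real) data: a hypothetical ability score and outcome score.
rng = random.Random(2017)
ability = [rng.gauss(100, 15) for _ in range(250)]
outcome = [20 + 0.5 * a + rng.gauss(0, 10) for a in ability]

raw_fit = fit_from_raw(ability, outcome)
summary_fit = fit_from_summary(*summarize(ability, outcome))
print("from raw data:     ", raw_fit)
print("from summary stats:", summary_fit)
```

The two printed results should agree up to floating-point rounding. The same logic extends to multiple predictors by working with the full variance-covariance matrix, although that case requires matrix algebra that this sketch omits.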
Masking the identity of the researchers may also be helpful, particularly when posting data and materials prior to peer review. Such blinding can serve as an important hedge against many types of bias. Avoiding such bias is critically important, especially for researchers from traditionally underrepresented groups, smaller universities, those who have not achieved high status within the field, or those whose work challenges the findings of others who have achieved high status. Fortunately, open data can be shared anonymously for peer review using the OSF and other online resources. Then, once the study is accepted for publication, the identity of the research team can be revealed.
Statistical analysis code is one of the most valuable (yet least encumbered) materials that can and should be shared. The method section of a manuscript almost never describes an analysis with sufficient detail so that all ambiguities are resolved. Researchers can provide much more clarity regarding the pathway from raw data to statistical conclusions by sharing their analysis code. When analysis code is published along with either raw data or sufficient summary statistics, interested readers can download these materials and verify that the analysis is reproducible. They can also experiment with different model specifications or analytic approaches, allowing the robustness of the findings to be explored. The use of point-and-click statistical software such as SPSS does not preclude code sharing, as researchers can elect to save and share the underlying code produced by the graphical interface.
Preprints
Preprints are published drafts of research manuscripts that are posted online, either prior to having gone through the traditional journal review process or after they have and are in press. In some fields, such as physics, there is a long tradition of posting preprints through the website https://arXiv.org as soon as manuscripts are complete and before going through the traditional journal peer-review process. The arXiv site started hosting preprints in 1991 and now hosts more than 1.2 million of these documents. More recently, several other fields have created similar sites, such as PsyArXiv.org (psychology), SocArXiv.org (sociology), and, more generally, https://osf.io/preprints, which is an open preprint repository that allows searching of preprints across domains. Preprints posted to these services are indexed by Google Scholar and other scholarly search tools, enabling interested readers to discover and access these works. Although we believe that this level of openness and access is a net positive, readers must bear in mind that preprints have not necessarily been peer reviewed and therefore are perhaps more likely than published articles to contain major flaws. Caveat emptor remains the order of the day, regardless of whether an article has been peer reviewed.
Benefits of Preprints
The current academic publication system couples the steps of research evaluation and research publication, whereas separating these steps via preprints can help accomplish several relevant goals. First, preprints help reduce the file drawer effect (Rosenthal, 1979), as well as other malign incentives that may encourage QRPs, by allowing authors to share all results regardless of their statistical significance, political expediency, or novelty. Second, preprints can (optionally) accelerate the dissemination of findings by moving dissemination ahead of external evaluation, depending on when in the process authors decide to post them. By allowing the authors (instead of others) to determine when their work is ready to be shared, researchers are given greater autonomy over both their content (in terms of what and how this content is presented) and the timing of its publication and dissemination. Preprints posted prior to publication remove the delay between study completion and publication.
Third, preprints facilitate feedback from a wider audience than the two to four peer reviewers who typically evaluate journal submissions. More eyes on a piece is generally desirable, as reviewers can sometimes fail to notice important flaws. This feedback can be incorporated into revisions of the manuscript. Fourth, preprints allow authors to circumvent the paywall access to the work that publishing companies usually impose. Although paywalls can lead to astounding profits for publishing companies (Buranyi, 2017), they are harmful to the public interest and serve to limit the potential audience that research can reach. After publication in an academic journal, the title page of a preprint can direct readers to the version of record while maintaining a means of access for interested readers who lack journal subscriptions. Finally, preprints help simplify the roles of journal reviewers and editors by reducing or removing their gatekeeping role, instead focusing efforts on the goal of evaluation of the research (Nosek & Bar-Anan, 2012).
Publishing companies vary regarding their preprint policies. Most publishers allow authors to post the final, nontypeset version of a journal article to their personal website or a university website. However, many publishers allow only pre–peer review (before submission), nontypeset versions of a manuscript to be posted to indexed repositories such as PsyArXiv. Consult the publishing company’s website for the journal to which you intend to submit in order to definitively determine your rights regarding preprint dissemination. A link to the SAGE guidelines (publisher of this journal) can be found at goo.gl/pT1SHD.
Perceived Challenges to Adopting Preprints in Gifted Education
One potential consequence of preprints is that there may be confusion over the version of record if the manuscript is available in several different forms. For example, if the preprint is the originally submitted manuscript but the published article was revised, there may be important differences between the two. The existence of multiple versions can split citations across the different versions, thus pointing future readers to different texts and potentially harming the research team’s h or i10 indices of scholarly impact. Moreover, readers of one version may not be aware that a different version exists, much less that it is available and is being read.
One action authors can take to minimize these concerns is to cite and link to the version of record (probably the published journal article version) in their preprint (no matter which version of the manuscript it is). This way, any reader (particularly those looking for meta-science purposes such as meta-analysis) will be aware of the alternative version. Ideally, authors would also be able to direct readers of the published article to the preprint as well. This way all readers would be aware of all versions and would be able to find and track citations to all versions of the manuscript. The DOI number is particularly important for identification of the version of record.
Similar to conference presentations, preprints posted prior to peer review have the potential to reveal the identity of the research team, depending on when in the publication process they are shared. Unlike conference presentations, preprints can be shared anonymously, though it would be difficult to disseminate or promote them to the broader research community without jeopardizing anonymity. A recent study of anonymity in peer review found that famous authors and those from high-ranked institutions were preferentially selected when researcher identities were revealed (Tomkins, Zhang, & Heavlin, 2017). Thus, early-career researchers or those from traditionally underrepresented racial, socioeconomic, gender, or sexual identity groups may experience negative bias from a lack of anonymity. We expect that the field of gifted education will develop norms regarding when in the process preprints should be posted online, as well as how to balance the positive benefits of preprints (e.g., immediacy, feedback, and accessibility) against the drawbacks of violated anonymity and the sharing of non–peer-reviewed and potentially low-quality work with the public. On balance, we believe that the benefits of preprints outweigh the negatives, particularly when they are shared after the paper is accepted for publication in a peer-reviewed journal.
Actions Journals Can Take
It is our belief that the largest hurdle to implementation of preregistration, open data/materials, and preprints is the perceptions of individual researchers. Authors and journals alike can implement the steps described above without any external systemic change, but there also are additional steps journals can take to reward and facilitate these actions. Below, we describe these steps and offer suggestions for their implementation.
Open Science Badges
Badges are small icons that appear on the title page of published articles. These communicate to readers that the article in question has been produced in accordance with a particular open science practice. Badges offer a near-zero cost mechanism (for both the researcher and the journal) to recognize and reward open science practices. Put simply, badges are a form of recognition for research that has met standards for open science, and there is empirical research supporting that they serve to increase the use of open science practices and to increase the availability and accuracy of data (Kidwell et al., 2016). The most common badges (see Figure 1) relate to open data, open methods, and preregistration.

Figure 1. Badges acknowledging open science practices.
When an author first submits an article for publication, or when he or she submits a final version (after acceptance), he or she can indicate which open science practices have been used in the study. For each badge, the author answers a series of questions. The accuracy of these responses is verified by the journal staff prior to badge bestowal.
For the open data and open materials badges, these questions can be as simple as indicating the permanent online link where researchers can find the original data on which the analyses were based, for replication purposes. A permanent link is also provided so other researchers can access the study’s methods in sufficient detail to conduct a true replication. This might include software code or simply a more detailed version of a data analysis section. As noted earlier, several sites already exist that can accomplish this at no charge.
For the preregistration badge, the author provides evidence of the preregistration time stamp along with assurances that the final analyses match those presented in the preregistration. The author’s answers and assurances are then reviewed by the journal, and, if they are satisfactory, the article is published with the relevant badges pictured on its first page.
Badges indicate that the arguments being made in the paper are less reliant on trust on the part of the reader. Instead, key aspects of the evidence supporting the paper’s arguments are independently verifiable. For this reason, badges signal a higher quality and trustworthiness. We believe that it is likely that badged articles will be cited more frequently as well. At the very least, the findings presented in badged articles are more likely to be trustworthy, and this will increase their influence on policy and ability to persuade skeptics (McBee & Field, 2017).
Registered Reports
An additional step that journals can take is to allow for the submission of registered reports for consideration rather than only accepting completed manuscripts. Registered reports involve the author submitting the manuscript’s introduction, literature review, hypotheses, and proposed methods for peer review prior to any data collection or analysis. This initial stage of a registered report is quite similar to a grant application or a dissertation proposal. The journal reviewers then assess the quality of the study design and the information it will provide, without any consideration of the desirability of the findings. Instead, the proposal should be reviewed and accepted or rejected based on its theoretical premise, hypotheses, and methodological rigor. Thus, the publication decision is decoupled from the findings, removing all incentives for researchers to engage in QRPs.
If successful in its proposal, the project receives an in-principle acceptance for publication. Another round of review occurs after data are analyzed, but this is more perfunctory than the traditional review process. It is closer to a check of whether the author(s) conducted the study as proposed and produced a manuscript suitable for publication. If the article is poorly written (which likely would have been noticed at initial submission) or makes inappropriate logical leaps in its implications section, the reviewers and editors are empowered to require changes prior to publication. Similarly, unforeseen difficulties in data collection could complicate final acceptance and, like all manuscripts, would have to be dealt with on a case-by-case basis. For example, if only a third of the planned number of participants were included, this could be reasonable grounds for not publishing the final manuscript. For more detail on registered reports, see Chambers, Feredoes, Muthukumaraswamy, and Etchells (2014); Nosek and Lakens (2014); or https://cos.io/rr/#RR.
In addition to improving the quality and trustworthiness of the scientific literature, we believe that embracing the registered reports format will confer benefits on researchers as well. In the current peer-review process, which evaluates completed projects, it is often the case that reviewers identify major shortcomings in a study’s design or instrumentation that allow for alternative explanations of the findings. By this point it is too late to alter the flawed design because the data already have been collected. In some cases, this leads to the study being rejected for publication altogether, in which case the authors have wasted considerable time, resources, and effort. Research subjects also may have been exposed to potentially harmful (or at least nonbeneficial) interventions for nothing. In other cases, authors, having been rejected by a “flagship” journal, will resubmit the paper to a different journal that is perceived to be less selective. The paper is once more sent out for peer review, encumbering another panel of reviewers and their expertise, thus creating more work for all involved, including reviewers and editors. If the paper is eventually deemed to be publishable, it is certain that the authors will need to add extensive discussion of the study’s flaws to the “Limitations” section of the manuscript. The paper ends up contributing less value to the literature than the authors had hoped because, in the final analysis, the evidence for the study’s central claims is weak.
By addressing these potential issues up front, we believe that the registered reports format offers a more humane process for researchers and reviewers. Peer review of the study’s methodology occurs at a point in the process when changes can actually be made, not after it is too late. The specific editors and reviewers working on a registered report may shoulder a larger burden of the work, but this work is concentrated only on those individuals—not the other sets of potential reviewers who will be encumbered as the article is “shopped around” to different journals.
The adoption of registered reports will help avoid putting reviewers and editors in the difficult position of telling their colleagues that a study is doomed and that effort was wasted. Through registered reports, reviewers also avoid being placed in the unenviable position of second-guessing a study’s purported claims (e.g., “there’s no way that effect size can be so large!”) due to unprovable suspicions about the use of QRPs. At the time of this writing (November, 2017), 80 journals (for an updated list, see https://cos.io/rr/#journals—Gifted Child Quarterly was the 71st!) are now accepting registered report submissions, and the list is growing on a weekly basis. We hope that all the gifted education journals will soon appear on this list.
A Call to Action
In our ideal world, all researchers would use open science practices as much as possible and journals would incentivize open science practices, while leaving room for reasonable exceptions. That said, we know some of these changes would greatly challenge the existing incentive system and, as such, will be controversial. Here are several simple steps that journals can take to move toward more open practices:
Move immediately to implement the badge system. This would involve minimal work for a journal, its reviewers, or its editors. Journals could add check boxes to their submission system through which authors could indicate if they have preregistered their study, if they are making all data available via a permanent link, or if they provide detailed methods (with sufficient detail to allow for replication) via a permanent link. Authors not wishing to commit to these practices could still have their manuscripts submitted and reviewed in the usual manner, but those who wished to move toward open science practices would be able to have the relevant badges printed on their articles. This would provide readers with a greater degree of confidence in the results of the badged papers as well as inform them if data and methods are immediately available for replication or for other research purposes.
Encourage reviewers to be on the lookout for open science practices when reading manuscript submissions. Many reviewers already look to see if study methods are presented in sufficient detail to allow for replication. Another step a journal could take would be to encourage its reviewers to look for statements about preregistration, more detail on the specific hypotheses being tested, and whether the data will be made publicly available. The journal Psychological Science recently added a requirement for all manuscripts reporting new empirical data to include an open practices statement (https://www.psychologicalscience.org/publications/psychological_science/ps-submissions#OPS). This statement does not mandate open practices; it merely mandates the explicit statement of whether the work was preregistered and whether data/materials are available. No explanation is required, and the answers are said to have no bearing on the peer-review process.
Accept and encourage replication studies and publication of null findings. Editors should make clear that they support and welcome replication research submissions for publication. Furthermore, editors should be on the lookout for reviewers who are hostile toward a submission simply because it found a nonsignificant (yet theoretically justified) effect or because it was a replication attempt of a prior study. As Makel and Plucker (2014) argued, facts are more important than novelty.
Allow authors to submit research for review before the results are known via the registered reports format. Very little of the review process would need to change to do this. Reviewers would not be influenced by whether or not the findings were supportive of a particular agenda, or by whether they were statistically significant. Moreover, authors would get feedback on their method and analysis plans at a time when they can ethically alter these plans and avoid mistakes that they may have overlooked.
Before moving on from this topic, we want to again point out how inexpensive and effortless it would be to implement most of these open science practices. Committing to a research hypothesis before conducting any data analysis does not require extra time on the part of reviewers or editors beyond simply accepting a preregistration certificate. Checking to make sure data sets have indeed been uploaded to a permanent repository would similarly take only a few moments. We believe that the benefits to the field enormously outweigh the costs.
Conclusion
A field that seeks to be at the cutting edge of education can also reap great benefits from being at the cutting edge of research practices. This may be doubly true for fields such as gifted education, which often struggle to garner public support (Plucker, Makel, Matthews, Peters, & Rambo-Hernandez, 2017). Not every study needs to implement every open science practice. However, every additional aspect of methodological rigor that can be added to research armors the subsequent results from potential detractors. Open science practices such as preregistration, open data, open materials, and preprints can help improve the rigor of research, and increase access to important materials, which can help disseminate quality work.
No research study is perfect, but that does not mean that methodological rigor cannot improve the value individual studies provide. None of the practices discussed here are a panacea. Nor do they avoid all problems that can arise in the research process. However, they each help reduce or remove common ailments that weaken research results. Such efforts will increase the credibility and larger utility of the field’s work (Munafò et al., 2017).
Footnotes
Authors’ Note
Coeditor Michael S. Matthews is an author on this study, which initially was submitted on April 29, 2017, prior to Dr. Matthews’ editorial involvement with the journal. The manuscript was accepted for publication under the editorship of D. Betsy McCoach and Del Siegle, who were also responsible for all phases of the review and revision process. The article manuscript reflects the views of its authors and does not at this time describe official editorial policy for Gifted Child Quarterly.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
