Rigor Mortis: Statistical thoroughness in reporting and the making of truth

Abstract

Should a uniform checklist be adopted for methodological and statistical reporting? The current article discusses this notion, with particular attention to the use of old versus new statistics, and a consideration of the arguments brought up by Von Roten. The article argues that an overly exhaustive checklist that is uniformly applied to all submitted papers may be unsuitable for multidisciplinary work, and would further result in undue clutter and potentially distract reviewers from pertinent considerations in their evaluation of research articles.

Keywords

methodology, philosophy of science, review, statistics

My intention is not to replace one set of general rules by another such set: my intention is, rather, to convince the reader that all methodologies, even the most obvious ones, have their limits.

Paul Feyerabend, Against Method

In the current article, I discuss some issues relating to the relevance of thorough methodological and statistical reporting as the gold standard for assessing the merit of papers. I focus the discussion on Von Roten’s (2015) suggestions for a reporting checklist, and point out potential shortcomings of over-reporting as well as problems with particular reporting requirements.

1. The value of p-values: Seeing evidence in context

As pointed out by Von Roten (2015), the value of p-values can be, and frequently has been, called into question. The use of p-values as indicators of the truth of a hypothesis was criticized from the outset, and that criticism has multiplied over the past five decades, with diatribes against the old paragon of social science objectivity appearing across fields: in medicine (Goodman, 1999), psychology (Shrout, 1997; Winch and Campbell, 1969), sociology (Raftery, 1995), marketing (Sawyer and Peter, 1983), and education (Carver, 1978).

We concur with Goodman (1999) in suggesting that the truth or merit of a particular result should be assessed not in isolation, but in relation to prior knowledge and understanding (although we must not let those squash all innovation under the illusion that we have some objective and final undergirding for any and all further findings). As he points out, the fetishization of statistics as the only arbiters of truth can be misleading, especially when taken as rote practice and without thorough understanding. It can lead to the pretense that conclusions drawn from results, which exist on an explanatory level different from the cold hard numbers, are objectively merited and are merely a verbal translation of the results. Much like the graphs in our article (Tal and Wansink, 2014), it runs the risk of conferring an illusion of truth, an idea we expand on below.

On a very small scale, this need to interpret a piece of evidence within a broader context arises with regard to Von Roten’s (2015) criticism of treating a .07 significance value as “marginal significance.” Although it is true that in isolation one should not make much of such a finding, it should be evaluated in the context of the broader investigation. Here, where similar effects have been replicated numerous times, a strong directional pattern that also obtains a .07 significance level is a meaningful piece of evidence within the larger context. The strength of the evidence rests not on any one such finding, but on the convergence of consistent evidence across studies.
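One way to make the idea of converging evidence concrete is to combine p-values across studies. The sketch below uses Fisher's method, which is a swapped-in illustration and not a procedure proposed in this commentary or in Von Roten's checklist. It assumes the studies are independent and test the same directional hypothesis; the function name `fisher_combined_p` and the three .07 values are hypothetical.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    upper-tail probability has a closed form, so no external library is needed.
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # chi-square statistic / 2
    # P(X > x) for df = 2k equals exp(-x/2) * sum_{i<k} (x/2)^i / i!
    tail = sum(half_stat ** i / math.factorial(i) for i in range(k))
    return math.exp(-half_stat) * tail

# Hypothetical example: three independent studies, each "marginal" at .07.
combined = fisher_combined_p([0.07, 0.07, 0.07])
print(f"combined p = {combined:.3f}")
```

Under these assumptions, three results that each fall short of .05 jointly yield a combined value of about .014. The same caveat applies here as to any single statistic: the combined number is only as trustworthy as the independence of the studies and the completeness of the set being combined.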

In addition, consider that increasing sample size, as well as the use of sufficient covariates, can often reduce noise sufficiently for such marginally significant effects to become significant (although it is true that they would at times prove ephemeral). Use of such statistical tricks in the service of achieving the holy grail of .05 is not unheard of. Would a .05 with multiple covariates, or with 1000 participants, be more convincing than a .07 with a reasonably sized sample (big enough but not “too big”) and no additional noise-reducing tools?
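The dependence of the p-value on sample size can be illustrated with a short calculation. The sketch below is a deliberate simplification and not an analysis of any study discussed here: it assumes a two-sided z-test for the difference between two equal-sized groups with known unit variance, and the standardized effect of 0.2 and the group sizes are hypothetical.

```python
import math

def two_sided_p(effect_size, n_per_group):
    """Two-sided p-value for a two-group z-test with unit variance.

    effect_size is the standardized mean difference (hypothetical input);
    the standard error of the difference is sqrt(2 / n_per_group).
    """
    z = effect_size * math.sqrt(n_per_group / 2.0)
    # For a standard normal Z, P(|Z| > z) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))

# Same hypothetical effect (0.2 SD), increasingly large samples.
for n in (100, 165, 500):
    print(f"n per group = {n:4d}  p = {two_sided_p(0.2, n):.4f}")
```

With the effect held fixed, the p-value moves from above .10, to roughly .07, to well below .05 purely as a function of sample size, which is why the threshold alone says little about how convincing a result is.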

A research climate that favored research programs spanning multiple studies, and that reported failed as well as successful studies along with a discussion of potential reasons for the failures, would be more educational and instrumental for the “accumulation of knowledge” (a term I hesitate to use) than the current norm of presenting a cleaned-up and often illusory picture.

To return to the types of statistics that should be used, the New Statistics merit some recognition. In response to the crisis in belief in the old methods, as well as problems uncovered in current practices (Pashler and Wagenmakers, 2012; Simmons et al., 2011), researchers have recently suggested the use of a range of statistical methods and reporting tools, focusing on effect sizes, confidence intervals, and meta-analyses (Cumming, 2013). These calls have led to changes in the requirements of prominent social science journals (e.g. Eich, 2014), with calls for change in practices resonating up to the highest bastions of Science (pun intended) (Nosek et al., 2015).
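For readers less familiar with these tools, the following sketch shows what reporting an effect size with a confidence interval might look like for a simple two-group comparison. It is an illustration under stated assumptions, not a recommended standard: it uses Cohen's d with a pooled standard deviation and a common large-sample approximation for its standard error, and the function name and the two small data sets are made up for the example.

```python
import math
import statistics

def cohens_d_with_ci(group_a, group_b, z=1.96):
    """Return Cohen's d and an approximate 95% confidence interval.

    Uses the pooled standard deviation and the large-sample approximation
    se = sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))).
    """
    n1, n2 = len(group_a), len(group_b)
    v1, v2 = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    d = (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d, d - z * se, d + z * se

# Hypothetical ratings from two tiny groups (illustrative numbers only).
d, low, high = cohens_d_with_ci([5, 6, 7, 8, 9], [4, 5, 6, 7, 8])
print(f"d = {d:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

With samples this small the interval is wide and spans zero, which conveys the uncertainty of the estimate more directly than a lone p-value would.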

Importantly, one should avoid taking the glamor of “The New Statistics” as the be-all and end-all. As much as p-values suffer from limitations, so do other methodologies. We should be careful not to fall into the trap of merely substituting one unquestioned orthodoxy for another.

We concur that p-values are indeed fraught with problems, especially when used as the sole arbiter of truth. However, given that substitutes for p-values are still being formulated in the social sciences, and that p-values constitute the old “Gold Standard,” they are often still employed, and their absence in a paper would raise questions with many a reviewer.

As Von Roten (2015) herself acknowledges, authors should communicate in a way that is comprehensible to their audience. Seeing as p-values are still a common currency, throwing them out the window at this point would seem premature. At the least, they should supplement some of the “New Statistics” while the gold standard for those is being established.

2. It’s raining N (and a plethora of other details)

Considering the thoroughness of reporting, we agree with Von Roten (2015) that the sufficiency of reporting should indeed be supervised by the review team. This in turn implies that it is up to the review team’s judgment to decide whether or not they possess sufficient information to assess claims. If they do not, they may request additional information required to assess claims (although, granted, authors should endeavor to have all necessary information contained in the submitted paper). Exhaustive detail is not necessarily a boon, and decidedly not as a knee-jerk reaction.

There are several reasons why an overabundance of detail may be damaging to a paper: (1) It may make the paper needlessly long and more difficult to read. (2) It may call reviewers’ attention to technicalities at the possible expense of substance. (3) It may dazzle reviewers with an aura of “science well done,” leading to the neglect of potentially pertinent considerations.

To start from the last point, exhaustive details may, rather than guaranteeing truth, confer an increased “halo” of truth. Much like graphs or numbers can increase credibility in the eyes of the public at large, increased “science” detail can increase credibility in the eyes of an academic audience, which also tends to be composed of human beings who are similarly influenced by “cues to truth.”

In this context, it is worth considering our argument (Tal and Wansink, 2014) concerning the credibility versus the perceived credibility of scientific claims. Importantly, we did not argue that the use of scientific norms or the peer review system confers credibility on scientific claims, but rather that it confers credibility in the public mind. In the current context, I argue that such perceived credibility may be a problem with the reviewing of articles, and that some of the suggestions made by Von Roten (2015) may in fact exacerbate the problem.

3. One size fits all? The issue of suitability

Some of the requirements espoused by Von Roten (2015) appear excessive for at least some investigations, and of questionable applicability for others. This is especially true given the multidisciplinary nature of Public Understanding of Science (PUS), but would apply to many other social science publications, which at times call for a flexibility of method.

The standards adopted by PUS should probably conform to reasonable, concrete requirements that are specific to each particular type of investigation (experimental, survey, qualitative, etc.). We agree that a set of standards, when exercised with judgment (and not blindly), may indeed improve reporting, although my personal experience is that such standards are absent from many journals. Perhaps one reason for their absence is that, when included, they risk being exercised arbitrarily and blindly, becoming an easy set of criteria around which reviewers focus their acceptance or rejection decisions instead of employing more pertinent considerations.

Given the interdisciplinary nature of works within PUS, care should be taken in the selection of criteria for each type of methodology. Von Roten (2015) claims that her recommendations apply “irrespective of the methodology that produced the data.” However, before that she recognizes that the material covered in the journal is diverse and includes “experiment, survey, content analysis, interview, observation.”

Without going into undue detail, consider the applicability of some of the requirements in the checklist to these methods. To use some obvious examples, employing figures and graphs, or measuring the quality of the model, does not apply to a qualitative investigation. Although the effort to establish an objective and universal checklist for methodological and statistical reporting is laudable, this highlights the potential problem in attempting to establish thorough and uniform standards for a field of study that requires diversity and perhaps innovation in methodology.

The lack of specificity of the list also results in some vagueness in its requirements. The checklist would be of greater use if the specific intent behind some of its requirements were elucidated. For example, terms such as “quality of the model” are too abstract to be of concrete use in any given setting.

Death by detail: Too much of a good thing?

In addition to its one-size-fits-all approach, the checklist suffers from requiring some detail that may be excessive in many particular investigations. Although we lack systematic analysis of this issue (which may in itself be enlightening), we do not recall many psychological articles that do indeed check all the assumptions of any model used. If measures are appropriately designed, some assumptions about the data may be legitimate. Similarly, both the need for measures of model quality and the choice of such measures would depend on the specific model used.

One potential issue that may arise out of blind enforcement of a checklist is clutter. Exhaustive reporting of all the items mentioned in Von Roten’s (2015) list may well create undue clutter, with considerable redundancy. It is true that in the age of online publication and the endless proliferation of journals, space is no longer a true limitation. For online publication, at least, each article can be as long as authors and reviewers deem fit. However, as Von Roten (2015) herself acknowledges, one should still consider the question of readability. Ultimately, articles are written to be read, and although space may be unlimited, readers’ attention is not. Overly exhaustive reporting within each article runs the risk of losing the forest for the trees, harming the work of the review team. In addition, such detail is unlikely to be read by most readers upon publication. At most, much of the information can be included in a methodological appendix.

Furthermore, some of the required detail is unfeasible from the outset. Consider, for example, Von Roten’s (2015) suggested requirement to report all potential uncontrolled confounds in random-assignment studies. In the behavioral sciences, the list of such confounds is veritably and inevitably endless. Consequently, such an endeavor is unrealistic, as well as of dubious usefulness. It is easy enough to report age and gender equivalence, for instance, but consider the many other “standard” variables that may affect behavior (nationality, education, socio-economic status). Even the list of measures that are clearly and distinctly relevant is long. When considering all other variables that might affect behavior (personality dimensions, for instance), the list becomes endless. Bear in mind, as well, that many factors that are not considered to be potential confounds, as well as ones for which credible measures do not yet exist, may also be relevant.

Consequently, what authors include is likely to consist of measures that are fashionable in their time. Reporting a select few metrics in this manner could often merely serve as lip service, another means to confer the appearance of scientific truth rather than any true assurance of the equivalence of randomly assigned groups.

Similarly, it would be impossible to measure and control for all potential covariates in non-random-assignment studies. Measurement should be restricted to the most relevant culprits, with the realization that many other relevant dimensions will thereby be left out.

All such heaping of technical details may serve to derail, rather than support, reviewers’ evaluations, blinding them with the appearance of science, and leading them to ignore other substantive criteria for the acceptance or rejection of an article (e.g. its narrative/theoretical merit, the coherence of its argument, and its contribution).

4. So what do we do?

This article does not intend to disparage a thorough reporting of study methodology and results. We applaud the effort to compile a useful checklist for authors and reviewers, and agree with many of the guidelines offered. We agree, for example, that measures should be precisely reported, and that omission of data should be reported along with its reasons (and grudgingly undertaken!). Assuring that designs and results adhere to current standards (while realizing the limitations and changeability of these standards) is a necessary step if we are to rely on published findings.

While many of Von Roten’s (2015) recommendations are sound, treating them as a rigid checklist to be followed may do more harm than good: cluttering papers, making the submission and review process unduly burdensome, serving as a rhetorical tool that undeservedly enhances persuasiveness, and, given limited time and attentional resources, distracting reviewers and readers from other pertinent criteria for evaluation. Rather than aspiring to a conclusive and uniform list, good judgment should be exercised in deciding which items should be reported. Rather than being taken as absolute rules, perhaps the points would be better taken as guidelines to be considered on an individual basis.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Author biography

Aner Tal is a Research Associate at Cornell University and lecturer at Ono Academic College. He obtained his PhD in Consumer Psychology / Behavioral Economics from Duke University, and his Master’s in Philosophy from Tel-Aviv University. His research interests include food behavior and judgment, persuasion, moral and risk intuition, and rationalization.

References

Carver (1978) The case against statistical significance testing. Harvard Educational Review 48(3): 378–399.

Cumming (2013) The new statistics: Why and how. Psychological Science. Epub ahead of print 12 November. DOI: 10.1177/0956797613504966.

Eich (2014) Business not as usual. Psychological Science 25(1): 3–6.

Goodman (1999) Toward evidence-based medical statistics. 1: The P value fallacy. Annals of Internal Medicine 130(12): 995–1004.

Nosek, Alter, Banks, Borsboom, Bowman, Breckler, et al. (2015) Promoting an open research culture. Science 348(6242): 1422–1425.

Pashler and Wagenmakers (2012) Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science 7(6): 528–530.

Raftery (1995) Bayesian model selection in social research. Sociological Methodology 25: 111–164.

Sawyer and Peter (1983) The significance of statistical significance tests in marketing research. Journal of Marketing Research 122–133. Available at: http://www.bauer.uh.edu/jhess/documents/Sawyer%20and%20peter.pdf

Shrout (1997) Should significance tests be banned? Introduction to a special section exploring the pros and cons. Psychological Science 8(1): 1–2.

Simmons, Nelson and Simonsohn (2011) False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science. Epub ahead of print 17 October. DOI: 10.1177/0956797611417632.

Tal and Wansink (2014) Blinded with science: Trivial graphs and formulas increase ad persuasiveness and belief in product efficacy. Public Understanding of Science. Epub ahead of print 15 October. DOI: 10.1177/0963662514549688.

Von Roten (2015) Statistics in Public Understanding of Science review: How to achieve high statistical standards? Public Understanding of Science. Epub ahead of print 20 July. DOI: 10.1177/0963662515595195.

Winch and Campbell (1969) Proof? No. Evidence? Yes. The significance of tests of significance. The American Sociologist 4(2): 140–143.