Abstract
Should mandatory theorizing be a condition for researchers to publish articles? Should more synthesis be required, perhaps with the aid of meta-analysis? This short comment focuses on both questions that arise from Phaf’s article, “Publish less, read more” (2020).
Hans Phaf (2020) published a praiseworthy article that explains some of what is detrimental to experimental psychology and that suggests solutions. For full disclosure, I was a reviewer on an earlier version and believe that the author fairly addressed the criticisms contained in my original review. In addition, I agree with the editor’s decision in favor of publication, though with slight reservations. Although a minor goal is to bring out the reservations, the larger goal is to use Phaf’s commendable piece as a vehicle for saying things that need to be said.
Phaf (2020) would like to see mandatory theorizing, and for experimental psychologists to test competing hypotheses in the tradition of Platt (1964). Rather than testing the substantive hypothesis against a generic null hypothesis, as is typical, Phaf recommends testing competing substantive hypotheses against each other. But the recommendation does not sufficiently consider how researchers are to discover the substantive hypotheses that are to compete with each other. Although Phaf is certainly correct that post-hoc hypotheses based on previous data can become a priori hypotheses in the future, this possibility seems inadequate to meet his goals. After all, as Phaf himself complains, there is much published data (too much, according to Phaf), and so there is more than sufficient data for ad hoc hypothesizing. But such hypothesizing has not happened at the level Phaf desires. Why not?
There are multiple reasons, and I will not attempt to cover all of them. But some relevant distinctions can profitably be made. One important distinction is between general theories and empirical hypotheses. General theories cannot be tested directly because they contain non-observational terms; it is necessary to include auxiliary assumptions to connect non-observational terms in theories with observational terms in empirical hypotheses. In contrast, empirical hypotheses contain observational terms, or at least terms that are more observational, and can be tested more directly than can theories. Here we come to an ambiguity. Are researchers to come up with major theories, empirical hypotheses, or something in between? To expect a new major theory in each published article is unreasonable in the present academic publish-or-perish environment. Perhaps this can be changed, but it is not clear how to bring such change about. Moving to hypotheses, it is already customary in many areas (for example, in management and marketing journals) to list hypotheses, and sometimes these are competing. Yet it is far from clear that progress is any faster or surer in these journals than in others. Unfortunately, empirical hypotheses that are detached from a more unifying theory are not often of much value. Put another way, hypothesizing only to hypothesize, when there is no theory to which to connect the generated hypotheses, likely will not contribute substantially to the progress of science. Perhaps Phaf’s (2020) recommendation should be altered slightly to recommend that the hypotheses be connected to theory via auxiliary assumptions.
It is also possible to distinguish between empirical findings and empirical laws. Hempel (1965) argued that researchers can derive theories, whether competing or not, as explanations of empirical laws. Certainly, there are many cases in the history of science where empirical laws were established first, and theoretical explanations came later. But in psychology, and perhaps the social sciences more generally, there is a general lack of established empirical laws. The infamous irreproducibility crisis points to this lack. It is very difficult to demonstrate empirical laws when researchers cannot even get direct replications, not to mention conceptual replications, to work. Of course, Phaf (2020) is correct that good theoretical thinking can aid in deciding what kinds of replications are appropriate and what kinds are not. Nevertheless, it would be an overreach to put all, or even most, of the blame for irreproducibility on lack of good theorizing, as there are other important reasons for irreproducibility (Earp & Trafimow, 2015; Trafimow & Earp, 2016). As I have detailed elsewhere, our inferential procedures (significance tests and confidence intervals) are problematic (e.g., Trafimow, 2019a). But our notion of what counts as a successful replication is also problematic (Trafimow, 2018a), and Phaf does not provide a clear alternative notion. I have provided such an alternative but will not repeat the description in this short comment. Returning to Hempel, without a proper conception of replication, which is surely a prerequisite for establishing empirical laws, a central road to good theorizing that otherwise would run through empirical laws is mostly closed.
Phaf recommends that researchers synthesize more, with meta-analysis as a potentially valuable procedure for better theorizing. Although I would not argue against researchers synthesizing, I believe there are important limits, both with respect to the general notion of synthesizing and with respect to using meta-analysis to do it. In an article I published several years ago (Trafimow, 2012), I contrasted integration (synthesis) with unification. My argument was that although integration can be useful, the major action in the history of science pertains to unification, which includes discovering a basic principle underneath the phenomena of interest that makes us see what we know in a different light (Whewell, 1840). Galileo, Newton, and Einstein provide examples from physics; Darwin provides an example from biology; and there are examples in other sciences too. In psychology, it would be difficult to find comparable unification examples (but see Trafimow, 2012, for attempts). I would hate to see so much emphasis on integration that nobody in psychology attempts unification.
For those who feel unification to be too much of a challenge in psychology, and who are willing to settle for integration, is meta-analysis really as powerful a tool as it is cracked up to be? I am ambivalent. There can be little doubt that literature reviews can bring points to light that otherwise would be difficult to see and can have important influences on the subsequent directions of large research areas in psychology. An example is the famous review of the literature on attitude–behavior relationships by Wicker (1969). Wicker performed a narrative review of the literature, sans meta-analysis, and showed that the general rule was for poor, nonexistent, or even negative attitude–behavior correlations. Because most researchers at that time defined attitudes as “predispositions for behaviors,” Wicker’s review constituted an extremely important counterforce in the history of attitude research. The article provided a major impetus for the social psychology crisis in the 1970s, in which researchers became extremely concerned about whether attitude, the dominant social psychology construct at the time, deserved its top billing. Meta-analysis often is assumed to add to the power carried by narrative reviews because of the possibility of computing descriptive and inferential statistics across many studies. But there is a problem too, especially with respect to inferential statistics. Specifically, the computations of p-values or confidence intervals depend on many underlying assumptions (see Bradley & Brand, 2016; Trafimow, 2019b, for taxonomies of these assumptions). Some of the assumptions are surely false even within single studies, and they are even less tenable when applied to congeries of studies. For example, in the thousands of psychology articles I have read, I have never encountered one where there was random and independent sampling from a defined population, though this is a basic requirement of the statistical tests used to obtain p-values and confidence intervals.
When there are many studies, each with a different configuration of geography, laboratory setup, time, experimental paradigm, and so on, the underlying assumptions are nowhere close to being met. Although p-values and confidence intervals are generally a bad idea anyway, the blatant violations of underlying assumptions when combining across studies, each study representing a different population, make them even less interpretable.
The descriptive statistics are problematic too. To see why, imagine that Laplace’s (1951) Demon, who knows everything, assured us that the sample statistics in a particular study have absolutely nothing to do with the corresponding population parameters. We would promptly throw out the study, thereby demonstrating that sample statistics mean little unless accompanied by some assurance that they are relevant to corresponding population parameters. Well, then, in the case of an average study, unless the reader (or a reviewer) can generate a reason to disbelieve that the sample statistics are related to corresponding population parameters, we tend to take it on faith that there is some sort of relationship, though few would expect the relationship to be perfect. But in the case of a meta-analysis, the question of what constitutes the population is especially problematic because each study included in the meta-analysis is from a different population. With different configurations of study characteristics, there is no particularly relevant population; consequently, the meaning of the descriptive statistics as they pertain to population parameters is ambiguous. Suppose a researcher performs a meta-analysis across 30 studies and is interested in the weighted mean. Well, then, because of the many configurations of study characteristics across the 30 studies, there is sampling from 30 different populations. The interpretation of the weighted mean, given that the samples represent 30 different populations, is murky at best and downright nonsense at worst. More generally, formal meta-analysis adds little benefit over narrative reviews that contain tables of findings from the included studies and careful descriptions of them. In fact, formal meta-analysis may be worse because its scientific aura may create unwarranted confidence in the resultant inferences.
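The weighted-mean concern can be made concrete with a minimal sketch. Everything in it is an assumption introduced for illustration: the three effect sizes, the two sets of sample sizes, and the helper name `weighted_mean` are hypothetical, and weighting by sample size is only one of several weighting schemes a meta-analyst might use. The sketch holds each study's effect fixed and changes only how heavily each study is weighted.

```python
def weighted_mean(effects, weights):
    """Return sum(w_i * e_i) / sum(w_i) for paired effects and weights."""
    total_weight = sum(weights)
    return sum(w * e for w, e in zip(weights, effects)) / total_weight


# Hypothetical effect sizes from three studies, each assumed to have
# sampled from a different population (made-up numbers).
effects = [0.10, 0.50, 0.90]

# Two hypothetical sets of sample sizes, used here as the weights.
sizes_equal = [100, 100, 100]
sizes_skewed = [400, 50, 50]

pooled_equal = weighted_mean(effects, sizes_equal)
pooled_skewed = weighted_mean(effects, sizes_skewed)

# The per-study effects are identical in both cases; only the mix of
# sample sizes differs, yet the pooled value moves substantially.
print(round(pooled_equal, 2), round(pooled_skewed, 2))
```

Under these made-up numbers, the pooled value shifts with the mix of sample sizes even though no study's effect has changed, and in the skewed case it matches none of the three study effects. This is one way of seeing why it is unclear which population parameter, if any, the weighted mean estimates.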
In conclusion, Phaf (2020) wishes to make theorizing mandatory, a wish with which I sympathize but which I believe should be tempered by a strong recognition of the difficulties involved. We have seen that two roads to better theory, the Hempelian road and the integration (synthesis) road, are problematic: the former because of the irreproducibility crisis and the difficulty of establishing general laws, and the latter because of (a) the limitations inherent in synthesis without unification and (b) the limitations of meta-analysis as a synthesis tool. My fear is that, to meet the demand of mandatory theorizing, researchers will simply trump up hypotheses that neither Phaf nor I would favor. This can be done in a variety of ways that are deleterious to psychology. For example, consider a researcher who wishes to show that one hypothesis is better than a competing hypothesis but is not sure how to derive the competing hypothesis. It is difficult to avoid the temptation to set up the experimental paradigm so that the favored hypothesis makes a prediction and the straw-person hypothesis does not. After invalidly interpreting the lack of a prediction as being equivalent to predicting no effect, the researcher can demonstrate an effect that seemingly shows the favored hypothesis to be superior, all while complying with Phaf’s plea for researchers to test competing hypotheses. Phaf has contributed a laudable piece, but properly implementing the suggestions is easier said than done.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
