Abstract
Wood, Kressel, Joshi, and Louie (2014) meta-analyzed studies examining changes in women’s mate preferences as a function of cycle phase, and claimed to find little evidence for shifts, contrary to Gildersleeve, Haselton, and Fales’s (2014a) meta-analysis. This commentary concerns specific speculations Wood et al. made about particular researchers analyzing data multiple ways, capitalizing on chance and thereby inflating the Type I error rate. In so doing, Wood et al. misconstrued a key article explaining the high fertility period, misrepresented studies, and presented no supportive evidence. The corrosive effects of inappropriate research practices on scientific literatures are concerning. So too are unsubstantiated speculations about them.
Wood et al. (2014) meta-analyzed studies examining changes in women’s mate preferences as a function of cycle phase, and found little evidence for systematic effects. Gildersleeve et al. (2014a), meta-analyzing the same domain, reported highly robust effects (though also gaps in understanding). Differences that drive contrasting conclusions are multiple (see Gildersleeve, Haselton, & Fales, 2014b), but are not the focus here. Wood et al. (2014) found certain groups of cycle effects to be robust—specifically, on preferences for symmetry and health, the effect for masculinity preferences just falling short of significance (see also Gildersleeve et al., 2014a, 2014b). But they conclude, “The few instances in which women’s preferences shifted across the cycle appeared to be largely artifacts of research practices” (p. 245). They furthermore speculate that some researchers made post hoc decisions about analyses that inflated Type I error rates, practices now known as p-hacking (Simmons, Nelson, & Simonsohn, 2011; Simonsohn, Nelson, & Simmons, 2014). As documented next, their speculations involve three misrepresentations of data.
The Suspicions
Wood et al. (2014) note an association between size of the fertile window and effect size. They speculate that this association occurred because certain researchers made post hoc decisions about how to define the fertile window. They claim, “The symmetry and health preference shifts evident in studies that did not specify a relationship were obtained primarily using broad fertile windows” (7 days or more) (p. 242). Citing Wilcox, Duncan, Weinberg, Trussell, and Baird (2001), Wood et al. assert, “researchers cannot easily justify including more than 6 days in the fertile phase in a standard 28-day cycle” (2014, p. 242). They imagine a scenario in which researchers “may have begun with this narrow designation, and in the case of finding nonsignificant results, successively broadened the fertile phase definition in exploratory analyses designed to maximize the distinction between fertile and nonfertile women’s preferences in a given sample,” procedures that “would have inflated Type 1 error and produced spurious evidence of cycle shifts” (2014, p. 243).
In citing this section of Wood et al.’s (2014) article, Harris, Pashler, and Mickes (2014) refer to these practices as “p-hacking.” Researchers p-hack when “wittingly or unwittingly exploiting unacknowledged degrees of freedom in order to obtain positive results (Simmons, Nelson, & Simonsohn, 2011)” (Harris et al., 2014, p. 1260)—that is, making undeclared post hoc decisions about data analyses that inflate the Type I error rate. I emphasize that Wood et al. never use the term “p-hacking” in their article. Because this term offers a simple handle, I use it in this article.
Three Errors in Wood et al.’s Speculations
Timing of the Fertile Window
Wood et al.’s (2014) speculation rests on the claim that researchers cannot easily justify using a fertile window larger than 6 days. On their account, those who did may have started with a 6-day window and, not finding significant results, successively widened it until significance emerged. This scenario, Wood et al. claim, could explain the positive association between window size and effect size, which Harris et al. (2014) were “unable to think of any completely benign explanation for” (p. 1263).
In fact, Wood et al.’s (2014) claim that researchers cannot justify high fertility periods longer than 6 days misrepresents Wilcox et al.’s (2001) article. Wilcox et al. stated: “We have estimated previously that the mean probability of a clinical pregnancy with a single act of intercourse is 0.04, 0.13, 0.08, 0.29, 0.27, and 0.08 for the 6 consecutive days ending with ovulation [6]. (Outside this 6-day interval, the estimated probability of pregnancy is < 0.01.)” (2001, p. 211).
Wood et al. (2014) quoted the parenthetical remark only, stripped of context: One can specify a precise 6-day window, but only when one can confirm the day of ovulation. The very next sentence states the thrust of Wilcox et al.’s article: “In the present article, we extend these estimates by incorporating the variability in day of ovulation” (2001, p. 212). Though some research has confirmed ovulation with luteinizing hormone (LH) tests (e.g., Gangestad, Thornhill, & Garver-Apgar, 2005; Larson, Pillsworth, & Haselton, 2012; Pillsworth & Haselton, 2006), the vast majority of studies lack a confirmed day of ovulation, and researchers must estimate conception risk from day of the cycle and cycle length alone. Naturally, because ovulation does not occur on exactly the same day every cycle, days associated with relatively high fertility are spread across a range larger than 6 days. (Even knowing the exact day a cycle ended leaves much uncertainty about day of ovulation; ~80% of luteal phase lengths are 14 ± 2 days, ~20% falling outside of this 5-day period; Baird et al., 1995.)
Wilcox et al.’s (2001) Table 1 reports estimated conception risks based on day of the cycle when day of ovulation is unknown. The highest conception risk (Day 13) is .086, the lowest .000. When lacking LH tests, we’ve generally analyzed data based on quantitatively varying estimates (using Wilcox et al.’s in Gangestad, Garver-Apgar, Simpson, & Cousins [2007], but Jöchle’s [1973] estimates in earlier studies), procedures likely most valid and hence preferred. But if one is going to define high and low fertility days (and not eliminate days on the margins), an appropriate split divides the cycle into those days with conception risk closer to .086 (relatively high fertility days) and those with risk closer to .000 or, alternatively, maximizes the point biserial correlation with continuously varying estimates. The result is a window of 9 or 10 days. (For regular cycles, it is 9 days. Jöchle’s estimates yield 8 or 9 high fertility days.) When one adopts the 6-day high fertility phase that Wood et al. (2014) claim is the maximum justifiable, a day with .061 conception risk—nearly three quarters the highest (.086)—is deemed low-fertility.
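The two dichotomization rules just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reanalysis: the day-by-day risk values below are illustrative placeholders chosen only to agree with the figures quoted above (a peak of .086 on Day 13, a minimum of .000, and a .061 day falling just outside a 6-day window), and readers should consult Wilcox et al.’s (2001) Table 1 for the published estimates. The function names are hypothetical.

```python
import math

# ASSUMPTION: illustrative day-of-cycle conception-risk values for a 28-day
# cycle. They are placeholders consistent with the figures quoted in the
# text, not a verified transcription of any published table.
RISK_BY_DAY = {
    1: .000, 2: .000, 3: .001, 4: .002, 5: .004, 6: .009, 7: .017,
    8: .029, 9: .044, 10: .061, 11: .075, 12: .084, 13: .086, 14: .081,
    15: .072, 16: .061, 17: .050, 18: .040, 19: .032, 20: .025, 21: .020,
    22: .016, 23: .013, 24: .011, 25: .009, 26: .008, 27: .007, 28: .006,
}


def midpoint_split(risk_by_day):
    """Days whose risk is closer to the maximum than to the minimum."""
    lowest, highest = min(risk_by_day.values()), max(risk_by_day.values())
    cutoff = (lowest + highest) / 2
    return sorted(day for day, risk in risk_by_day.items() if risk > cutoff)


def pearson(xs, ys):
    """Plain Pearson correlation; with a 0/1 variable it is the point biserial."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def point_biserial_by_width(risk_by_day, widths):
    """Correlate a high/low indicator (the k riskiest days) with continuous risk."""
    days = sorted(risk_by_day)
    risks = [risk_by_day[d] for d in days]
    ranked = sorted(days, key=lambda d: risk_by_day[d], reverse=True)
    result = {}
    for k in widths:
        window = set(ranked[:k])
        indicator = [1.0 if d in window else 0.0 for d in days]
        result[k] = pearson(indicator, risks)
    return result


high_days = midpoint_split(RISK_BY_DAY)
r_by_width = point_biserial_by_width(RISK_BY_DAY, range(2, 13))
best_width = max(r_by_width, key=r_by_width.get)
# Highest risk among the days a 6-day window must classify as low fertility.
best_excluded_risk = sorted(RISK_BY_DAY.values(), reverse=True)[6]

print("Midpoint split, high fertility days:", high_days)
print("Width maximizing the point biserial:", best_width)
print("Highest risk excluded by a 6-day window:", best_excluded_risk)
```

With these placeholder values, both rules select a window wider than 6 days, and the best day excluded by a 6-day window carries a risk of .061. Different published estimates (e.g., Jöchle’s) would need to be substituted for `RISK_BY_DAY` to check how the result changes.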
Clearly, researchers can justify treating more than 6 days as relatively high fertility. Indeed, aiming for a more precise phase of 6 days can actually decrease validity of measurement when day of ovulation is unconfirmed. When the entire range of days is included in analyses, larger windows (8–10 days) should yield stronger effects in the long run than shorter windows (2–6 days). Larger window sizes may also better capture variation in progesterone levels (the luteal phase being the only one characterized by high levels), which is perhaps particularly important to some effects (see Jones, 2014; e.g., nicely illustrated by the detailed endocrinological work by Roney & Simmons, 2013, examining hormonal predictors of women’s sexual desire). These scenarios are “benign” explanations that Harris et al. (2014) could not think of (see also Gildersleeve et al., 2014b).
Timing Effects Versus Study Type
Wood et al. (2014) also argue that robust patterns are artifacts because of timing effects. Notably, older symmetry studies yielded larger effects than newer ones. However, year of publication is strongly confounded with study type. Studies examining preferences for the scent-of-symmetry appeared 1998–2003, whereas those examining preferences for facial symmetry appeared 2002–2009; the point-biserial correlation (rpb) between study type and year of publication exceeds .8. Scent studies also yielded much larger effects: rpb between study type and effect size > .8, p < .01.
Why scent studies yielded larger effects is unclear. Research practices are potential reasons, but so too is a difference in true effect size. As Gangestad and Thornhill (1998) noted, changes in women’s olfactory processing across the cycle are well-established (e.g., Hummel, Gollisch, Wildt, & Kobal, 1991), and steroid hormones very likely affect olfaction (Moffitt, 2003). At the same time, facial symmetry reflects a narrow range of developmental outcomes and, hence, may be a relatively weak indicator of developmental instability (Gangestad & Thornhill, 2003). When preference type is controlled, year of publication does not predict effect size (r < .3, p > .4). The observed decrease in ovulatory shifts in symmetry preferences, then, could simply reflect change over time in the preferences studied. More generally, purported “timing effects” could reflect confounds with other study features (see Gildersleeve et al., 2014b).
Misrepresentation of Specific Studies
Wood et al. (2014) included in their meta-analysis three studies I coauthored (Gangestad & Thornhill, 1998; Thornhill & Gangestad, 1999; Thornhill et al., 2003). All three examined women’s preferences for the scent of symmetrical men, Thornhill and Gangestad (1999) reporting a replication and extension of the original study, Thornhill et al. (2003) a partial replication. We are among those suspected of having capitalized on researcher degrees of freedom because, as Wood et al. portrayed our studies, we used broad fertile windows (9 days) in two studies, a continuous measure in the third (Thornhill & Gangestad, 1999).
More generally, however, Wood et al. (2014) portray our studies inaccurately. In each study, we analyzed data two ways (and, notably, the same ways, with the exception that we took cycle length into account in the last two studies). We examined effects using men as units-of-analysis by splitting the sample of women into high fertility (9 days) and low fertility (the rest of the cycle). We then compared correlations between men’s asymmetry and mean scent attractiveness ratings made by these groups. (In so doing, contrary to Wood et al.’s speculation, we never analyzed our data using a 6-day window.) However, because these analyses were “based only on rough and fixed categories of fertility risk” (Gangestad & Thornhill, 1998, p. 930), our primary analyses treated women as units-of-analysis (as studies generally do) and regressed women’s preference-for-symmetry scores on a continuous estimate of conception risk based on day of the cycle, published actuarial estimates, and (in the later studies) cycle length. These effects were the strongest evidence we reported for cycle effects, mean p < .01, which Wood et al.’s imagined scenario of moving through successive fertility windows can’t speak to. If someone makes a speculation with no direct knowledge of circumstances, it ought to at least be plausible; in these instances, it is not even possible.
P-curves
Prior to putting suspicions of p-hacking (or equivalent phenomena, whether using the term or not) in print, authors should marshal evidence in support of them. Simonsohn et al. (2014) laid out a method for detecting p-hacking: the p-curve. (Conveniently, the method is also able to rule out publication bias alone as an explanation for positive effects.) It relies on the fact that the distribution of significant p-values in a literature is a function of real effects in the world, statistical power to detect them, p-hacking, and sampling variability. If no true effects exist (that is, under the null hypothesis, with reported effects being due to publication bias alone), this distribution will, naturally, be uniform. P-hacking moves some nonsignificant p-values to < .05 through adventitious post hoc decision-making, leading to a left-skewed p-curve, with ps piling up just below .05. By contrast, if true effects exist in the world, the p-curve is right-skewed, with values below .01 overrepresented.
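The logic of the p-curve can be illustrated with a small simulation. The sketch below is a minimal illustration under stated assumptions, not Simonsohn et al.’s (2014) procedure: each hypothetical study is reduced to a single z statistic, the noncentrality value of 2.0 (power of roughly 50%) is arbitrary, the function names are hypothetical, and p-hacking itself is not simulated. It compares only the two benchmark cases, a literature with no true effect and a literature with a true effect, among results that reach p < .05.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def significant_p_values(noncentrality, n_studies, rng):
    """Simulate studies and keep only p-values below .05 (publication filter)."""
    kept = []
    for _ in range(n_studies):
        z = rng.gauss(noncentrality, 1.0)  # one z statistic per study
        p = two_sided_p(z)
        if p < 0.05:
            kept.append(p)
    return kept


def share_below(p_values, cutoff):
    """Proportion of the supplied p-values falling below the cutoff."""
    return sum(p < cutoff for p in p_values) / len(p_values)


rng = random.Random(2014)  # fixed seed so the run is reproducible

# ASSUMPTION: noncentrality 0.0 stands for "no true effect"; 2.0 is an
# arbitrary stand-in for "a true effect studied with moderate power".
null_ps = significant_p_values(0.0, 200_000, rng)
true_ps = significant_p_values(2.0, 200_000, rng)

null_share = share_below(null_ps, 0.025)
true_share = share_below(true_ps, 0.025)

print("No true effect: share of significant ps below .025 =", round(null_share, 3))
print("True effect:    share of significant ps below .025 =", round(true_share, 3))
```

Under the null, about half of the significant p-values fall in the lower half of the 0 to .05 range, the signature of a uniform p-curve. With a true effect, clearly more than half do, which is the right skew described above.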
Wood et al. (2014) did not examine p-curves for evidence for capitalization of researcher degrees of freedom in this literature. Gildersleeve et al. (2014b) did so. Evidence for real effects, not p-hacking or solely publication bias, emerged, with estimated effect size consistent with their meta-analysis, and inconsistent with Wood et al.’s claim, “The few instances in which women’s preferences shifted across the cycle appeared to be largely artifacts of research practices” (2014, p. 245). As Gildersleeve et al. (2014b) note, had critics “examined the publicly available evidence—published p values—they would have found little evidence that the literature on cycle shifts is plagued by false positives” (p. 1278).
Final Word
Speculations that specific researchers, within particular contexts, engaged in inappropriate research practices, in absence of any direct knowledge, can be wrong. Wood et al. (2014) made three errors: They misinterpreted the implications of a key article on fertility risk, overlooked an obvious confound when examining timing effects, and misrepresented several studies, suggesting that multiple analyses were performed before the authors selected which one to present, when a careful reading of the articles shows this scenario to be mistaken.
But what are the costs of wrongful speculation? In the discussion section of articles, authors often offer speculations (e.g., concerning possible interpretations), and may even be encouraged to do so (e.g., to spawn fruitful directions of research). In such cases, the benefits of speculation are thought to often outweigh the costs of mistaken speculation. Wrongful speculations that specific researchers engaged in inappropriate research practices, however, constitute serious allegations. They can cause considerable harm to those researchers through impacts on ability to publish, ability to compete for grants—ultimately, to earn and retain employment as researchers. The corrupting effects of p-hacking on psychological literatures are concerning. So too, however, are the corrosive impacts of unfounded allegations of p-hacking, which raises the question, when should such allegations be published? In light of their potential harm, in my estimation, only with very deliberate care and caution.
