Abstract
Stereotype threat has been offered as a potential explanation of differential performance between men and women in some cognitive domains. Questions remain about the reliability and generality of the phenomenon. Previous studies have found that stereotype threat is activated in female chess players when they are matched against male players. I used data from over 5.5 million games of international tournament chess and found no evidence of a stereotype-threat effect. In fact, female players outperform expectations when playing men. Further analysis showed no influence of degree of challenge, player age, or prevalence of female role models in national chess leagues on differences in performance when women play men versus when they play women. Though this analysis contradicts one specific mechanism of influence of gender stereotypes, the persistent differences between male and female players suggest that systematic factors do exist and remain to be uncovered.
The topic of sex differences in cognition evokes strong reactions, including accusations of sexism, essentialism, political correctness, and the denial of human nature (Fine, 2010; Halpern et al., 2007; Pinker, 2003). As psychological scientists, we know that the reality of any observed sex difference is one issue, and the causal pathways leading to any observed sex difference are another. Simply put, we cannot infer from a real difference between the sexes that this difference is inevitable, immutable, or inborn (Griffiths, Machery, & Linquist, 2009; Mameli & Bateson, 2011). To diagnose any difference as innate, we would need clarity on the mechanisms producing that difference, mechanisms that potentially span genetic inheritance, developmental influences, the interactions of genetics with the environment, and the ongoing influences of adult society on cognitive performance.
Possible environmental influences on sex differences in cognition come in different flavors. There are those that affect the development of skills and preferences across the life span; those that, through cultural ideas of gender, affect others’ judgment; and those that affect our own behavior. Demonstrating the reality, or lack of reality, of one potential mechanism does not speak to the reality of the others. Nevertheless, if we are to devise an accurate account of the emergence of sex differences in cognition, each potential mechanism needs to be tested and verified.
Stereotype Threat
One notable psychological phenomenon that can influence performance on cognitive tests is stereotype threat, whereby an individual’s awareness of a negative stereotype influences his or her performance (Inzlicht & Schmader, 2012). This was originally proposed for African Americans and intelligence test performance (Steele & Aronson, 1995) and has since been extended to other domains, most pertinently for our purposes to women and performance in nonstereotypically feminine domains of achievement, such as mathematics (Spencer, Steele, & Quinn, 1999).
Stereotype threat has been offered as a partial explanation for sex differences on cognitive tasks (e.g., Fine, 2010). The suggested mechanisms for the effect are plausible (increased anxiety, performance monitoring, or negative thought suppression, which create additional working memory load; Beilock, Rydell, & McConnell, 2007; Schmader, Johns, & Forbes, 2008), but it is important to recognize that (a) establishing the reality of even a true effect in laboratory conditions is not straightforward and (b) regardless of the reality of the stereotype-threat effect, there are other reasons for sex-differentiated performance (cf. Sackett, Hardison, & Cullen, 2004).
Stereotype Threat and Publication Bias
Recent analyses have suggested that the literature on stereotype threat suffers from publication bias (Doyle & Voyer, 2016; Flore & Wicherts, 2015; Ganley et al., 2013; Stricker, 2008). If studies reporting a positive effect are more likely to be published, then this will exaggerate the true size and robustness of stereotype threat. Despite this, other meta-analyses have attested to the reality of the effect (Doyle & Voyer, 2016; Lamont, Swift, & Abrams, 2015; Nguyen & Ryan, 2008). One 2016 review states that “stereotype-threat effects are generally robust, with moderate to small effect size” (Spencer, Logel, & Davies, 2016, p. 418).
An approach that may complement experimental studies of stereotype threat is to investigate its impact on cognitive performance outside the lab. This also makes it possible to assess the importance of stereotype threat amid the myriad influences on behavior in daily life. Field studies make it possible to access vastly increased statistical power over typical experimental studies.
Chess
Chess has an illustrious history within cognitive science (Charness, 1992; Chase & Simon, 1973; Newell, Shaw, & Simon, 1958), providing a paradigmatic example of cognitive skill, and a test bed for theories of skill acquisition and performance. Aside from its worldwide popularity and historical and cultural interest, chess has the advantage of being a skill with minimal perceptual or motor requirements. The upper bound on an individual’s performance is his or her cognitive capacity in planning and his or her ability to reason through the complex space of possible moves. Chess also has the advantage that players are rated using the Elo system (Elo, 1978), which updates according to a player’s success or failure in games against other rated players. This provides an objective measure of skill that is not directly contaminated by the subjective perception of observers.
Chess is heavily male dominated both in terms of the absolute number of male players and in terms of male representation among the best chess players. The stereotypical chess grandmaster is undeniably a man, and—because of the face-to-face nature of tournament play—it is difficult for gender not to be salient when a female chess player competes with a man. If the stereotype-threat phenomenon is robust and general, then we should be able, with the right analysis, to observe it operating in chess.
Previous research has explored a number of possible competing explanations for the underrepresentation of women in chess (Bilalić, Smallbone, McLeod, & Gobet, 2009; Chabris & Glickman, 2006). In chess, both observational (Rothgerber & Wolsiefer, 2014) and experimental (Maass, D’Ettole, & Cadinu, 2008) studies appear to confirm the existence of stereotype threat. Rothgerber and Wolsiefer (2014), looking at 219 female chess players, report that “stereotype threat susceptibility was most pronounced in contexts that could be considered challenging: when playing a strong or moderate opponent” (p. 79). Maass and colleagues (2008) ran a study using Internet chess in which the perceived gender of opponents was experimentally manipulated with 42 female participants. When they believed they were playing an opponent of the opposite gender, female players were less likely to win. If these findings apply widely to chess performance, they have the potential to systematically undermine the performance of female players.
So although an obvious disparity exists in participation rates between men and women, there is uncertainty over the mechanisms by which this is perpetuated. In particular, the phenomenon of stereotype threat offers a specific psychological mechanism whereby cultural stereotypes and the existing relative paucity of female role models can interact with gender to hamper women’s achievements in chess, but this has not been convincingly established for a wide age range playing at the higher levels of the game. This is what I set out to do in the present study. Apart from their importance to understanding chess, these data also provide an opportunity to interrogate a real-world domain for the reality, or not, of the effects of gender on performance, including any stereotype-threat effects.
Method
The data consist of records of 9,662,202 games of standard tournament chess played between January 2008 and August 2015. There are also records of 461,637 players rated by FIDE (the World Chess Federation; 56,474, 12.2%, of these players are women). The average birth year for all included players was 1983, and they had an average age of 31.5 years (SD = 19.28) at the time the games were played. In recent years, an increasing number of younger players have joined the rating system, expanding the number of rated players and lowering the average rating.
For each player, the data consist of a unique player ID, date of birth, gender, nationality, and details of the games he or she played (including the color he or she played as—White or Black—who the opponent was, the tournament this was part of, and the outcome). The data also contain all players’ official FIDE ratings calculated according to the Elo system. This system updates players’ ratings according to game outcomes, and it can be used to predict the outcome of a match between any two rated players and rank any player against the historical community of all players within the system. Because of this, it is possible to compare players who have never played and may not even be contemporaneous.
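For readers unfamiliar with Elo ratings, the prediction step can be sketched with the standard Elo expected-score formula, E = 1 / (1 + 10^(-(R_A - R_B) / 400)). This is the textbook form, shown only for illustration: it ignores the advantage of playing White, and it should not be read as the exact function used in this analysis. The function name is hypothetical.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score for player A against player B under the standard
    Elo formula (win = 1, draw = 0.5, loss = 0)."""
    return 1.0 / (1.0 + 10.0 ** (-(rating_a - rating_b) / 400.0))


# Two equally rated players are each expected to score half a point.
print(elo_expected_score(2000, 2000))            # 0.5
# A 400-point advantage corresponds to an expected score of 10/11.
print(round(elo_expected_score(2200, 1800), 3))  # 0.909
```

Because the formula depends only on the rating difference, it can be applied to any two rated players, including players who never met.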
When analyzing game outcomes, I included only games of standard tournament chess between players who both possessed FIDE ratings and were active during the 92-month period for which I had data. This left 5,558,110 games from 150,977 male players and 16,158 female players.
To investigate the possibility of stereotype threat, I compared women’s performance when playing against a man, and when playing against another woman, with the expected outcome from when a man plays against a man. An advantage of chess is that we are able to precisely gauge the challenge presented by individual games to each player by comparing players’ Elo ratings. As well as looking at the difference in outcome by gender of opponent, I also investigated whether player age and prevalence of other female chess players affected the outcome.
Analysis scripts, as well as a sample of 5% of players represented in the full game-by-game data set, are available at the Open Science Framework (OSF; https://osf.io/aeksv). For commercial reasons, this full raw data set is not available at the time of writing. I do provide the full (summary) data that support the key analyses presented here. While I acknowledge that it is not appropriate to use null-hypothesis significance testing (NHST) to guide interpretation of my data, I do report the p values of standard null hypothesis tests in places. This is to ease comparison for readers familiar with NHST; such readers will note that no p values I report are marginal. Everything that might be considered significant is extremely significant, and everything that is not significant is resolutely not significant.
Results
Differences in ratings
In the player record, the average FIDE rating for men was 2,070 (SD = 186) and for women 1,978 (SD = 195). This difference is statistically significant, t(460345) = 35.51, p < .001. For reference, a rating above 2,500 is associated with Chess Grandmaster level (at this level, 98.9% of players in these data were male). The ratio of the standard deviations of ratings for women to men was 1.05, showing higher variability in women’s ratings (as in the findings of Chabris & Glickman, 2006).
Differences in by-game performance
These data also allow us to look at how individual game performance is affected by player characteristics. The Elo system provides a predicted outcome for any match based on the rating difference between the two players. Figure 1 shows the observed relationship between rating difference and game outcome for games featuring men only. The rating difference is the rating of the player playing White minus the rating of the player playing Black. For outcome, a win for the player playing White was coded as 1, a win for the player playing Black was coded as 0, and a draw was coded as .5.
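The coding scheme just described can be written down in a few lines. This is a minimal sketch: the result labels and the function name are hypothetical and are not taken from the analysis scripts.

```python
# Outcome coding from White's point of view, as described in the text.
RESULT_CODES = {"white_win": 1.0, "draw": 0.5, "black_win": 0.0}


def code_game(white_rating, black_rating, result):
    """Return (rating difference, outcome), both from White's point of view."""
    return white_rating - black_rating, RESULT_CODES[result]


# Three made-up games: (White's rating, Black's rating, result label).
games = [
    (2100, 2000, "white_win"),
    (1950, 2050, "draw"),
    (2000, 2200, "black_win"),
]
coded = [code_game(*game) for game in games]
print(coded)  # [(100, 1.0), (-100, 0.5), (-200, 0.0)]
```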

Figure 1. Average game outcome as a function of difference in players’ Elo ratings (data are from 4,659,239 games from male-only competitors). For game outcome, a win for the player playing White was coded as 1, a win for the player playing Black was coded as 0, and a draw was coded as .5; 95% confidence intervals are given but are too small to be visible.
As expected, there was a clear relationship between the relative player rating and game outcome. Note that at around 0 difference in player ratings, the average outcome is above .5—showing, as is widely known, that the White player has an advantage. In order to subsequently calculate predicted outcome for any rating difference, I fitted a logistic function to the observed data for games featuring male players only.
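The text does not give the exact functional form or fitting procedure, so the following is only a sketch under stated assumptions: a two-parameter logistic curve (an intercept, which can absorb White's advantage, and a slope on the rating difference) fitted by maximum likelihood with Newton-Raphson steps. The function names, the rescaling by 400 points, and the synthetic data are all illustrative choices, not details from the analysis scripts.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(diffs, outcomes, scale=400.0, iterations=25):
    """Fit outcome ~ sigmoid(a + b * diff / scale) by Newton-Raphson on the
    cross-entropy loss. Outcomes may be 0, 0.5, or 1 (or bin averages)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for d, y in zip(diffs, outcomes):
            x = d / scale
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g_a += p - y            # gradient, intercept
            g_b += (p - y) * x      # gradient, slope
            h_aa += w               # Hessian entries
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        a -= (h_bb * g_a - h_ab * g_b) / det
        b -= (h_aa * g_b - h_ab * g_a) / det
    return a, b


def predict(diff, a, b, scale=400.0):
    """Predicted outcome for White at a given White-minus-Black difference."""
    return sigmoid(a + b * diff / scale)


# Synthetic, noise-free check: generate average outcomes from known
# parameters and confirm that the fit recovers them.
true_a, true_b = 0.15, 2.3
diffs = list(range(-600, 601, 50))
outcomes = [sigmoid(true_a + true_b * d / 400.0) for d in diffs]
a_hat, b_hat = fit_logistic(diffs, outcomes)
print(a_hat, b_hat)
print(predict(0, a_hat, b_hat))  # above 0.5 when the intercept is positive
```

A positive intercept places the curve above .5 at zero rating difference, which is how a curve of this form would express White's advantage.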
I coded all the games in the data set according to whether they are played between two men (MM), two women (FF), or mixed-gender pairings, with a woman playing White (FM) or Black (MF). The difference in rating allows us to precisely operationalize the challenge presented by each game. Stereotype threat is most likely to manifest in challenging situations (Rothgerber & Wolsiefer, 2014), and playing someone with a higher rating would be a perfect example. International chess tournaments are certainly challenging, and the difference in Elo rating allows us to precisely gauge the challenge presented in any particular pairing.
Using the function derived from MM games (see above), I calculated the average difference between the observed outcome and the predicted outcome for both FM and MF games (reversing the sign for MF games, so that for both FM and MF games, a negative number represents a worse-than-expected outcome for the female player). This calculation tells us how female players perform, relative to expectations, when facing a male player. I did this across the range of possible rating differences for players, using a binning width of 125 Elo points. The results are shown in Figure 2. Note that this figure shows the variation around the function shown in Figure 1: By removing the variation due to rating difference, it allows us to focus on the other factors that influence game outcome.
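The sign reversal and binning described above can be sketched as follows. Several things here are assumptions made for illustration: the standard Elo curve stands in for the function fitted to MM games, bins are aligned to multiples of 125 points, and all function names are hypothetical.

```python
from collections import defaultdict


def stand_in_prediction(white_minus_black):
    """Placeholder for the curve fitted to MM games (standard Elo form,
    which, unlike the fitted curve, has no White advantage)."""
    return 1.0 / (1.0 + 10.0 ** (-white_minus_black / 400.0))


def female_residual(pairing, white_rating, black_rating, outcome, predict):
    """Return (female-minus-male rating difference, observed minus predicted
    outcome from the female player's point of view) for an FM or MF game.
    `outcome` is coded from White's point of view (1, 0.5, or 0)."""
    diff = white_rating - black_rating
    residual = outcome - predict(diff)      # White's point of view
    if pairing == "FM":                     # the woman played White
        return diff, residual
    if pairing == "MF":                     # the woman played Black: flip signs
        return -diff, -residual
    raise ValueError("expected a mixed-gender pairing, 'FM' or 'MF'")


def binned_mean_residual(games, predict, bin_width=125):
    """Average the female-perspective residual within rating-difference bins."""
    sums, counts = defaultdict(float), defaultdict(int)
    for game in games:
        diff, residual = female_residual(*game, predict)
        bin_start = (diff // bin_width) * bin_width
        sums[bin_start] += residual
        counts[bin_start] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Made-up games: (pairing, White's rating, Black's rating, outcome for White).
games = [
    ("FM", 2000, 2000, 1.0),   # woman (White) beats an equally rated man
    ("MF", 2100, 2000, 1.0),   # higher-rated man (White) beats the woman
    ("MF", 1900, 2000, 0.0),   # woman (Black) beats a lower-rated man
]
result = binned_mean_residual(games, stand_in_prediction)
print(result)
```

Flipping both the rating difference and the residual for MF games puts every mixed game on the same footing: positive values always mean the woman did better than her rating predicted.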

Figure 2. Mean difference between observed game outcome and predicted game outcome as a function of difference in players’ Elo ratings, separately for games between two women (FF) and mixed-gender pairings (regardless of whether the woman played White or Black; FM + MF). The Elo difference is calculated for the White minus Black player for the two male (MM) and two female (FF) pairs and for the female minus male player for FM and MF pairs. The baseline expectation from analysis of games between two men is shown in black. Shaded regions indicate 95% confidence intervals. Data are from 5,558,110 games total.
A stereotype-threat effect should reduce the probability of a woman winning when she plays a man, compared with when a man plays a man (the baseline) or when a woman plays a woman. Graphically, this should appear as a lower curve for the mixed-gender group (FM + MF). In particular, we would expect that this effect would manifest most strongly when a woman plays a superior opponent (so in the negative portion of the x-axis).
The opposite was the case—female players outperformed expectations when facing male players, across the whole range of rating differences. Note the scale on Figure 2: A difference of 0.01 from the predicted outcome is a 1% increment in the probability of winning a game, or one extra win in 100 games, compared with the baseline expectation. The observed average for mixed pairs was above the average for same-sex pairs (both MM and FF). This is the opposite of a stereotype-threat effect, reflecting a lift in female chess players’ performance when playing a male opponent above their rating-predicted performance.
Another angle on these data is to look for upsets: games with a strong favorite (based on Elo ratings) in which the favorite lost. I took a rating difference of 500 Elo points as an arbitrary threshold for defining games with a strong favorite (note from Figure 1 that this rating difference predicts a victory for the stronger player with ~95% probability). Between two male players (MM), 3.18% of such games resulted in upsets, and between two female players (FF), 2.83% of such games resulted in upsets. The number of upsets was higher for mixed pairs (FM and MF; p < .0001 using Fisher’s exact test). Of those games between mixed pairs where the female player was overmatched, upsets occurred 3.70% of the time. Of those games between mixed pairs where the male player was overmatched, upsets occurred 3.51% of the time. Although upsets were numerically more likely to favor the female player, this was not statistically significant (p = .562 using Fisher’s exact test).
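The upset analysis can be illustrated with a small, self-contained sketch. The text reports percentages and p values but not the underlying counts, so the counts below are hypothetical and the resulting p value is not a result from this study. Whether the 500-point threshold is inclusive is also an assumption, as are the function names.

```python
from math import comb


def is_upset(rating_a, rating_b, outcome_a, threshold=500):
    """True if one player was favored by at least `threshold` points and lost.
    `outcome_a` is player A's result: 1 = win, 0.5 = draw, 0 = loss."""
    gap = rating_a - rating_b
    if gap >= threshold:
        return outcome_a == 0.0
    if gap <= -threshold:
        return outcome_a == 1.0
    return False


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    sum the probabilities of all tables with the same margins that are no
    more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)   # tolerance for rounding error
    )


print(is_upset(2300, 1750, 0.0))   # True: the strong favorite lost

# Hypothetical counts only: upsets and non-upsets among games with a strong
# favorite, split by which player was overmatched.
p_value = fisher_exact_two_sided(37, 963, 35, 965)
print(p_value)
```

Exact binomial coefficients keep the sketch dependency-free, but at the scale of the full data set a statistics library would be the more practical choice.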
To confirm this reversed stereotype-threat pattern, I switched to using the individual players rather than games as the base unit of analysis. The advantage of this is that it better controls for confounding factors, such as a change in both the rating and gender proportion of players across time (e.g., if more women and more weaker players are entering the international chess ratings). Using each female player as her own control, I calculated the difference between actual game outcome and expected game outcome given the relative rating of the players, both for games where she played another woman and for those where she played a man.
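A per-player comparison of this kind can be sketched as below. The sign convention (performance against men minus performance against women, each relative to expectation) is an assumption chosen so that a negative value would indicate classic stereotype threat. The normal-approximation confidence interval, the function names, and the data are likewise illustrative.

```python
from math import sqrt
from statistics import mean, stdev


def player_effect(residuals_vs_men, residuals_vs_women):
    """Mean (actual - expected) outcome against men minus the same quantity
    against women, for a single female player."""
    return mean(residuals_vs_men) - mean(residuals_vs_women)


def mean_with_ci(values, z=1.96):
    """Mean and normal-approximation 95% confidence interval."""
    centre = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return centre, (centre - half_width, centre + half_width)


# Made-up residuals for four players: (games against men, games against women).
players = {
    "player_a": ([0.10, -0.05, 0.04], [0.02, -0.03]),
    "player_b": ([0.00, 0.06], [0.01, 0.03, -0.01]),
    "player_c": ([-0.02, 0.02, 0.03], [0.00, 0.02]),
    "player_d": ([0.05, 0.01], [-0.04, 0.02]),
}
effects = [player_effect(men, women) for men, women in players.values()]
average_effect, interval = mean_with_ci(effects)
print(average_effect, interval)
```

Players with no games against one of the two groups have no defined effect under this scheme and would need to be left out, since `mean` raises an error on an empty list.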
Across all female players, the average stereotype-threat effect was 0.014, which is significantly different from zero (95% confidence interval = [0.010, 0.017]) and is again the reverse of the classic stereotype-threat effect. Figures 3 and 4 show that there is no systematic variation in the size of the stereotype-threat effect by the proportion of female players in different national chess leagues or by birth year of the player. To confirm this, I fitted a regression model predicting the size of the stereotype-threat effect for each female player from the player’s birth year and the proportion of female players in her country of origin, as well as their interaction. The confidence intervals for the estimated influence of these factors all included zero, as shown in Table 1. These estimates came from an overall model that explained little of the variance, R² = .003, F(3, 12687) = 13.72, p < .001.
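A regression of this shape (two predictors plus their interaction) can be sketched with ordinary least squares. This is an illustration under stated assumptions, not the model specification from the analysis scripts: the predictors are mean-centered here to keep the arithmetic stable, which changes how the main-effect coefficients are interpreted, and the data are synthetic and noise-free so that the recovered coefficients can be checked. All names are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_interaction_model(birth_years, female_props, effects):
    """OLS fit of effect ~ intercept + year + proportion + year * proportion,
    with both predictors mean-centered. Returns the four coefficients."""
    mean_year = sum(birth_years) / len(birth_years)
    mean_prop = sum(female_props) / len(female_props)
    rows = [
        [1.0, y - mean_year, p - mean_prop, (y - mean_year) * (p - mean_prop)]
        for y, p in zip(birth_years, female_props)
    ]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    xty = [sum(r[i] * e for r, e in zip(rows, effects)) for i in range(4)]
    return solve_linear_system(xtx, xty)


# Synthetic check: build effects from known coefficients on a small grid.
grid = [(y, p) for y in range(1950, 2001, 10) for p in (0.05, 0.10, 0.20)]
years = [y for y, _ in grid]
props = [p for _, p in grid]
true_coefs = [0.014, 0.0002, 0.01, -0.0005]
my, mp = sum(years) / len(years), sum(props) / len(props)
effects = [
    true_coefs[0] + true_coefs[1] * (y - my) + true_coefs[2] * (p - mp)
    + true_coefs[3] * (y - my) * (p - mp)
    for y, p in grid
]
coefs = fit_interaction_model(years, props, effects)
print(coefs)
```

The sketch returns point estimates only. Confidence intervals of the kind reported in Table 1 would additionally require the residual variance and the inverse of the cross-product matrix, which a statistics library provides directly.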

Figure 3. Country-level stereotype-threat effect as a function of the proportion of female chess players in that country. Error bars indicate 95% confidence intervals.

Figure 4. Average stereotype-threat effect (data points) and proportion of female players in the data set (continuous line) as a function of birth year. Error bars on the data points show 95% confidence intervals.
Table 1. Results of the Regression Predicting the Size of Stereotype-Threat Effects Across Individuals
Note: CI = confidence interval.
Discussion
The data I used here allow us to explicitly test for the operation of stereotype threat in this particular domain as one candidate mechanism by which social context may affect performance. Contrary to previously published research (which used smaller samples and a narrower range of abilities; Maass et al., 2008; Rothgerber & Wolsiefer, 2014), these results showed that stereotype threat does not appear to affect chess at this level. Female players, far from suffering a stereotype threat, display a boost in performance when playing men compared with playing women.
I note that tournament chess is a different task from those that were used to establish the stereotype-threat phenomenon. In particular, for any rated player, chess will be a highly familiar task, and task novelty has been shown to interact with stereotype threat via arousal (Ben-Zeev, Fein, & Inzlicht, 2005; O’Brien & Crandall, 2003). So, paradoxically, it could be that stereotype-related anxiety raises performance, protecting against threat effects in these data. It may be that the older age of the sample, the higher playing standard, or the greater pressure of international competition induces a professionalism among players that also protects against stereotype threat.
If stereotypes are not negatively affecting female players’ performance against male players in chess, what mechanisms are producing the difference for mixed pairs compared with single-sex pairs? One plausible mechanism is a degree of male underperformance rather than female overperformance. This could be due to male underestimation of female opponents, misplaced chivalry, or “choking” from the ego-threat of being beaten by a woman (Baumeister, 1984). I note a recent analysis of Grand Slam tennis that suggests that men may be particularly vulnerable to choking (Cohen-Zada, Krumer, Rosenboim, & Shapir, 2017). The analysis of upsets supports this idea. It seems more likely that any psychological factor would cause a favorite to throw a game with an unwise move than that an underdog would be able to play a whole game at the level required to overcome a large rating-difference disadvantage.
The question of the underrepresentation of women in chess remains unsolved. I have merely provided evidence that stereotype threat is an unlikely mechanism for sustaining any difference in male-female ratings once players have achieved a standard that allows them to hold an FIDE rating. Some researchers (Bilalić et al., 2009; Charness & Gerchak, 1996) suggest that the gender difference at the top of the distribution is a natural consequence of different participation rates—in other words, that the low number of women in the highest echelons of chess is the simple result of the much larger number of men in the population of chess players from which the best players are drawn. It is certainly a problem that analysis of rated players limits the conclusions that can be drawn, because we are in effect looking only at a subset of all possible players (Vaci, Gula, & Bilalić, 2014). From this perspective, the primary factor that may need to be explained is the difference in participation between men and women in chess itself, rather than any difference in ratings or maximal achievement (which may be explained sufficiently by differential participation).
Recently, chess has been a focus for large-scale analytics (Chassy & Gobet, 2015; Howard, 2006; Leone, Slezak, Cecchi, & Sigman, 2014; Vaci & Bilalić, 2016), and I see this study as part of that trend. Future work with these data has great potential for uncovering differences in change in expertise as well as performance. Future work on chess is sure to focus on within-game dynamics as well as on the dynamics of ratings. To the end of promoting integration of existing work and further exploration of the rich data provided by FIDE chess ratings, I am happy to make the analysis scripts available immediately at OSF (https://osf.io/aeksv), along with a subset of the data, full summary data supporting the regression analysis, and in time, the full raw, game-by-game, data.
The current study shows that the stereotype-threat phenomenon has boundary conditions. A proviso is that the analysis requires one to accept the operationalization used here—that of contrasting games where female players play male opponents with those where female players play female opponents. It may be, of course, that stereotype threat affects female chess players in different ways. Such a broader view of the phenomenon has many advantages (Lewis & Sekaquaptewa, 2016). Nonetheless, in the current study, I looked with a very highly powered statistical lens at female performance in a highly gender-stereotyped domain, using the advantage of a large sample to look in exactly the place where, from a reading of the literature, we would expect to find stereotype threat if it existed (younger players and female players relatively deprived of role models). The evidence suggests no stereotype-threat effect, with—in fact—a small effect in the opposite direction.
Other studies of stereotype threat in high-stakes real-world settings are not consistent (Stricker, 2008; Stricker & Ward, 2004; Walton & Spencer, 2009). For example, one field study failed to show the stereotype-threat effect, showing that gender priming could lift girls’ scores on educational tests (Wei, 2012). Another field study replicated the effect in the original domain (Black students and math performance) but failed to find evidence of the effect in the domain of gender (Stricker, Rock, & Bridgeman, 2015). Obviously, there is significant work to do on defining the conditions under which we can expect stereotype threat to manifest.
Working with very large data sets introduces some new opportunities for the cognitive scientist (Goldstone & Lupyan, 2016; Stafford & Dewar, 2014). Experimental and observational studies complement each other. They have different advantages, such as allowing strong causal inference for experimental studies or more easily allowing high statistical power for observational studies. They also train our scientific imaginations in different ways. Experimental studies encourage us to focus on isolated causal factors. Observational studies encourage us to see all factors in the context of other factors (Stafford & Haasnoot, 2017). Observing a phenomenon in the wild provides a strong validation of the generality and robustness of an effect. Lab studies of stereotype threat have illustrated one mechanism by which social attitudes may create discrimination. This study of one social attitude in one domain—gender stereotypes in chess—does nothing to disprove the reality of discrimination generally, but it does suggest that this one mechanism, stereotype threat, may be more limited in its applicability than one might conclude from reading the experimental literature alone.
Acknowledgements
The Sonas 92 data set used in this analysis was prepared by Jeff Sonas of Sonas Consulting (
Action Editor
Bill von Hippel served as action editor for this article.
Author Contributions
T. Stafford is the sole author of this article and is responsible for its content.
Declaration of Conflicting Interests
The author(s) declared that there were no conflicts of interest with respect to the authorship or the publication of this article.
Funding
This research was supported in part by a Leverhulme Trust project grant on bias and blame (RPG-2013-326).
Open Practices
All materials have been made publicly available via the Open Science Framework and can be accessed at https://osf.io/aeksv/. The complete Open Practices Disclosure for this article can be found at https://journals-sagepub-com.web.bisu.edu.cn/doi/suppl/10.1177/0956797617736887. This article has received the badge for Open Materials. More information about the Open Practices badges can be found at
.