Abstract
Whether acquiring a second language affords any general advantages to executive function has been a matter of fierce scientific debate for decades. If being bilingual does have benefits over and above the broader social, employment, and lifestyle gains that are available to speakers of a second language, then it should manifest as a cognitive advantage in the general population of bilinguals. We assessed 11,041 participants on a broad battery of 12 executive tasks whose functional and neural properties have been well described. Bilinguals showed an advantage over monolinguals on only one test (whereas monolinguals performed better on four tests), and these effects all disappeared when the groups were matched to remove potentially confounding factors. In any case, the size of the positive bilingual effect in the unmatched groups was so small that it would likely have a negligible impact on the cognitive performance of any individual.
An enduring debate in the bilingualism literature is whether learning two or more languages affords benefits to cognition over and above the advantages of simply being able to speak a second language (Bialystok, 2017; Lehtonen et al., 2018). Much of this literature has focused on executive function, which refers to those cognitive processes, largely thought to depend on the frontal lobes of the brain, that are responsible for planning, managing, and executing goals (Owen, Downes, Sahakian, Polkey, & Robbins, 1990). The proposed mechanism underlying the bilingual advantage is that language joint activation, monitoring, and selecting rely on domain-general processes that in turn are strengthened through their use in bilingual language control (Bialystok, 2017). Over the years, several studies have shown that bilinguals outperform monolinguals on executive tasks, including tests of inhibition (Hernández, Costa, Fuentes, Vivas, & Sebastián-Gallés, 2010), cognitive control (Bialystok, Craik, & Luk, 2008), attention (Brito, Murphy, Vaidya, & Barr, 2016), working memory (Grundy & Timmer, 2017), and spatial processing (Morales, Calvo, & Bialystok, 2013). In contrast, other studies have reported no executive-function advantages in bilinguals relative to monolinguals (see Lehtonen et al., 2018, for a meta-analytic review).
The most comprehensive meta-analysis to date on the cognitive advantages of bilingualism examined inhibition, shifting, working memory, monitoring, attention, and verbal fluency but found no evidence for a bilingual advantage (Lehtonen et al., 2018). The results of this meta-analysis stand in contrast to those of a recent review, which concludes that bilinguals outperform monolinguals on a wide variety of cognitive tasks, including inhibition, working memory, and attention, and that these advantages appear to extend into old age, protecting bilinguals from age-related diseases, such as Alzheimer’s and other dementias (Bialystok, 2017). However, one problem with many previous studies is that they are based on relatively small sample sizes (Paap, Johnson, & Sawi, 2016). As a consequence, results are often disproportionately affected by other factors that are known to influence performance on tests of executive function (Noble, Norman, & Farah, 2005), such as socioeconomic status (SES; Morton & Harper, 2007), geographic background (Bialystok et al., 2008), and education (Perani et al., 2017). Additionally, a recent analysis of 104 conference abstracts on the topic of bilingualism and executive function revealed a systematic publication bias: Studies that supported the bilingual-advantage theory were more likely to be published subsequently as full journal articles than those that did not (de Bruin, Treccani, & Della Sala, 2015), casting doubt on the validity of any review or meta-analysis of the published literature.
Whether learning a second language is beneficial is not controversial; there are numerous advantages beyond those potentially afforded to cognition. For example, being able to communicate with a larger audience can lead to greater employment opportunities, more friendships, more potential to socialize, and easier travel in locations where those languages are spoken. All of these things are advantageous in and of themselves. However, whether these advantages extend to improvements in various aspects of executive functioning remains a contentious issue, requiring a larger and broader sample of monolinguals and bilinguals to resolve.
The Internet provides a unique opportunity for examining the relationship between bilingualism and executive function in the general population on a huge scale, allowing data to be sampled from participants from a broad range of socioeconomic, geographical, and educational backgrounds. If learning a second language affords advantages for executive function, then a population of bilinguals should outperform a population of monolinguals on a variety of tests of executive function.
To investigate whether speaking two or more languages is associated with improvement in executive function or protects against age-related cognitive decline, we invited participants to take part in an online study consisting of 12 tasks that compose the Cambridge Brain Sciences battery (www.cambridgebrainsciences.com). This executive battery assesses aspects of inhibition, executive function, selective attention, reasoning, verbal short-term memory, spatial working memory, planning, and cognitive flexibility. All participants also completed a detailed questionnaire, describing how many languages they speak, which languages they speak, which country they grew up in, their SES when growing up, the highest level of education they completed, and their age, gender, and handedness.
Method
The experimental protocol was approved by the University of Western Ontario Office of Human Research Ethics (Protocol No. 109196), and all participants gave written informed consent.
Materials
Sociodemographic questionnaire
To obtain information about the number of languages spoken, which languages were spoken, and demographic variables (such as age, country of origin, SES, and education), we asked participants to complete a detailed questionnaire. The questions used in the present study are available in Appendix S1 in the Supplemental Material available online.
Cognitive tests
Twelve cognitive tests were used to assess a broad range of executive functions, such as inhibition, working memory, problem-solving, and planning. An issue that is often raised in bilingualism research is whether the tests being used are sensitive to cognitive changes. These 12 tests have been validated in patients with anatomically specific frontal-lobe lesions (e.g., Bor, Duncan, Lee, Parr, & Owen, 2006; Owen et al., 1990), in neurodegenerative populations with frontostriatal cognitive impairments (Owen, Sahakian, Semple, Polkey, & Robbins, 1995), and in pharmacological intervention studies (e.g., Mehta et al., 2000). Functional-neuroimaging studies in healthy adults (e.g., Hampshire, Highfield, Parkin, & Owen, 2012) and in neuropathological populations (e.g., Williams-Gray, Hampshire, Robbins, Owen, & Barker, 2007) have shown these tests to be associated with activity in frontal or frontostriatal circuitry. The individual tests are described in detail below, and test-retest reliability measures are given in Table S1 in the Supplemental Material.
Double Trouble is a novel and challenging variant of the Stroop test (Stroop, 1935), a test of inhibition that has been widely used in the bilingualism literature (Bialystok et al., 2008; Blumenfeld & Marian, 2014; Rosselli, Ardila, Lalwani, & Vélez-Uribe, 2016). A target word (either “RED” or “BLUE”) is displayed on the screen in either the color red or the color blue. The participant must select the probe word that correctly describes the color that the target word is drawn in. The problem’s color mappings can be congruent (if every word correctly describes the color it is displayed in), incongruent (if either the target word or both probe words are displayed in the opposite color), or doubly incongruent (if the target and probes are both written in the colors opposite to what they describe). Participants have 90 s to complete as many trials as possible. A correct response increases the total score by 1 point, and an incorrect response decreases the score by 1 point.
Spatial Planning is based on the Tower of London task (Shallice, 1982), which is widely used to measure executive function and has been used in the bilingualism literature (Festman, Rodriguez-Fornells, & Münte, 2010; Gunzenhauser, Karbach, & Saalbach, 2019). Numbered beads are positioned on a tree, and participants must rearrange the beads in ascending numerical order. They have 3 min to solve as many puzzles as possible, and the puzzles become progressively harder—requiring more moves and more complex planning. Trials are aborted if the participant makes more than twice the number of moves required to solve the problem. A successfully completed puzzle increases the final score by (2 × minimum number of moves required – the number of moves made).
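As a minimal illustration of the scoring rule just described, the following Python sketch computes the points awarded for a single puzzle. It is not the platform's code: the function name is hypothetical, and scoring an aborted trial as zero points is an assumption, because the text does not state how aborted trials are scored.

```python
def spatial_planning_points(min_moves: int, moves_made: int) -> int:
    """Points for one puzzle: 2 x minimum moves required, minus moves made."""
    if moves_made < min_moves:
        raise ValueError("a puzzle cannot be solved in fewer than the minimum moves")
    if moves_made > 2 * min_moves:
        # Trial aborted (more than twice the required moves).
        # ASSUMPTION: an aborted trial contributes no points.
        return 0
    return 2 * min_moves - moves_made
```

Under this rule, a puzzle solved in the minimum number of moves earns that number of points, and each extra move costs 1 point.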
Odd One Out is based on a subset of reasoning problems from the Cattell Culture Fair Intelligence Test (Cattell, 1949), which has also been used in bilingual-advantage research (Kempe, Kirk, & Brooks, 2015; Macnamara & Conway, 2014). Nine groups of colored shapes are displayed in a grid. The features (color, shape, number of items) define each group and are related to each other according to a set of rules. Participants must deduce the rules that relate these features and select the group with contents that do not correspond to those rules. They have 90 s to solve as many problems as possible, and the puzzles become progressively more difficult. A correct response increases the final score by 1 point, whereas an incorrect response decreases the score by 1 point.
Grammatical Reasoning is based on Baddeley’s 3-min grammatical-reasoning test (Baddeley, 1968). On each trial, a written statement regarding two shapes is displayed on the screen, and the participant must indicate whether the statement correctly describes the shapes pictured below it. The participant has 90 s to complete as many trials as possible. A correct response increases the total score by 1 point, and an incorrect response decreases the score by 1 point.
Feature Match is based on classic feature-search tasks used to measure attentional processing (Treisman & Gelade, 1980). Attention has widely been used to study bilingual advantages (Bialystok, 2015; Brito et al., 2016). On each trial, two groups of items (each with n items) are displayed beside each other. The groups either are identical in their contents (and item positions) or differ by just one item. Participants have 90 s to complete as many trials as possible, indicating whether the groups match. A correct response increases the final score by n, and the subsequent trial has groups of n + 1 items. If the response is incorrect, the total score decreases by n, and the next trial has groups of n – 1 items.
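The adaptive scoring used in Feature Match (and, with squares in place of items, in Rotations) can be sketched as a simple staircase. The Python below is illustrative only: the starting group size and the lower bound on n are assumptions that the text does not specify, and the 90-s time limit is represented simply by the length of the response list.

```python
def run_staircase(responses, start_n=2, min_n=1):
    """Apply the n-item staircase to a sequence of correct/incorrect responses.

    responses: iterable of booleans (True = correct), one per completed trial.
    start_n, min_n: ASSUMED starting size and floor; not given in the text.
    Returns (final score, group size that the next trial would have used).
    """
    score, n = 0, start_n
    for correct in responses:
        if correct:
            score += n              # correct: gain n points, next trial is harder
            n += 1
        else:
            score -= n              # incorrect: lose n points, next trial is easier
            n = max(min_n, n - 1)
    return score, n
```

For example, two correct responses, one error, and one further correct response starting from n = 2 change the score by +2, +3, -4, and +3.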
Polygons is based on the Interlocking Pentagons task, a test of visuomotor ability often used for assessing age-related disorders (Folstein, Folstein, & McHugh, 1975). It was included here to assess age-related cognitive decline in our two samples. On each trial, two overlapping wire-framed polygons are displayed on the left side of the screen, and participants must indicate whether the shape to the right is identical to one of the two overlapping ones. A correct response increases the total score by the difficulty level, and the subsequent trial will be more difficult (i.e., differences between polygons will be subtler). An incorrect response decreases the total score by the difficulty level, and the next trial will be slightly easier. Participants have 90 s to complete as many trials as possible.
Digit Span is based on the verbal working memory component of the revised Wechsler Adult Intelligence Scale (WAIS-R; Wechsler, 1981) and has been used in bilingual-advantage research (Ratiu & Azuma, 2015; Rosselli et al., 2016). A sequence of digits is displayed one at a time in green in the center of the screen. Participants must then repeat the sequence of digits by selecting them on the on-screen keyboard. Difficulty is dynamically varied, as in previous tests, and the test ends after three mistakes. The resulting score is the length of the longest digit sequence successfully remembered.
Rotations is a task that measures the ability to spatially manipulate objects in mind (Silverman, Choi, Mackewn, Fisher, & Olshansky, 2000). On each trial, two groups of colored squares (each with n squares) are displayed beside each other. One of the groups is rotated by a multiple of 90°. The groups either are identical (when unrotated) or differ by the position of just one item, and participants must indicate whether the groups match. They have 90 s to complete as many trials as possible. A correct response increases the final score by n, and the subsequent trial has groups of n + 1 squares. If the response is incorrect, the total score decreases by n, and the next trial has groups of n – 1 squares.
Token Search is based on a test that is widely used to measure strategy during search behavior (Collins, Roberts, Dias, Everitt, & Robbins, 1998), and similar spatial tasks have been used previously in bilingualism research (Kerrigan, Thomas, Bright, & Filippi, 2017; Morales et al., 2013). A set of boxes, one of which contains a hidden green token, is displayed on a grid. Participants must find the token by clicking the boxes one at a time. Once found, the token is hidden within another box. The token will not appear within the same box twice, so the participant must search the boxes until the token has been found once within each box. An error is committed if the participant checks a box that has already been clicked while trying to find the token or a box that previously contained the token. If the participant makes an error, a new trial begins with one box fewer to search. If the participant finds the token once in each box without making any errors, a new trial begins with one box more to search. The test ends after three errors. The resulting score is the maximum level completed.
Paired Associates is based on a test commonly used to assess memory impairments in aging clinical populations (Gould et al., 2005) and was included here to assess age-related cognitive decline in our two samples. Memory has also been shown to be impaired in patients with neurosurgical removals of frontal-lobe tissue (Owen, Sahakian, Semple, Polkey, & Robbins, 1995). Sets of boxes are displayed at random locations on a grid. The boxes open one after another to reveal an icon, after which they close. The icons are then displayed sequentially in the center of the screen, and the participant must select the box that contained that icon. If the participant remembers all the icon–location pairs correctly, then the next trial will have one box more. If an error is made, the next trial will have one box fewer. The test ends after three errors. The participant’s score is the maximum number of pairs successfully remembered.
Spatial Span is based on the Corsi block-tapping task, a tool for measuring spatial short-term memory capacity. Researchers have widely studied spatial processing when examining the bilingual advantage (Kerrigan et al., 2017; Morales et al., 2013; Rosselli et al., 2016); the version of the test used here is associated with frontal-lobe activity in healthy participants and is sensitive to frontal-lobe removals in patients (Bor et al., 2006). Sixteen purple boxes are displayed in a grid. A sequence of randomly selected boxes turns green one at a time (900 ms per green square). Participants must then repeat the sequence by clicking boxes in the same order. Difficulty is varied dynamically: Correct responses increase the length of the next sequence by one square, and an incorrect response decreases the sequence length. The test ends after three errors. The score is the length of the longest sequence successfully remembered.
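The span-type procedure (used here and, in similar form, in Digit Span and Monkey Ladder) differs from the timed tests in that it stops after three errors and reports the longest sequence reproduced correctly. A minimal Python sketch follows; the starting length, the floor on sequence length, and the assumption that an error shortens the next sequence by exactly one item are illustrative choices not specified in the text.

```python
def span_score(responses, start_len=4, min_len=1, max_errors=3):
    """Longest correctly reproduced sequence under a three-error stopping rule.

    responses: iterable of booleans (True = sequence reproduced correctly).
    start_len, min_len: ASSUMED values; the text does not give them.
    """
    length, errors, best = start_len, 0, 0
    for correct in responses:
        if correct:
            best = max(best, length)   # record the longest success so far
            length += 1                # next sequence is one item longer
        else:
            errors += 1
            if errors == max_errors:
                break                  # test ends after the third error
            length = max(min_len, length - 1)  # ASSUMED step of one item
    return best
```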
Monkey Ladder is based on a task from the nonhuman-primate literature (Inoue & Matsuzawa, 2007), and similar spatial tasks have been used previously in bilingualism research (Kerrigan et al., 2017; Morales et al., 2013). Numbered boxes are displayed simultaneously at random locations within a grid. After a variable interval (number of squares × 900 ms), the numbers disappear, leaving only the boxes. Participants must click the boxes in ascending numerical sequence. Difficulty is varied dynamically, as in Spatial Span. The test ends after three errors, and the resulting score is the length of the longest sequence successfully remembered.
Experimental design
Data were collected via the Cambridge Brain Sciences online platform (www.cambridgebrainsciences.com). The accuracy of online data has been found to be high (Wesnes et al., 2017), and this particular platform has been used in previous large-scale studies (Hampshire et al., 2012; Wild, Nichols, Battista, Stojanoski, & Owen, 2018). After reaching the website, participants were asked to give informed consent and to register with an e-mail address. They next completed a detailed questionnaire inquiring about demographic and lifestyle items (available in Appendix S1), which took approximately 10 min. They were then asked to complete 12 cognitive tests measuring a broad range of cognitive abilities, including inhibition, selective attention, reasoning, verbal short-term memory, spatial working memory, planning, and cognitive flexibility. This testing period took approximately 35 to 40 min.
Only data from the participants who completed all relevant questionnaire items and all 12 tests were included in the analysis. In accordance with local ethical guidelines, we did not include participants below the age of 18. In total, 11,213 participants met these requirements. Data were then cleaned to remove impossible and improbable questionnaire responses. Test scores were filtered for outliers in two passes: Scores more than 6 standard deviations from the mean were assumed to be technical errors and were removed first, eliminating 32 participants. Then scores more than 4 standard deviations from the recalculated mean, which were assumed to be performance outliers, were removed, eliminating 140 participants. Consequently, 11,041 participants were included in the final analysis.
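The two-pass outlier filter can be sketched as follows. This Python example is an illustration rather than the analysis code (which was written in R): it assumes that a participant is excluded if any one of his or her test scores exceeds the threshold, and that the sample standard deviation is used; neither detail is stated in the text.

```python
from statistics import mean, stdev

def drop_outliers(rows, threshold):
    """Keep participants whose every score lies within `threshold` SDs of that test's mean.

    rows: list of per-participant score lists (one column per test).
    ASSUMPTION: one out-of-range score is enough to exclude a participant.
    """
    n_tests = len(rows[0])
    stats = []
    for j in range(n_tests):
        column = [row[j] for row in rows]
        stats.append((mean(column), stdev(column)))  # sample SD (assumed)
    return [
        row for row in rows
        if all(abs(row[j] - m) <= threshold * s for j, (m, s) in enumerate(stats))
    ]

def two_pass_filter(rows, first=6.0, second=4.0):
    """Pass 1 removes presumed technical errors; pass 2 uses recalculated means and SDs."""
    after_first = drop_outliers(rows, first)
    return drop_outliers(after_first, second)
```

Recalculating the mean and standard deviation between passes matters because an extreme value inflates the standard deviation and can hide more moderate outliers during the first pass.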
It should be noted that because the data were collected from volunteers who self-selected and were not randomly assigned to groups, the present study is observational rather than experimental. Although we attempted to control for well-known confounding factors by including them in our regression analysis and by matching our samples, there are of course potential unknown confounds that have not been considered and that may explain our findings.
Statistical analysis
Data were analyzed using the R statistical toolbox (Version 3.6.1; R Core Team, 2019), and all figures were constructed using the R package ggplot2 (Version 2.2.1; Wickham, 2016). Chi-square tests were used to assess proportions of SES, handedness, gender, and education between groups, and a two-sided t test was used to compare age between groups.
Given that the data in this study were observational in nature, group imbalances in demographic variables and other potential confounding factors might drive any observed group differences in cognitive performance. To control for such factors, we constructed two groups (monolingual and bilingual) matched in age, education, SES, gender, and handedness using the R package MatchIt (Version 3.0.2; Ho, Imai, King, & Stuart, 2011) with the nearest-neighbor-matching method. Prior to creating the matched samples, we also removed participants who may have masked any positive effects of bilingualism on task performance. Non-English speakers (who were more likely to be bilingual) may have been at a disadvantage, given that the tests and their instructions were provided in English, so only participants who selected English as one of their spoken languages were included in the matching processes. Similarly, participants who indicated that they were bilingual but selected only a single language were not included in data analysis on the assumption that this was an error or that they did not consider themselves fully bilingual. Finally, participants from some countries (Portugal and “other”) were much more likely to be bilingual, and so the matched samples were constructed from individuals only in Canada, the United States, the United Kingdom, and Australia. Descriptive information for the matched samples, with 372 monolinguals and 372 bilinguals, is in Table 1. Descriptive information for the unmatched sample, with 5,994 monolinguals and 5,047 bilinguals, is in Table S2 in the Supplemental Material.
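To make the idea of nearest-neighbor matching concrete, the following Python sketch pairs each member of one group with the closest not-yet-used member of the other group on a single scalar distance score. It is a deliberately simplified stand-in, not the MatchIt implementation: the scalar score (which could be, for example, an estimated propensity score), the processing order, and all names are assumptions made for illustration.

```python
def greedy_nearest_neighbor(treated, control):
    """One-to-one greedy matching without replacement on a scalar score.

    treated, control: dicts mapping participant id -> distance score.
    Returns a list of (treated_id, control_id) pairs.
    ASSUMPTION: treated units are processed from highest to lowest score.
    """
    available = dict(control)          # controls not yet matched
    pairs = []
    ordered = sorted(treated.items(), key=lambda item: item[1], reverse=True)
    for t_id, t_score in ordered:
        if not available:
            break                      # no controls left to match
        c_id = min(available, key=lambda cid: abs(available[cid] - t_score))
        pairs.append((t_id, c_id))
        del available[c_id]            # each control is used at most once
    return pairs
```

Because each control can be used only once, the procedure yields groups of equal size, as in the matched sample of 372 monolinguals and 372 bilinguals.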
Table 1. Descriptive Statistics for the Two Groups in the Demographically Matched Sample
Note: Welch’s t test was used to compare age.
Factor analysis
Imaging studies have underscored the fact that there is rarely a one-to-one mapping between cognitive functions and the brain areas (or networks) that underpin them. One approach to this issue is to examine the complex statistical relationships between performance on any one cognitive task (or group of tasks) and changes in brain activity to reveal how one is related to the other. To do this most effectively, researchers must include large amounts of data because of the natural variance in cognitive performance (and brain activity) across tests and across individuals. In the age of computerized Internet testing and “big data,” this problem becomes much easier to solve. Hampshire et al. (2012) collected data on the 12 Cambridge Brain Sciences tasks from approximately 45,000 participants. These data were then subjected to a factor analysis, and three discrete factors relating to overall cognitive performance were identified. Each one of these factors is something that no single test can assess; each represents an independent aspect of cognitive function that is best described by performance on a combination of tests. They were labeled, for convenience, as encapsulating aspects of short-term memory, reasoning, and verbal abilities, respectively. This technique allows an individual’s performance to be compared with a very large normative database in terms of these descriptive factors rather than in terms of performance on a single test.
Here, the same 12 tests were used to create three factor scores reflecting performance in three cognitive domains (memory, reasoning, and verbal ability) identified by Hampshire and colleagues (2012). The three cognitive-domain scores were calculated using the formula $Y = X(A_r^{+})^{T}$, where $Y$ is the resulting N × 3 matrix of domain scores, $X$ is the N × 12 matrix of test z scores, $A_r$ is the 12 × 3 matrix of varimax-rotated principal component weights (i.e., factor loadings) from Hampshire et al., and $T$ means “transpose.” All 12 tests contributed to each domain score, as determined by their component weights. The resulting factor scores (i.e., principal component analysis scores) are standardized (i.e., population M = 0, SD = 1.0), so a score above zero indicates that someone is above average.
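The domain-score formula can be illustrated with a short NumPy sketch. Two points are assumptions: the superscript + is read as the Moore-Penrose pseudoinverse (the only reading under which a 12 × 3 matrix yields an N × 3 result), and the loadings below are random placeholders, not the published weights from Hampshire et al. (2012).

```python
import numpy as np

def domain_scores(X, A_r):
    """Compute Y = X (A_r^+)^T.

    X:   (N, 12) matrix of test z scores.
    A_r: (12, 3) matrix of rotated component weights.
    Returns the (N, 3) matrix of domain scores.
    ASSUMPTION: '+' denotes the Moore-Penrose pseudoinverse.
    """
    return X @ np.linalg.pinv(A_r).T

# Placeholder inputs for illustration only (NOT the published loadings).
rng = np.random.default_rng(0)
A_r = rng.normal(size=(12, 3))
X = rng.normal(size=(5, 12))
Y = domain_scores(X, A_r)   # one row per participant, one column per domain
```

One way to check the sketch is that test scores generated exactly from known domain scores, X = Y₀A_rᵀ, are mapped back to Y₀ whenever A_r has full column rank.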
Matched sample
Linear regression
To investigate the effect of bilingualism on performance on each test as well as on our three factors, we performed linear regression separately for each of the 15 scores. Models were constructed as follows: bilingualism (monolingual vs. multilingual), SES (below poverty line vs. at or above poverty line), and handedness (left vs. right) were constructed as binary regressors. Education, gender, country of origin, and languages spoken at home were treated as categorical, with n − 1 regressors. Participants’ age (mean-centered across the entire sample) was also included, as was an Age × Group interaction term. This was done to verify that a second well-studied effect could be replicated in this sample and to further test the hypothesis that bilingualism might provide a cognitive protective effect against aging. The regression models were built and estimated using the R packages stats (R Core Team, 2019) and lmSupport (Version 2.9.13; Curtin, 2018). Bayes factor estimates that compared a model including the bilingualism regressor (i.e., the full model) with a model that did not (i.e., the reduced model) were computed using an approximation based on the Bayesian information criterion (BIC) from these two models, as specified by Wagenmakers (2007). This calculation was similarly performed for the age regressor. All statistical tests were corrected for multiple comparisons using a false-discovery rate (FDR) across scores (12 tests and three factors), and separately for each effect (group, age, and Age × Group). Because large sample sizes will inherently produce significant results in some statistical tests, we included measures of standardized and unstandardized effect sizes, confidence intervals, and Bayes factors to put effects into meaningful context. Including the intercept term, the final design matrix contained 22 columns and 744 rows.
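Two of the computations described above are compact enough to sketch directly. The Python below shows the BIC approximation to the Bayes factor given by Wagenmakers (2007), BF01 ≈ exp[(BIC_full − BIC_reduced) / 2], and an FDR adjustment. The text specifies only that an FDR correction was applied, so the Benjamini-Hochberg procedure used here is an assumption, and the function names are hypothetical.

```python
import math

def bf01_from_bic(bic_reduced, bic_full):
    """Approximate Bayes factor favoring the reduced (null) model.

    Values above 1 favor the model WITHOUT the regressor of interest.
    """
    return math.exp((bic_full - bic_reduced) / 2.0)

def fdr_bh(pvals):
    """Benjamini-Hochberg adjusted p values (ASSUMED FDR procedure).

    Returns adjusted values in the same order as the input.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                       # from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min                      # enforce monotonicity
    return adjusted
```

In the analysis described above, such a correction would be applied across the 15 p values (12 tests and three factors) for one effect at a time.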
Model selection
Following the initial set of linear regressions, we performed model selection to assess whether any effects (or lack thereof) were due to which regressors we chose to include in the model. For this, we used the R package MuMIn (Version 1.43.6; Barton, 2019). Model selection was performed on each of the 12 tests and our three factors as follows. First, the global model was specified, with all predictors including the Age × Group interaction term. Next, models were estimated for every possible nested version of the global model but always with the interaction term, yielding 64 models with unique combinations of regressors. From this, the model with the best fit was selected on the basis of the lowest BIC. We then extracted all parameter estimates and p values for age, group, and the interaction term from each of the models, and we calculated the percentage of models that led to a significant result (p < .05, corrected for multiple comparisons using the FDR). To avoid overcorrection, we calculated the FDR separately for each model variation—that is, for a single iteration of regressors, 15 p values (12 tests and three factors) were extracted and then FDR corrected. This procedure was performed on each of the 64 models. Finally, we determined which regressors were likely to be included in a significant model. Using this methodology, we were able to assess how much the variables included in the model were likely to influence the outcome.
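The enumeration of candidate models can be sketched as follows. The count of 64 is consistent with six optional regressors (2^6 = 64) being added to a core of age, group, and their interaction; that partition is an inference from the text rather than a stated fact, and the regressor labels are hypothetical. The original analysis used the R package MuMIn, which this Python sketch does not reproduce.

```python
from itertools import combinations

# ASSUMED partition of regressors (inferred from the count of 64 models).
ALWAYS = ("age", "group", "age:group")
OPTIONAL = ("ses", "handedness", "education", "gender", "country", "home_language")

def candidate_models():
    """Every subset of the optional regressors, each combined with the fixed core."""
    models = []
    for k in range(len(OPTIONAL) + 1):
        for subset in combinations(OPTIONAL, k):
            models.append(ALWAYS + subset)
    return models

def best_by_bic(bic_by_model):
    """Return the model (dictionary key) with the lowest BIC."""
    return min(bic_by_model, key=bic_by_model.get)

def pct_significant(adjusted_pvals, alpha=0.05):
    """Percentage of model variations in which a term's corrected p value is below alpha."""
    hits = sum(1 for p in adjusted_pvals if p < alpha)
    return 100.0 * hits / len(adjusted_pvals)
```

Fitting each candidate and recording its BIC and the corrected p values for age, group, and the interaction would then give the inputs to `best_by_bic` and `pct_significant`.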
Unmatched sample
We also performed follow-up analyses using the entire unmatched sample (5,994 monolinguals and 5,047 bilinguals) to investigate whether any effects of bilingualism would be observed using a substantially larger, though arguably less controlled, data set. Linear regression models for the 15 scores were constructed just as in the matched sample analysis. With the intercept term, the final design matrix for the global model contained 22 columns and 11,041 rows. The model was then selected in the same manner as specified above in order to determine the set of predictors that led to the highest model fit. FDR correction was again performed separately for each iteration of regressors.
Results
Matched sample
Of the 40,105 participants who registered for the study, 11,213 (age range = 18–87 years) completed all 12 cognitive tasks and all of the questions pertaining to bilingualism, country of birth, SES, and education; 744 participants were included in the final sample after data cleaning and matching were completed (see the Method section). Descriptive information for this subsample is available in Table 1, and distributions including medians, quartiles, and ranges are shown in Figures 1 and 2.

Fig. 1. Distribution of scores for both of the demographically matched groups on each of the 12 tests. Medians are indicated by thick black horizontal lines. The first and third quartiles are marked by the lower and upper edges of the boxes, respectively. Lower and upper whiskers extend to the smallest and largest value, respectively, within 1.5 times the interquartile range. Outlying values beyond these ranges are plotted individually.

Fig. 2. Distribution of scores for both of the demographically matched groups for the three factors. Medians are indicated by thick black horizontal lines. The first and third quartiles are marked by the lower and upper edges of the boxes, respectively. Lower and upper whiskers extend to the smallest and largest value, respectively, within 1.5 times the interquartile range. Outlying values beyond these ranges are plotted individually.
For each of the 15 scores of cognitive performance, the model including only age, group, and the interaction term provided the best fit; none of them showed a significant group effect (Table 2) or a significant Age × Group interaction (Table 3, Figs. 3 and 4). Bayes factors strongly supported the null hypotheses of no group effect and no Age × Group interaction for all 15 scores, with a single exception: For the Rotations test, the Bayes factor provided anecdotal evidence in support of the Age × Group interaction (though the interaction was still not significant). All tests and factors showed a statistically significant effect of age, except for Odd One Out (see Table S3 in the Supplemental Material).
Table 2. Bilingualism Regression Parameters for the Best-Fitting Model Following Model Selection in the Matched Sample
Note: BF01 is the Bayes factor showing the likelihood of the null over the alternative hypothesis. CI = confidence interval.
Table 3. Age × Group Interaction Regression Values for the 12 Cognitive Tasks and Three Factors in the Demographically Matched Sample
Note: BF01 is the Bayes factor showing the likelihood of the null over the alternative hypothesis. CI = confidence interval.

Fig. 3. Plots showing the linear relationship between age and scores for each of the tests in the matched sample. For each regression line, a 95% confidence ellipse and a 95% confidence interval are shown. Effect sizes are reported in Table S3 in the Supplemental Material. Individual data points have not been included because of the large sample size.

Fig. 4. Plots showing the linear relationship between age and scores for each of the three factors in the matched sample. For each regression line, a 95% confidence ellipse and a 95% confidence interval are shown. Individual data points have not been included because of the large sample size.
When examining the distribution of significant p values resulting from the 64 model variations, we found that no test showed a significant group effect in any model. The age term was significant in 100% of models for all tests and factors except Odd One Out. Finally, no test or factor showed a significant Age × Group interaction in any model. Distributions of group-level p values for each test and factor are shown in Figures 5 and 6. Distributions of group-level bilingualism parameter estimates for each test and factor are shown in Figures S3 and S4 in the Supplemental Material.

Fig. 5. Distributions of p values for each test over 64 models in the matched sample. The dashed blue line indicates a p value of .05.

Fig. 6. Distributions of p values for each factor over 64 models in the matched sample. The dashed blue line indicates a p value of .05.
Unmatched sample
In the unmatched sample, 5,047 participants reported speaking two or more languages, whereas 5,994 participants reported speaking only one language, as outlined in Table S2. The two groups were well matched in terms of gender, χ²(2, N = 11,041) = 3.65, p = .162, and handedness, χ²(1, N = 11,041) = 0.92, p = .338. Bilinguals were younger than monolinguals, t(10832) = 15.38, p < .001, and a larger proportion of them were from high-SES backgrounds, χ²(1, N = 11,041) = 15.10, p < .001. The groups differed in their proportions of levels of education, χ²(4, N = 11,041) = 380.00, p < .001, but the overall pattern did not favor one group or the other (see Table S2). On average, bilinguals reported speaking 2.57 languages (range = 2–9).
Scores on each of the 12 tests and three factors were again submitted to linear regression, and the global model included all regressors and the Age × Group interaction. Distributions including medians, first and third quartiles, and ranges for each test are shown in Figures S1 and S2 in the Supplemental Material.
As shown in Table 4, the set of regressors that provided the best fit differed depending on the test or factor. Regression coefficients of the best-fitting model for each test and factor are summarized in Table 5 (describing the group term) and Table 6 (describing the interaction term). In five tests and two factors, the selected model showed a significant group effect, but only Digit Span showed a bilingual advantage, ΔR2 < .01, β = 0.05, t(11031) = 2.52, p = .029; Grammatical Reasoning, Feature Match, Rotations, and Token Search, and both the Verbal and Reasoning factors, showed a monolingual advantage. Similar to the findings in the matched sample, all tests and factors showed a significant effect of age except for Odd One Out (see Table S4 in the Supplemental Material). No tests or factors showed a significant Age × Group interaction (see Figs. 7 and 8).
Parameters for the Best-Fitting Model in the Unmatched Sample and the Percentage of Significant Results in 64 Model Iterations
Note: BIC = Bayesian information criterion; SES = socioeconomic status.
Group Regression Parameters for the Best-Fitting Model Following Model Selection in the Unmatched Sample
Note: BF01 is the Bayes factor showing the likelihood of the null over the alternative hypothesis. CI = confidence interval.
Interaction Regression Parameters for the Best-Fitting Model Following Model Selection in the Unmatched Sample
Note: BF01 is the Bayes factor showing the likelihood of the null over the alternative hypothesis. CI = confidence interval.

Plots showing the linear relationship between age and scores for each of the tests in the unmatched sample. For each regression line, a 95% confidence ellipse and a 95% confidence interval are shown. Because the groups were not age matched, the monolingual ellipse begins and extends farther right than the bilingual ellipse in each of the plots. Individual data points have not been included because of the large sample size.

Plots showing the linear relationship between age and each of the three factors in the unmatched sample. For each regression line, a 95% confidence ellipse and a 95% confidence interval are shown. Individual data points have not been included because of the large sample size.
When examining the distribution of significant p values resulting from the 64 model variations, we found that six tests and two factors showed a significant group effect some proportion of the time, depending on the set of regressors (exact percentages are shown in Table 4). Bilinguals showed an advantage in Double Trouble and Digit Span, whereas monolinguals showed an advantage in Feature Match, Rotations, Token Search, and their overall Reasoning factor score. The direction of the advantage varied for Grammatical Reasoning (25% monolingual advantage and 37.5% bilingual advantage) and the Verbal factor score (12.5% monolingual advantage and 75% bilingual advantage), depending on the set of regressors included. The age term was again significant in 100% of models for all tests and factors except Odd One Out. Only Grammatical Reasoning showed a significant interaction, in 12.5% of models (Table 5), and in all cases, monolinguals had a steeper decline with age. Distributions of group-level p values for each test and factor are shown in Figures 9 and 10. Distributions of group-level bilingualism parameter estimates for each test and factor are shown in Figures S5 and S6 in the Supplemental Material.
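The count of 64 model variations is what one obtains by switching six optional terms in or out of the regression (2^6 = 64). The sketch below enumerates such a model space as formula strings. This section does not list which terms were varied, so the six term names, the always-included `group` term, and the outcome name `digit_span` are all placeholders chosen for illustration.

```python
from itertools import combinations

# Placeholder names: the six optional terms are an assumption, not the authors' list.
optional_terms = ["age", "gender", "handedness", "ses", "education", "age:group"]

def all_model_formulas(outcome, always=("group",), optional=optional_terms):
    """Return one formula string per subset of the optional terms."""
    formulas = []
    for k in range(len(optional) + 1):
        for subset in combinations(optional, k):
            terms = list(always) + list(subset)
            formulas.append(f"{outcome} ~ " + " + ".join(terms))
    return formulas

formulas = all_model_formulas("digit_span")
print(len(formulas))  # 2**6 subsets
print(formulas[0])    # smallest model: group term only
print(formulas[-1])   # global model: all terms
```

Fitting every formula and recording how often the group term is significant yields percentages of the kind reported in Table 4. A real analysis might also exclude subsets that contain the interaction without the corresponding main effect, which this sketch does not do.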

Distributions of p values for each test over 64 models in the unmatched sample. The dashed blue line indicates a p value of .05.

Distributions of p values for each factor over 64 models in the unmatched sample. The dashed blue line indicates a p value of .05.
Although some tests showed significant effects, tests of significance should be interpreted with caution given the large sample size, and the focus should be placed on measures such as effect sizes and confidence intervals. The effect sizes indicate that being bilingual explains less than 1% of the variance in all significant results; for example, bilinguals outperformed monolinguals by 0.05 standard deviations on Digit Span. Because of the difficulty in interpreting null results, we examined the data further by estimating the BICs for both the full and reduced models, which were subsequently used to calculate the Bayes factor for the full model (Wagenmakers, 2007). We found support for a group difference only on the tests in which monolinguals showed an advantage, with the BIC for Digit Span (BIC = 168.69) strongly supporting the null hypothesis. A Bayesian analysis of the other eight tasks and factors strongly or decisively supported the null hypothesis, and the data suggest that the pattern of results was more likely to occur if there were no differences between bilinguals and monolinguals (BF01s and effect sizes are reported for all 12 tasks and three factors in Table 5).
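The BIC approximation to the Bayes factor described by Wagenmakers (2007) can be sketched as follows: BF01 ≈ exp((BIC_alt − BIC_null) / 2), where the null (reduced) model omits the term of interest and the alternative (full) model includes it. The function names and the BIC values in the first example are hypothetical; they are not taken from the tables.

```python
import math

def bf01_from_bic(bic_null: float, bic_alt: float) -> float:
    """Approximate Bayes factor in favor of the null: exp((BIC_alt - BIC_null) / 2)."""
    return math.exp((bic_alt - bic_null) / 2)

def posterior_prob_null(bf01: float) -> float:
    """Posterior probability of the null, assuming equal prior odds."""
    return bf01 / (1 + bf01)

# Hypothetical BICs: the full model is penalized 6 points more than the reduced model.
example_bf01 = bf01_from_bic(bic_null=1000.0, bic_alt=1006.0)
print(round(example_bf01, 2), round(posterior_prob_null(example_bf01), 3))

# If one extra term does not improve the fit at all, the BICs differ only by
# the complexity penalty ln(n), so BF01 equals sqrt(n).
n = 11_041
ceiling_bf01 = bf01_from_bic(bic_null=0.0, bic_alt=math.log(n))
print(round(ceiling_bf01, 1))
```

The second calculation illustrates why, in a sample of this size, a term can reach p < .05 while the Bayes factor still favors the null: the penalty for each added parameter grows with ln(n).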
Discussion
In this study of 11,041 participants, no reliable differences in executive function were observed between monolinguals and people who reported speaking more than one language. First, when we created matched groups to eliminate confounds that may be masking an executive function advantage in bilinguals, and to ensure that our groups met the criteria for being either monolingual or bilingual, we found no significant group differences. Second, when utilizing the entire (large, though unbalanced) data set, we found that only one task, Digit Span, showed an advantage in performance in bilinguals. Although this result is statistically significant, it is important to put it in perspective: The regression coefficient was 0.05. In real terms, this means that, statistically, speaking a second language is associated with better memory for digits, but that difference is one twentieth of 1 standard deviation. To further put this into context, we note that the standardized effect size (i.e., ηp2) was less than .01, which is well below what is considered small, confirming that this effect was trivial, even if it was statistically significant. Further, though p was below .05, the Bayes factor showed strong support for the null hypothesis, calling the statistical significance into question. In 11 other cognitive tasks and our three cognitive factors, including several that have previously suggested a bilingual advantage, there were either no differences between groups or a positive difference for monolinguals (although these differences had negligible effect sizes).
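To give a sense of scale for a difference of one twentieth of a standard deviation, the sketch below converts it into a percentile shift. It assumes that scores are normally distributed and that the coefficient of 0.05 can be read as a standardized mean difference; both are simplifying assumptions made for illustration, and the function name is hypothetical.

```python
from statistics import NormalDist

def percentile_after_shift(d: float, start_percentile: float = 50.0) -> float:
    """Percentile reached after moving up by d standard deviations on a normal curve."""
    z = NormalDist().inv_cdf(start_percentile / 100)
    return 100 * NormalDist().cdf(z + d)

shifted = percentile_after_shift(0.05)
print(round(shifted, 1))
```

Under these assumptions, a person at the 50th percentile would move to roughly the 52nd percentile.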
Another issue that we examined in this study was whether being bilingual protects against age-related cognitive decline (Bialystok, 2017; Perani et al., 2017). There was no interaction between group (bilinguals vs. monolinguals) and age in either our age-, education-, SES-, and language-matched subgroup or our full, unmatched sample. Therefore, this study provides no support for such protective effects, even in tests that are sensitive to age-related decline (e.g., Paired Associates and Polygons). Indeed, Bayes factors for all tests showed substantial or strong support for the null hypothesis.
Through model selection, we were able to identify which regressors needed to be included to provide the best fit to the data. This also showed that the set of regressors included in the model can sometimes lead to a significant result; when groups were not well matched, there were a number of combinations of regressors that led to significant bilingual advantages. We highlight that Double Trouble, a test of inhibition that is a variant of the Stroop task (Stroop, 1935) and one of the tasks most frequently used in bilingual-advantage research, showed a significant result in our unmatched sample 50% of the time, depending on the set of regressors included. This suggests that extreme caution in regressor selection must be taken when testing whether bilinguals show cognitive benefits over monolinguals, as spurious results can occur, potentially explaining the large discrepancy in the literature over whether such effects exist.
Despite these results, several potential caveats need to be considered. First, is it possible that the 12 tasks did not assess aspects of cognition that are relevant to a potential bilingual advantage? This is very unlikely, as versions of most of the tests have been used in previous work demonstrating the cognitive benefits of bilingualism. For example, Double Trouble is a version of the Stroop task and a measure of inhibition that has been used extensively in this research area (Bialystok et al., 2008; Blumenfeld & Marian, 2014). Similarly, spatial tasks have been used to show a bilingual advantage (Morales et al., 2013), but none of our spatial tasks had a significant group effect. The battery of tests employed was also cognitively broad; many of the tasks required executive function (Bor et al., 2006; Owen et al., 1990, 1995), and all required aspects of attention and working memory. If there were benefits to any of these processes afforded by bilingualism, it is reasonable to expect that they would be expressed through differences in performance on some, or all, of these tasks. It is of course possible that differences between monolinguals and bilinguals would have been observed if we had used a different set of cognitive tasks entirely, although in the context of the available literature on bilingualism and executive function, it is not at all clear what those tasks would have been, nor what executive processes they would have tapped.
Second, is it possible that the 12 tasks included in the battery are simply not sensitive to the subtle effects of bilingualism? This is also extremely unlikely, because the tasks have previously been shown to be highly sensitive to subtler cognitive differences related to disease or pharmacological intervention. For example, the test of planning (the Hampshire Tree Task) is sensitive to performance differences between specific genotypes in early Parkinson’s disease (Williams-Gray et al., 2007); tests of paired-associates learning, such as the one employed in this study, are able to distinguish between first-episode schizophreniform psychosis and established schizophrenia (Wood et al., 2002); and the Token Search task used here has been used to detect increases in spatial working memory in children with attention-deficit/hyperactivity disorder following a low dose of methylphenidate (Mehta et al., 2000). More importantly, however, the sheer sample size of more than 11,000 participants makes it extremely unlikely that a genuine effect of bilingualism on executive function would have been missed if it were there.
Third, is it possible that the observed results occurred because the two samples were not perfectly matched with respect to age, SES, and education? This is not the case, as the effects of all three factors were controlled by including them as variables of no interest. However, even if this statistical procedure did not adequately control their effects, separate analyses run on an age-, SES-, and education-matched subsample again provided absolutely no evidence for a bilingual advantage, although age effects remained.
Finally, whereas previous studies have shown that online testing produces results that are comparable with those acquired in more traditional lab-based settings (Hampshire et al., 2012), it is possible that inaccurate reporting of demographic information and test scores led to data that were too noisy for differences to emerge. However, when we imposed strict cleaning procedures in the matched subsample, ensuring that our bilingual sample met several criteria for bilingualism, the effects seen in the unmatched sample disappeared completely.
These results demonstrate that, across a broad battery of cognitive tasks of executive function, no systematic differences exist between monolinguals and bilinguals. When groups were poorly matched, a difference on Digit Span was detected, although given the modest size of this effect in terms of the performance advantage it affords and the weak support for the difference, its real-world relevance is questionable.
We conclude by emphasizing, however, that despite the fact that no meaningful relationship was found between bilingualism and executive function, the broader social, employment, and lifestyle benefits that are available to speakers of a second language are clearly numerous.
Supplemental Material
Supplemental material for Bilingualism Affords No General Cognitive Advantages: A Population Study of Executive Function in 11,000 People by Emily S. Nichols, Conor J. Wild, Bobby Stojanoski, Michael E. Battista and Adrian M. Owen in Psychological Science:
sj-pdf-1-pss-10.1177_0956797620903113
sj-pdf-2-pss-10.1177_0956797620903113
sj-pdf-3-pss-10.1177_0956797620903113
Footnotes
Acknowledgements
This research was supported by the Canada Excellence Research Chairs Program (Grant No. 215063).
Transparency
Action Editor: Charles Hulme
Editor: D. Stephen Lindsay
Author Contributions
E. S. Nichols codesigned the study, analyzed and interpreted the data, and took overall responsibility for writing each draft of the manuscript. C. J. Wild codesigned the study, designed the data checking and cleaning protocols, was responsible for converting data into a format for analysis, and contributed to each draft of the manuscript. B. Stojanoski and M. E. Battista codesigned the study and contributed to each draft of the manuscript. A. M. Owen codesigned the study and contributed to each draft of the manuscript. All authors approved the final version of the manuscript for submission.
References
