Abstract
Studies have shown that female children, on average, consistently outperform male children in arithmetic. In the research reported here, 1,556 pupils (8 to 11 years of age) from urban and rural regions in the greater Beijing area completed 10 cognitive tasks. Results showed that girls outperformed boys in arithmetic tasks (i.e., simple subtraction, complex multiplication), as well as in numerosity-comparison, number-comparison, number-series-completion, choice reaction time, and word-rhyming tasks. Boys outperformed girls in a mental rotation task. Controlling for scores on the word-rhyming task eliminated gender differences in arithmetic, whereas controlling for scores on numerical-processing tasks (number comparison, numerosity estimation, numerosity comparison, and number-series completion) and general cognitive tasks (choice reaction time, Raven’s Progressive Matrices, and mental rotation) did not. These results suggest that girls’ advantage in arithmetic is likely due to their advantage in language processing.
There is an extensive literature on gender differences in mathematical performance. Boys have been shown to outperform girls in mathematical problem solving (e.g., Benbow & Stanley, 1983; Geary, 1996; Hyde, Fennema, & Lamon, 1990; for reviews, see Hyde, Lindberg, Linn, Ellis, & Williams, 2008; Zhu, 2007), but girls generally outperform boys in arithmetic (e.g., Linn & Hyde, 1989; Willingham & Cole, 1997; but see Fennema, Carpenter, Jacobs, Franke, & Levi, 1998, and Hyde et al., 1990, for some exceptions). Boys’ advantage in mathematical problem solving has been attributed to their superior spatial abilities (e.g., Casey, Nuttall, & Pezaris, 1999; Geary, 1996). The reasons for girls’ advantage in arithmetic, however, are not yet understood.
One plausible reason for this advantage is that girls have an early advantage in verbal abilities (e.g., Burman, Bitan, & Booth, 2008; Hyde & Linn, 1988; Maccoby & Jacklin, 1974). Maccoby and Jacklin (1974) remarked that “female superiority on verbal tasks has been one of the more solidly established generalizations in the field of sex differences” (p. 75). Compared with boys, girls experience an earlier onset of verbal ability and faster vocabulary acquisition, have better reading skills, use more word roots, and speak in longer utterances (Bornstein, Haynes, Painter, & Genevro, 2000; Roulstone, Loader, Northstone, & Beveridge, 2002), and girls’ advantage in reading skills appears to be consistent throughout the high-school years (Mann, Sasanuma, Sakuma, & Masaki, 1990). The risk of language disorders also varies by gender. Flannery, Liederman, Daly, and Schultz (2000) concluded that language and reading disorders are approximately twice as common in boys as they are in girls.
Arithmetic performance is highly dependent on language processing (e.g., Dehaene, Spelke, Pinel, Stanescu, & Tsivkin, 1999). For example, exact calculation involves more language processing than does approximate calculation (Dehaene et al., 1999). Arithmetic calculation elicits greater activations in language-processing areas (including the left superior temporal gyrus, the left precentral gyrus, the left inferior frontal gyrus, and the motor area) than does mathematical problem solving (Lu et al., 2009). Because of the importance of language processing in arithmetic, it is plausible that girls’ advantage in language processing explains their advantage in arithmetic. We tested this hypothesis in the current study.
First, we tested for gender differences in simple calculation (i.e., subtraction using numbers smaller than 20) and complex calculation (i.e., multiplication involving double-digit factors; e.g., 73 × 2). We expected that girls would outperform boys in both tasks.
Second, we examined gender differences in arithmetic performance after statistically controlling for each of the following factors separately: basic numerical processing (including numerosity comparison, numerosity estimation, and number comparison), number-series completion, word rhyming, reaction time, mental rotation, and Raven’s Progressive Matrices. We used a word-rhyming task to assess language processing because it measured a critical component of language ability: phonological awareness (Anthony et al., 2002; Bradley & Bryant, 1983). We expected that we would find no significant differences in arithmetic performance after controlling for performance on this task.
Finally, to investigate whether performance on all tasks other than word rhyming (i.e., both the general cognitive tasks and the numerical-processing tasks) could account for gender differences in arithmetic, we conducted an analysis that included performance on all of these tasks as covariates. We expected that gender differences would remain significant after we controlled for scores on the general cognitive and numerical-processing tasks.
Method
Participants
Third- to sixth-grade pupils from 12 primary schools in the greater Beijing area of China (N = 1,556; 803 males, 753 females; 8–11 years old) participated in this study. Six schools were urban schools, and 6 were rural schools. Four classes of pupils (one class per grade; approximately 30 to 40 children per class) were randomly selected from each school. All participants were native Chinese speakers and had normal or corrected-to-normal eyesight. The study was approved by the Institute of Cognitive Neuroscience and Learning at Beijing Normal University, the administrative departments of education in the relevant counties, and the principals of the schools.
Procedure
Tests were administered to each class of students in a computer classroom. Each class was monitored by either six or seven experimenters (4 to 6 children per experimenter) as well as the teacher of that class. Instructions were given and a practice session was completed before each formal test. The tasks were administered in the same order for all students. For 9 of the 10 tasks, the children indicated their responses by pressing one of two keys on a computer keyboard; for the numerosity-estimation task, they entered a numerical value. Students’ responses were automatically recorded and sent over the Internet to a server located in our laboratory at Beijing Normal University. All data were collected between November 12, 2009, and December 24, 2009.
The practice session for each task consisted of either four or six trials, which were similar to those used in the formal test. The computer provided the child with feedback after each practice trial. For 9 of the 10 tasks, feedback for correct responses was “Correct! Can you go faster?” and feedback for incorrect responses was “It is wrong. Try again.” For the numerosity-estimation task, in which the children had to estimate the number of dots in a dot array, the feedback was the correct number of dots. The children could ask experimenters any questions they had during the practice session. After all the children in a class had finished the practice session and had no more questions for the experimenters, the main experimenter said, “Start,” and the children pressed any key to begin the formal test.
Tasks
All the tasks were programmed using Web-based applications available at www.dweipsy.com. For each of the time-limited tasks (i.e., simple subtraction, complex multiplication, number-series completion, mental rotation, and Raven’s Progressive Matrices), we calculated scores by subtracting the number of incorrect responses from the number of correct responses. For each of the timed tasks (i.e., choice reaction time, numerosity comparison, number comparison, and word rhyming), we calculated each participant’s median reaction time and error rate. For the numerosity-estimation task, we calculated the absolute value of the deviation of the estimated quantity from the correct quantity for each trial and then calculated the mean of the absolute values. All tests appeared to have acceptable reliability (see Table 1).
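The three scoring rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' scoring code: the function names are hypothetical, and the sketch assumes that the median reaction time is taken over all trials of a task, which the text does not specify.

```python
from statistics import mean, median


def time_limited_score(responses):
    """Score a time-limited task: number correct minus number incorrect.

    `responses` is a list of booleans, True for a correct response.
    """
    correct = sum(1 for r in responses if r)
    incorrect = len(responses) - correct
    return correct - incorrect


def rt_summary(trials):
    """Summarize a reaction-time task as (median RT in ms, error rate).

    `trials` is a list of (rt_ms, is_correct) pairs.
    Assumption: the median is computed over all trials, not only correct ones.
    """
    rts = [rt for rt, _ in trials]
    errors = sum(1 for _, ok in trials if not ok)
    return median(rts), errors / len(trials)


def estimation_error(estimates, targets):
    """Mean absolute deviation of estimated from actual dot counts."""
    return mean(abs(e - t) for e, t in zip(estimates, targets))
```

Subtracting incorrect from correct responses penalizes guessing on the two-choice tasks, where a child responding at random would be expected to score close to zero.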
Table 1. Mean Results and Gender Differences for All Tasks
Note: Reaction times (RTs) are given in milliseconds. Gender was coded as 0 for male and 1 for female. Standard deviations are shown in parentheses.
**p < .01. ***p < .001.
Simple subtraction
For all 92 simple-subtraction problems (e.g., 6 − 2, 17 − 8), the minuends were 18 or smaller, and the differences were single-digit numbers. Two candidate answers were presented beneath each problem. Participants were asked to press the “Q” key to choose the answer on the left and the “P” key to choose the answer on the right. For this task, each incorrect candidate answer was within the range of the correct answer plus or minus 3 (i.e., ±1, ±2, or ±3). The children were allotted 2 min to complete this task.
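The construction of a two-alternative subtraction trial can be sketched as follows. This is a hypothetical illustration rather than the software used in the study: the function name, the random assignment of the correct answer to the left or right position, and the exclusion of negative distractors are all assumptions that the text does not state.

```python
import random


def make_subtraction_trial(minuend, subtrahend, rng):
    """Build one two-choice trial for a problem such as 17 - 8.

    The incorrect candidate differs from the correct answer by 1, 2, or 3.
    Assumption: distractors below zero are excluded.
    Assumption: the side of the correct answer is chosen at random.
    """
    correct = minuend - subtrahend
    offsets = [d for d in (-3, -2, -1, 1, 2, 3) if correct + d >= 0]
    distractor = correct + rng.choice(offsets)
    if rng.random() < 0.5:
        left, right = correct, distractor
    else:
        left, right = distractor, correct
    return {
        "problem": f"{minuend} - {subtrahend}",
        "left": left,
        "right": right,
        # "Q" selects the left candidate and "P" the right one.
        "correct_key": "Q" if left == correct else "P",
    }


# Example: one trial for 17 - 8 with a seeded generator for reproducibility.
trial = make_subtraction_trial(17, 8, random.Random(0))
```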
Complex multiplication
All 76 problems in the multiplication task involved one double-digit number multiplied by one single-digit number (e.g., 67 × 9). Every problem required carrying. Four candidate answers were presented beneath each problem: the correct answer and three incorrect candidate answers (i.e., the correct answer plus or minus 1, the correct answer plus or minus 10, and the correct answer plus or minus 100). The other aspects of the procedure (stimuli presentation, method of responding) were the same as those for the simple-subtraction task. Children were allotted 2 min to complete this task.
Numerosity comparison
The numerosity-comparison task was adapted from the second edition of the Test of Early Mathematics Ability (Ginsburg & Baroody, 1990). Two sets of dots of varying sizes were presented simultaneously on the screen, and participants were asked to judge which dot array contained more dots while ignoring the sizes of individual dots. Participants pressed “Q” if they thought the array on the left contained more dots and “P” if they thought the array on the right contained more dots. The number of dots in each set varied from 5 to 12. The total combined area of all dots in each set was controlled to be the same. The test consisted of 36 trials.
Numerosity estimation
The numerosity-estimation task was adapted from Krueger (1982). Participants were asked to quantify the number of dots in an array of 11 to 99 dots. Each dot array was presented in the middle of the screen for 1,000 ms. Participants entered a value into a box at the bottom of the screen. After each trial, participants received feedback showing the correct number of dots and were asked to use the feedback to improve their estimation on subsequent trials. The test consisted of 27 trials.
Number comparison
The number-comparison task was adapted from a Stroop-like number-comparison task used in previous research (Girelli, Lucangeli, & Butterworth, 2000; Zhou, Chen, Chen, et al., 2007). Eighty-four pairs of single-digit Arabic numbers of varying sizes were presented in a random order. For each pair, participants were asked to decide which number was larger in numerical magnitude, while ignoring the differences in the physical size of the numbers. Participants pressed the “Q” key to choose the answer on the left and the “P” key to choose the answer on the right. The magnitude of the numbers could be congruent, incongruent, or neutral with respect to the physical sizes of the numbers (e.g., if the pair of numbers was “3-8,” the 3 could be physically smaller than the 8, physically larger than the 8, or the same size as the 8, respectively). In pairs of differently sized numbers, the ratio of the physical size of the two numbers was 1:2. The number-comparison test consisted of three sessions (28 trials per session), separated by two 30-s rest periods.
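The three trial types in this task can be made concrete with a small classifier. The sketch below is illustrative only; the function name and the representation of physical size as a plain number are assumptions, and it presumes the two digits in a pair differ in value.

```python
def congruency(left_digit, right_digit, left_size, right_size):
    """Classify a Stroop-like number-comparison trial.

    A trial is neutral when both digits have the same physical size,
    congruent when the numerically larger digit is also physically larger,
    and incongruent otherwise. Assumes left_digit != right_digit.
    """
    if left_size == right_size:
        return "neutral"
    numerically_larger_on_left = left_digit > right_digit
    physically_larger_on_left = left_size > right_size
    if numerically_larger_on_left == physically_larger_on_left:
        return "congruent"
    return "incongruent"


# The "3-8" pair from the text, with a 1:2 physical size ratio:
labels = [
    congruency(3, 8, 1, 2),  # 3 physically smaller than 8
    congruency(3, 8, 2, 1),  # 3 physically larger than 8
    congruency(3, 8, 1, 1),  # same physical size
]
```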
Number-series completion
The number-series-completion test was adapted from the Cognitive Abilities Test 3 (Smith, Fernandes, & Strand, 2001). A series of numbers was presented in the middle of the screen. Participants were asked to detect the pattern of these numbers and deduce the next number in the series. For example, the series of numbers “2, 4, 6, 8” would be followed by 10. Two candidate answers were presented beneath each number sequence. Participants pressed the “Q” key to choose the answer on the left and the “P” key to choose the answer on the right. The children were allotted 4 min to complete this test.
Choice reaction time
On each trial of the choice reaction time test, a white dot was presented on a black screen, either to the left or to the right of a fixation cross. The position of the dot was within 15° of visual angle from the cross. Participants were asked to press the “Q” key if the dot appeared on the left and the “P” key if it appeared on the right. There were 30 trials in total (15 trials with the dot on the left and 15 trials with the dot on the right); the side of the screen on which the dot appeared varied randomly across trials. Interstimulus intervals varied randomly between 1,500 ms and 3,000 ms.
Word rhyming
The word-rhyming task consisted of 40 trials and was similar to the task used by Tan et al. (2001). Two Chinese characters were presented simultaneously on the screen. Participants were asked to indicate whether or not the two characters rhymed by pressing the “Q” key for rhyming pairs and the “P” key for nonrhyming pairs. The stimuli remained on the screen until participants responded or 4 s had elapsed.
Mental rotation
The mental rotation task was based on the mental rotation task used by Shepard and Metzler (1971). On each trial, one three-dimensional image was presented on the upper part of the screen, and two more were presented on the lower part of the screen. Participants were asked to choose which image from the bottom of the screen matched the image at the top; the matching image could be identified only by mental rotation. Participants pressed the “Q” key to choose the image on the left and the “P” key to choose the image on the right. The mental rotation test consisted of 180 trials; participants were allotted 3 min to complete it. The rotation angles of the matching images ranged from 15° to 345°, in intervals of 15°. On each trial, the stimuli remained on the screen until the participant responded by pressing the “P” or the “Q” key.
Raven’s Progressive Matrices
The Raven’s Progressive Matrices test (Raven, 1998) was used to assess general intelligence. For this task, participants had to identify the missing segment that would complete a figure’s inherently regular pattern. On each trial, two segments were presented side by side beneath an incomplete figure; participants were instructed to press “Q” if the figure’s missing segment was on the left and “P” if it was on the right. The test consisted of 80 trials; participants were allotted 3 min to complete it.
Data analysis
Because we used a large sample of students from 48 classes, it was necessary to first determine whether the nested data needed to be analyzed with multilevel models. We used the unconditional-means model to compute the intraclass correlation coefficients (ICCs; Peugh & Enders, 2005). The ICCs were .22 for subtraction and .37 for multiplication, suggesting significant between-class variability. We therefore conducted multilevel modeling by using the MIXED procedure in SPSS for all analyses. The following equation was used for Level 1:

$$\text{score}_{ij} = \beta_{0j} + \beta_{1j}(\text{age}) + \beta_{2j}(\text{gender}) + \beta_{3j}(\text{covariates}) + \gamma_{ij}$$

In this equation, $\text{score}_{ij}$ is the simple-subtraction or complex-multiplication score for participant i in class j, and $\beta_{0j}$ is the mean score for class j. $\beta_{1j}$, $\beta_{2j}$, and $\beta_{3j}$ are the slopes of age, gender, and covariates (i.e., scores for various tests) predicting the score within class j; $\gamma_{ij}$ is the random component of the score for participant i in class j.

The following equations were used for Level 2:

$$\beta_{0j} = \gamma_{00} + \gamma_{01}(\text{residence}) + \mu_{0j}$$

$$\beta_{1j} = \gamma_{10}, \qquad \beta_{2j} = \gamma_{20}, \qquad \beta_{3j} = \gamma_{30}$$

In these equations, $\beta_{0j}$ is the mean score for class j, $\gamma_{00}$ is the grand mean score across all classes, $\gamma_{01}$ is the slope of the Level 2 residence variable (urban vs. rural) predicting the mean score for class j, and $\mu_{0j}$ is the random component of the mean score for class j; $\beta_{1j}$, $\beta_{2j}$, and $\beta_{3j}$ are the slopes of age, gender, and covariates, each equal to a fixed coefficient ($\gamma_{10}$, $\gamma_{20}$, and $\gamma_{30}$, respectively) with no class-level random component.

The combined equation was as follows:

$$\text{score}_{ij} = \gamma_{00} + \gamma_{01}(\text{residence}) + \gamma_{10}(\text{age}) + \gamma_{20}(\text{gender}) + \gamma_{30}(\text{covariates}) + \mu_{0j} + \gamma_{ij}$$
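The intraclass correlation that motivated the multilevel approach can be illustrated with a short sketch. Under an unconditional-means model, the ICC is the between-class variance divided by the sum of the between-class and within-class variances. The code below is not the SPSS MIXED analysis reported here: it substitutes the simpler one-way ANOVA estimator of the ICC, so its values would not necessarily match the model-based estimates. The function name and the toy scores are hypothetical.

```python
def icc_oneway(groups):
    """One-way ANOVA estimate of the intraclass correlation.

    `groups` is a list of lists, one list of scores per class.
    This ANOVA estimator stands in for the variance components of an
    unconditional-means multilevel model; it is an approximation.
    """
    n_groups = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)

    # Effective class size, which handles classes of unequal size.
    k0 = (n_total - sum(len(g) ** 2 for g in groups) / n_total) / (n_groups - 1)
    return (ms_between - ms_within) / (ms_between + (k0 - 1) * ms_within)


# Made-up scores for three small classes, for illustration only.
toy_classes = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
toy_icc = icc_oneway(toy_classes)
```

A value well above zero, as in the toy data, indicates that pupils in the same class resemble one another more than pupils in different classes, which is the situation in which ignoring the nesting would understate standard errors.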
Results
Table 1 shows the mean scores and standard deviations for all tasks and the coefficients for gender from the multilevel models. The left panels of Figures 1 and 2 show the mean scores for simple subtraction and complex multiplication, respectively, by gender, age, and area of residence (without controlling for other covariates).

Fig. 1. Average number of correct trials (out of 92) in the simple-subtraction task. The graph on the left shows the raw data; the graph in the middle shows results after controlling for performance on all other tasks except the word-rhyming task; and the graph on the right shows results after controlling for performance on the word-rhyming task only. Error bars indicate 95% confidence intervals.

Fig. 2. Average number of correct trials (out of 76) in the complex-multiplication task. The graph on the left shows the raw data; the graph in the middle shows results after controlling for performance on all other tasks except the word-rhyming task; and the graph on the right shows results after controlling for performance on the word-rhyming task only. Error bars indicate 95% confidence intervals.
Older children significantly outperformed younger children on all tasks. None of the interactions involving gender and arithmetic performance (i.e., performance on the simple-subtraction and complex-multiplication tasks) were significant. Girls outperformed boys in arithmetic (i.e., simple subtraction, complex multiplication) as well as in numerosity comparison, number comparison, number-series completion, choice reaction time, and word rhyming. Boys outperformed girls in mental rotation.
Urban children’s performance was superior to that of rural children on several tasks: simple subtraction, b = −2.55, t(37.17) = −2.40, p = .02; numerosity estimation, b = −0.89, t(41.36) = −2.43, p = .02; number-series completion, b = −1.52, t(37.25) = −3.95, p < .0001; mental rotation, b = −1.53, t(38.09) = −2.84, p = .007; Raven’s Progressive Matrices, b = −2.88, t(42.27) = −5.12, p < .0001; and word rhyming, b = 185.89, t(44.14) = 3.02, p = .004, for reaction times and b = 0.09, t(40.92) = 4.91, p < .0001, for error rates. Urban children also responded faster than rural children on the choice reaction time task, b = 32.09, t(41.72) = 2.43, p = .02.
Table 2 shows the intertask correlations. With few exceptions (e.g., word-rhyming reaction times), correlations were significant and in the expected direction. Most notably, among all tasks other than the arithmetic tasks, error rates for word rhyming had the highest correlations with simple subtraction and complex multiplication. Figures 3 and 4 show the distribution of language ability, the distribution of arithmetic performance, and their relationship by gender. The moderate correlations (approximately −.50) between arithmetic performance (both subtraction and multiplication) and language ability were similar for boys and girls. Forward stepwise regression showed that word rhyming was the most powerful predictor of arithmetic performance (for more details about and results from the stepwise regression analysis, see the Supplemental Material available online).
Table 2. Correlations Among All Performance Variables
Note: RT = reaction time.
*p < .05. **p < .01.

Fig. 3. Scatter plot (with best-fitting regression lines) showing scores on the simple-subtraction task as a function of the proportion of errors on the word-rhyming task. Results are shown separately for boys (r = −.51, p < .0001) and girls (r = −.50, p < .0001).

Fig. 4. Scatter plot (with best-fitting regression lines) showing scores on the complex-multiplication task as a function of the proportion of errors on the word-rhyming task. Results are shown separately for boys (r = −.47, p < .0001) and girls (r = −.49, p < .0001).
Multilevel model analysis showed that after controlling for performance (both error rates and reaction times) on the word-rhyming task, there were no gender differences in performance on the simple-subtraction and complex-multiplication tasks (see Table 3 and the right panels of Figs. 1 and 2). Controlling for performance on the number-series-completion task eliminated gender differences for simple subtraction but not for complex multiplication. No other covariates accounted for gender differences in arithmetic performance. The results of this multilevel model analysis are displayed in Table 3. We further examined whether gender differences in word rhyming could explain gender differences in other tasks. Results showed that after word-rhyming performance was controlled for, boys’ and girls’ performance was similar on the symbolic number-related tasks, including number comparison, b = −0.003, t(1541) = −1.08, p = .28, and number-series completion, b = 0.11, t(1533) = 0.58, p = .57. However, gender differences remained for tasks that did not involve symbolic numbers, including choice reaction time, b = −0.01, t(1540) = −2.58, p = .01; numerosity comparison, b = −28.85, t(1519) = −3.30, p = .001; and mental rotation, b = −1.94, t(1548) = −4.94, p < .0001.
Table 3. Results From Multilevel Modeling Showing Gender Differences in Simple-Subtraction and Complex-Multiplication Performance
Note: Gender was coded as 0 for male and 1 for female.
†p = .055. *p < .05. **p < .01. ***p < .001.
Finally, to investigate whether performance on all tasks (i.e., both general cognitive tasks and numerical-processing tasks) except for word rhyming could account for gender differences in arithmetic performance, we conducted analyses that included performance on all tasks except for word rhyming as covariates. Results from these analyses of performance on the simple-subtraction task and the complex-multiplication task, respectively, are shown in the middle panels of Figures 1 and 2. Gender differences remained significant for both simple subtraction, b = 0.80, t(1522) = 2.29, p = .022, and complex multiplication, b = 0.55, t(1496) = 2.20, p = .028.
Discussion
Our goal in this study was to examine whether gender differences in children’s arithmetic performance could be accounted for by gender differences in language abilities. Our results showed that gender differences in arithmetic were significant and favored girls. Controlling for scores on the word-rhyming task, however, eliminated gender differences in arithmetic performance. In contrast, scores on the basic numerical-processing tasks (number comparison, numerosity estimation, and numerosity comparison) and general cognitive tasks (choice reaction time, Raven’s Progressive Matrices, and mental rotation) did not account for gender differences in arithmetic performance. Finally, controlling for performance on the number-series-completion task eliminated the gender gap for performance on the simple-subtraction task, but not the complex-multiplication task. Gender differences were generally consistent across age groups and areas of residence (urban vs. rural; i.e., there were few interactions between gender and age group or gender and area of residence). These results suggest that girls’ advantage in arithmetic was likely due to their advantage in language processing, rather than an advantage in basic numerical processing or particular cognitive abilities.
Language processing has previously been shown to be involved in arithmetic processing. In an earlier review of behavioral studies, Aiken (1971) concluded that verbal processing is associated with arithmetic performance. More recent imaging studies have further shown that exact calculation (e.g., addition and subtraction) involves more verbal processing than does approximate calculation (e.g., Dehaene et al., 1999). Similarly, Fedorenko, Gibson, and Rohde (2007) showed that language processing was closely related to arithmetic processing because the two types of processing share working memory resources. Moreover, in children, dyscalculia is often accompanied by reading difficulties (Jordan, Hanich, & Kaplan, 2003; Landerl, Bevan, & Butterworth, 2004). There is also evidence that verbal skills are more important for arithmetic (especially mental arithmetic) performance than are other cognitive skills, such as spatial skills (e.g., Solan, 1987).
Consistent with this literature is our finding that performance on the language-processing (word-rhyming) task, among all tasks, was most closely linked to arithmetic performance. More important, our results suggest that gender differences in verbal processing are responsible for gender differences in arithmetic performance. This link might be due to differences in the ways in which boys and girls acquire arithmetic facts. Fennema et al. (1998) showed that beginning in the first grade, girls prefer to use concrete strategies, such as counting or modeling, whereas boys tend to use abstract strategies. The use of concrete strategies, in addition to conceptual knowledge, can improve performance in arithmetic (Siegler & Shrager, 1984).
We found that after controlling for scores on the number-series-completion task, gender differences in simple-subtraction performance became marginal, but gender differences in complex-multiplication performance remained significant. One possible explanation for this finding is that the verbal processing required for simple subtraction is similar to that required for number-series completion, but that complex multiplication involves more verbal processing than either simple subtraction or number-series completion. Previous research has shown that simple multiplication (i.e., multiplication using only single-digit factors) involves more verbal processing than does addition or subtraction (e.g., Zhou, Chen, Zang, et al., 2007; Zhou et al., 2006). On the basis of these findings, we inferred that complex multiplication should involve even more verbal processing than simple multiplication because it involves more verbal representation of multiplication facts.
Several limitations of the current study should be noted. First, we included only one verbal-processing task in our study. Although performance on this task accounted for gender differences in arithmetic performance, a broader array of verbal tasks would have allowed for finer analyses of the links between verbal and arithmetic processing. Second, our sample consisted of primary-school children, so our results may not generalize to children in other age groups. It is not clear whether the continued advantage in verbal processing among older female students (e.g., in high school and college) is linked to their performance in mathematics beyond arithmetic. Empirical evidence has shown that girls lag behind boys in high-school mathematics, even though their language skills are superior (Fleischman, Hopstock, Pelczar, & Shelley, 2010). Finally, it should be emphasized that, although the gender differences we found were consistent across age groups and between students from urban and rural areas, their magnitude was modest. Our results should therefore be interpreted cautiously.
In summary, numerous studies have demonstrated girls’ advantages in verbal processing (e.g., Bornstein et al., 2000; Burman et al., 2008; Hyde & Linn, 1988; Maccoby & Jacklin, 1974). The current study suggests that such advantages account for girls’ superior performance in arithmetic, because of the important role verbal processing plays in arithmetic performance.
Acknowledgements
The authors thank Li Liu for providing materials for the word-rhyming task and Hongyun Liu and Chih-Chien Yang for their help with statistical analyses. The authors also thank the editors and two anonymous reviewers for their comments on this manuscript.
The authors declared that they had no conflicts of interest with respect to their authorship or the publication of this article.
This research was supported by grants from the
