Abstract
We investigated the extent to which the observed gender differences in mental rotation ability among 2,468 freshmen studying engineering at a Midwest public university could be attributed to gender bias in the test. The Revised Purdue Spatial Visualization Tests: Visualization of Rotations (Revised PSVT:R) is a spatial test frequently used to measure students’ spatial visualization ability in three-dimensional mental rotation in science, technology, engineering, and mathematics fields. Using two major approaches for evaluating measurement invariance, we found that five items in the Revised PSVT:R showed a difference in the response pattern by gender, but the impact of these biased items on the total scores on the scale was marginal. Our findings support the equitable use of the Revised PSVT:R by gender for educational research and practices.
Spatial ability refers to “the ability to generate, retain, retrieve, and transform well-structured visual images” (Lohman, 1996, p. 98). Mental rotation ability, a sub-component of spatial ability, involves cognitive processing to mentally rotate visual stimuli, which are often two-dimensional (2-D) or three-dimensional (3-D) objects, toward the directions that were indicated by a comparison stimulus or instruction (Linn & Petersen, 1985; Uttal et al., 2013). Because the literature suggests that spatial ability has shown a positive link to academic and career success, particularly in science, technology, engineering, and mathematics (STEM) fields (e.g., Wai, Lubinski, & Benbow, 2009), and the ability is a prerequisite for developing quantitative reasoning skills, a spatial test often has been used to predict students’ academic success (e.g., Sorby, Casey, Veurink, & Dulaney, 2013). A review of relevant literature also indicated the malleability of spatial skills with appropriate interventions, and discussed the possibility that spatially enriched education may increase the opportunity for participation in STEM disciplines for all students, especially minority and/or female students pursuing careers in STEM fields (Hill, Corbett, & St. Rose, 2010; Uttal et al., 2013).
Although research provides strong evidence that spatial ability is crucial for success in STEM fields, the existence of gender differences in mental rotation ability—with males scoring higher—has widely been supported by meta-analyses (e.g., Linn & Petersen, 1985; Maeda & Yoon, 2013). For example, Linn and Petersen (1985) reported an effect size (i.e., the standardized mean difference between males and females on spatial tasks) of 0.73 favoring males. Recent evidence suggests that this trend has not changed for the last three decades (Maeda & Yoon, 2013). In addition, the gender differences are likely to remain after spatial training, although both females and males tend to improve their performance to the same extent through the intervention (e.g., Uttal et al., 2013).
Although the literature discusses several possible reasons for gender differences in mental rotation ability, Maeda and Yoon (2013) found that employed assessment procedures moderate the magnitude of the gender difference. For example, males’ outperformance on a spatial test tends to increase with a stringent time limit for testing, as opposed to no time limit or relaxed time limit conditions. The procedural impact on gender differences is of particular interest because it suggests the observed gender differences may partially derive from measurement errors or construct-irrelevant factors resulting from employed procedures for measuring spatial ability. Investigation of measurement bias against females on spatial tests is critical because the use of a biased test may explain persistent underperformance of females on spatial tasks, consequently leading to underrepresentation of females in STEM fields. Therefore, investigating a possible gender bias inherent in measuring spatial ability is imperative to support the fair use of the test in educational settings (AERA, APA, & NCME, 2014).
Although various spatial tests have been used in STEM research and education, psychometric evaluations of these tests are limited, particularly regarding the fairness of the test or items for making sound educational decisions (Maeda & Yoon, 2013). Given the lack of investigation into the psychometric properties of spatial tests, particularly potential gender bias, we conducted this study to estimate the extent to which the observed gender differences result from bias in the instrument used for measuring spatial performance. For this investigation, we selected the Revised Purdue Spatial Visualization Tests: Visualization of Rotations (Revised PSVT:R; Yoon, 2011) due to its pervasive use in STEM education research, as well as supporting evidence for high reliability and validity from past research on the instrument (Maeda & Yoon, 2013; Maeda, Yoon, Imbrie, & Kang, 2013; Yoon, 2011).
Method
Data and Data Analysis
The Revised PSVT:R measures the 3-D mental rotation ability of individuals 13 years or older (Yoon, 2011). The Revised PSVT:R contains 30 items consisting of 13 symmetrical and 17 nonsymmetrical 3-D objects that are drawn in a 2-D isometric format. Each item asks a respondent to mentally rotate an object in the same direction as visually indicated in the instructions. The respondent is then asked to select the right answer from five possible response options.
We used archival data of the Revised PSVT:R obtained from 2,468 engineering freshmen (1,888 males [76.5%] and 580 females [23.5%]) who took the spatial test in the fall of 2010 or 2011. Cronbach’s alphas for the two gender groups were almost equal: .816 for females and .834 for males.
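For readers who want to check reliability on their own response data, the following is a minimal sketch of Cronbach's alpha for a respondents-by-items matrix of 0/1 scores. The function name `cronbach_alpha` and the toy data are illustrative assumptions; they are not part of the original analysis, and the article does not report which software produced its alpha values.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    For dichotomous (0/1) items this is the same quantity as KR-20 up to the
    choice of variance denominator, which is n - 1 here for all variances.
    """
    k = len(rows[0])  # number of items
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the total (sum) score.
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Hypothetical toy responses (5 respondents, 4 items), for illustration only.
    toy = [
        [1, 1, 1, 0],
        [1, 0, 1, 1],
        [0, 0, 1, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 0],
    ]
    print(round(cronbach_alpha(toy), 3))
```

Computing alpha separately for the female and male subsamples, as reported above, would simply mean calling the function once per group.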
We chose two major approaches to examine the potential bias and measurement invariance by gender: differential item functioning (DIF) analyses and a multiple-groups confirmatory factor analysis (MCFA). We began with a series of descriptive analyses to examine how observed score distributions differ by gender. Next, we conducted DIF analyses with three methods to check the convergence of findings across methods: (a) the Mantel–Haenszel (M-H) method, (b) the logistic regression method, and (c) a three-parameter logistic (3-PL) item response model. Because of the large sample size used in the study, we used both statistical and practical significance for identifying a DIF item. We selected these three methods because of their distinctive differences in procedures to identify items that show DIF. We used DIFAS software (Penfield, 2005, 2012) to run M-H analyses, using the total score on 30 items as a matching variable, and SPSS 22 (2013) to run a series of logistic regression analyses. Although there are a variety of approaches for item response theory (IRT)-based DIF analyses, they tend to produce relatively consistent findings (e.g., Yang et al., 2011). Thus, we selected one approach, a likelihood-ratio test using freeware called IRTLRDIF (Thissen, 2001), in the current investigation. We also conducted a differential test function (DTF) analysis to evaluate whether the set of items as a whole (i.e., the test) is biased by gender, because bias at the individual item level may have little impact if the test as a whole does not show bias.
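To make the M-H procedure concrete, the sketch below computes the Mantel–Haenszel common odds ratio for one item, stratifying respondents by total score, and converts its log to the ETS delta metric (MH D-DIF = -2.35 × ln of the common odds ratio). It is an illustration only and is not the DIFAS implementation. The function names and the "ref"/"focal" labels are assumptions, and the classification helper uses effect-size magnitude alone, whereas the full ETS scheme also involves significance tests.

```python
import math
from collections import defaultdict


def mantel_haenszel_dif(item, total, group):
    """Return (common odds ratio, log-odds ratio, MH D-DIF) for one item.

    item  : list of 0/1 responses to the studied item
    total : list of matching scores (e.g., total score on all items)
    group : list of "ref" (e.g., males) or "focal" (e.g., females)
    Raises ZeroDivisionError if no stratum has both a reference-group
    incorrect response and a focal-group correct response.
    """
    # One 2x2 table per score stratum:
    # [ref correct, ref incorrect, focal correct, focal incorrect]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for x, s, g in zip(item, total, group):
        idx = (0 if x == 1 else 1) + (0 if g == "ref" else 2)
        tables[s][idx] += 1

    num = den = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        num += a * d / n
        den += b * c / n

    alpha = num / den          # > 1 means the item favors the reference group
    lor = math.log(alpha)
    return alpha, lor, -2.35 * lor


def ets_magnitude(d_dif):
    """Magnitude-only version of the ETS A/B/C labels (significance omitted)."""
    size = abs(d_dif)
    if size < 1.0:
        return "A"
    if size < 1.5:
        return "B"
    return "C"
```

Under this sign convention, a negative MH D-DIF indicates that the item is harder for the focal group than for reference-group respondents with the same matching score.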
Finally, we ran MCFA with the robust weighted least squares (WLSMV) estimator and theta parameterization in Mplus 7.0 (Muthén & Muthén, 1998-2012). Because each Revised PSVT:R item produces a binary (correct or incorrect) response, WLSMV is appropriate for generating accurate estimates for dichotomous indicators (Brown, 2006). We first examined the equivalence of the factor structure across the two gender groups. Then, we evaluated the equivalence of measurement parameters (i.e., factor loadings and thresholds in tandem) using chi-square tests for the difference between restrictive and less restrictive models. Because the result of the chi-square test was significant, we tested partial measurement invariance models by releasing equality constraints on the factor loadings and thresholds of items with a large modification index, each in turn and together (Brown, 2006; Sass, 2011).
Results and Discussion
Consistent with the majority of the extant literature (e.g., Linn & Petersen, 1985; Maeda & Yoon, 2013), we observed gender differences on both raw scores, t(914.4) = 12.20, p < .01, and IRT-based ability scores, t(964.6) = 12.75, p < .01, on the Revised PSVT:R. On average, male students answered 77% of the items correctly (M = 23.13, SD = 4.97), whereas female students correctly answered 67% of the items (M = 20.11, SD = 5.29). The magnitude of the difference was relatively large (Hedges’ g = .60) and similar in size to that reported in the meta-analytic study by Maeda and Yoon (2013; Hedges’ g = .57).
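The raw-score statistics above can be approximately recovered from the reported group means, standard deviations, and sample sizes. The sketch below assumes an unequal-variance (Welch) t test, which is our inference from the fractional degrees of freedom and is not stated in the text; the function names are illustrative.

```python
import math


def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic and Welch–Satterthwaite degrees of freedom."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


def hedges_g(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference with the small-sample correction factor."""
    pooled_sd = math.sqrt(
        ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    )
    d = (m1 - m2) / pooled_sd
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return d * correction


if __name__ == "__main__":
    # Summary statistics as reported in the text (males first, then females).
    males = (23.13, 4.97, 1888)
    females = (20.11, 5.29, 580)
    t, df = welch_t(*males, *females)
    g = hedges_g(*males, *females)
    print(f"t({df:.1f}) = {t:.2f}, Hedges' g = {g:.2f}")
```

Because the inputs are rounded summary statistics, the result should agree with the reported t(914.4) = 12.20 and g = .60 only up to small rounding differences.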
Although most items listed in Table 1 showed minor DIF, Items 6 and 14 showed substantial gender bias. For Item 6, the M-H chi-square test was significant and the index by Educational Testing Service (ETS; Zwick, 2012) was C, indicating large DIF. The logistic regression and IRT-based results were congruent in supporting uniform DIF. Items 6 and 14 are moderately easy; 82.2% of respondents (N = 2,468) answered Item 6 correctly, and 78.3% answered Item 14 correctly. These items seem to be more difficult for females than for males with the same ability level. The two items differed by gender only in item difficulty; gender differences in the item discrimination and guessing parameters were not statistically significant. However, the result of the DTF analysis indicates that the weighted variance of DIF across the 30 items is 0.05, suggesting that the impact of the identified DIF items on the total test function might be negligible.
Table 1. Summary of the Items Identified by Three DIF Analyses.
Note. DIF = differential item functioning; M-H = the Mantel–Haenszel method; IRT = item response theory (IRT)-based method; LOR = log-odds ratio; LOR (Z) = standardized log-odds ratio; CDR = combined decision rule; ETS = the Educational Testing Service (ETS) categorization scheme (Zwick, 2012).
Table 2 shows the results of the MCFA analyses. Overall fit statistics for the one-factor structure support a good model fit for both gender groups, and the data from the women showed better fit indices than the data from the men. The chi-square difference test between the two models was significant, and the modification indices suggested that 11 out of 30 items (including Items 6 and 14) might contribute to non-invariance. However, when the measurement parameters of three items (Items 13, 15, and 16, which had the largest modification indices) were freely estimated for partial invariance, the chi-square test yielded a non-significant result.
Table 2. Tests of Measurement Invariance of the Revised PSVT:R by Gender.
Note. Revised PSVT:R = Revised Purdue Spatial Visualization Tests: Visualization of Rotations; RMSEA = root mean square error of approximation; CFI = comparative fit index; TLI = Tucker–Lewis index.
p < .001.
In practice, having measurement parameter invariance across groups on all the items in an instrument is rare (Schmitt & Kuljanin, 2008). Furthermore, current literature has not reached a consensus on the guideline for the degree of acceptable range of partial measurement invariance (Schmitt & Kuljanin, 2008). Because MCFA heavily relies on chi-square tests to evaluate measurement invariance for items, even small differences will reach statistical significance with large sample sizes. This might be the situation for the current study, as our sample size (N = 2,468) provides considerable power for the chi-square test. It seems that retaining the three items does not threaten the content and construct validity of the Revised PSVT:R. If there is a threat, it should be minimal (Byrne, Shavelson, & Muthén, 1989), because gender differences in average factor scores remain after statistically controlling the functional differences in these items.
To verify the conclusion, we conducted a small simulation study with 1,000 replications to examine the extent to which female average scores differ under two conditions: (a) the IRT item parameters of Items 6, 13, 14, 15, and 16 show DIF as reported in Table 1 (the “biased” condition) and (b) the item parameters of these items are equalized, so there is no DIF by gender (the “unbiased” condition). The average difference between the biased and unbiased female group means was 0.07 (SD = 0.14), and the mean comparison by a t test was not significant in any of the 1,000 replications. Thus, the observed gender differences on Revised PSVT:R scores were not meaningfully affected by the item bias.
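The logic of comparing a “biased” and an “unbiased” condition can be sketched as follows. This is not a reproduction of the simulation reported here: every item parameter, the size of the DIF shift, and the ability distribution are hypothetical placeholders, because the estimated values are not given in the text. The sketch also reuses the same random draws in both conditions (common random numbers) instead of running a t test on independent samples, so it only illustrates how a small difficulty shift on five of 30 items translates into a change in the mean total score.

```python
import math
import random


def p_correct(theta, a, b, c):
    """3-PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def simulate_mean_gap(n_reps=20, n_females=580, n_items=30, seed=1):
    """Per-replication gap (unbiased minus biased) in the female mean score."""
    rng = random.Random(seed)
    # Hypothetical item parameters; these are NOT the estimates in Table 1.
    a = [1.0] * n_items
    c = [0.2] * n_items
    b_unbiased = [-1.5 + 0.1 * j for j in range(n_items)]
    dif_items = [5, 12, 13, 14, 15]  # zero-based indices of Items 6, 13, 14, 15, 16
    b_biased = list(b_unbiased)
    for j in dif_items:
        b_biased[j] += 0.3           # hypothetical uniform DIF against females

    gaps = []
    for _ in range(n_reps):
        total_biased = total_unbiased = 0
        for _ in range(n_females):
            theta = rng.gauss(0.0, 1.0)  # hypothetical ability distribution
            for j in range(n_items):
                u = rng.random()         # same draw reused in both conditions
                total_biased += u < p_correct(theta, a[j], b_biased[j], c[j])
                total_unbiased += u < p_correct(theta, a[j], b_unbiased[j], c[j])
        gaps.append((total_unbiased - total_biased) / n_females)
    return gaps


if __name__ == "__main__":
    gaps = simulate_mean_gap()
    print(f"mean gap in total score: {sum(gaps) / len(gaps):.3f}")
```

Because only five items differ between the conditions, the gap in the mean total score is bounded between 0 and 5 points by construction; how close it falls to zero depends on the assumed DIF size.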
Although our findings offer support for the equitable use of the Revised PSVT:R for gender comparison, a substantial question remains: Why do some items function differently by gender? Table 3 summarizes the characteristics of these five items along with the percent correct and factor loading by gender. These items have little in common regarding item characteristics, including the shape of objects and rotations involved. Therefore, further investigation in experimental settings may be necessary to scrutinize how cognitive processes differ by gender as a function of item features (e.g., the shape of the object, the direction and angle of rotation, and the complexity of rotating tasks [single vs. multiple rotations]). Characteristics of the gender-DIF items reported in Table 3 should provide insights into such investigations. This line of research will also advance the exposition of gender differences in mental rotation ability and may contribute to identifying the source of gender differences.
Table 3. Characteristics of Items That Showed DIF by Two Evaluation Approaches.
Note. DIF = differential item functioning.
Furthermore, because we used the archival data obtained from freshmen in engineering at a public university in the Midwest, we acknowledge that the data do not represent the college population in general. Therefore, although gender differences with higher average scores by males tend to be observed in different majors (Yoon, 2011), caution is required when generalizing our results to other college populations. Finally, the current investigation only concludes that the Revised PSVT:R is unbiased for gender comparison. Further evaluation of measurement invariance across different subgroups of the population will promote the equitable use of the Revised PSVT:R, particularly for high-stakes decisions such as assignment to a remedial course.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
