Abstract
This study investigated the factorial invariance of the Taiwan Wechsler Intelligence Scale for Children, Fifth Edition (WISC-V) across age and gender. A higher order five-factor model was tested on a nationally representative sample of 1,034 children aged 6–16 years. The results demonstrated full factorial invariance for Taiwan children of different ages and gender. The WISC-V subtests demonstrated the same underlying theoretical latent constructs, strength of relations among factors and subtests, validity of each first-order factor, and communalities, regardless of age and gender, which supported the same interpretive approach of the WISC-V. These results accord with findings in the United States, indicating a full factorial invariance of the WISC-V five-factor structure across ages and gender.
The Wechsler Intelligence Scales are among the most widely used instruments worldwide for their psychometric properties and practical relevance (Archer et al., 2006; Benson et al., 2019; Bowden, 2013; Camara et al., 2000; Georgas et al., 2003; Groth-Marnat & Wright, 2016; Niileksela & Reynolds, 2019; Rabin et al., 2005). Wechsler Intelligence Scales are frequently used for psychological and educational assessments (Flanagan & Alfonso, 2017; Sattler et al., 2016; Weiss et al., 2016, 2019). This wide use relies on the assumption that Wechsler Intelligence Scale scores have the same meaning for examinees in various subpopulations, such as age, gender, clinical status, or culture. Therefore, investigating the measurement invariance of Wechsler Intelligence Scales is crucial.
Invariance is a fundamental property. Lack of evidence for measurement invariance hinders the ability of the measure to be used in comparisons between groups (American Educational Research Association et al., 2014). The Wechsler Intelligence Scale for Children–Fifth Edition (WISC-V; Wechsler, 2014a) is the latest edition of the Wechsler tests of child intelligence, and the adapted Taiwanese WISC-V was recently published (Wechsler, 2018a). The WISC-V includes considerable changes from the previous version. The most substantial modification is the new five-factor scoring framework, with Index scores for Verbal Comprehension (VCI), Visual Spatial (VSI), Fluid Reasoning (FRI), Working Memory (WMI), and Processing Speed (PSI). There is considerable evidence to support this five-index structure (Reynolds & Keith, 2017; Wechsler, 2014b, 2018b; Weiss et al., 2013a, 2013b), although a four-factor structure, with VSI and FRI subtests combined, has also been supported (e.g., Canivez et al., 2017). The consistency of measurement of this new structure across various subpopulations warrants investigation (Canivez & Watkins, 2016).
Among all possible subpopulations, age and gender are considered fundamental for measurements in various domains (Byrne et al., 1993; Cheng & Watkins, 2000; Emerson et al., 2017; Shogren et al., 2018). Studies frequently combine data from examinees of different ages and gender, but nonetheless, age and gender invariance are essential issues pertaining to WISC-V applicability.
Thus far, studies in the U.S. population have suggested that WISC-V constructs were measured equivalently across ages (Reynolds & Keith, 2017). Two reports were identified that addressed whether the WISC-V constructs were measured equivalently across genders, one from the United States and one from Germany. Both studies tested the WISC-V higher order five-factor structure in large, nationally representative samples. In 2015, Chen, Zhang, Raiford, Zhu, and Weiss reported full factorial invariance across genders in the United States; however, only partial factorial invariance was shown in Germany (Pauls et al., 2019). Pauls et al. (2019) reported invariance across genders on factor pattern, factor loadings, residuals, and disturbances, but subtest intercept invariance was established for only 11 of 15 subtests. Given the same assumed score on latent factors, estimated subtest intercepts in German boys were slightly higher for the Information and Figure Weights subtests, but lower for Coding and Cancelation. Given these discrepancies in findings, research in other cultures is needed for further clarification.
This is the first study from Asia to investigate the degree of age and gender invariance in the WISC-V. We evaluated whether the WISC-V subtests measured latent abilities in the same manner for children of different ages and genders in Taiwan. Furthermore, we considered how the results from Taiwan compared with findings from other nations.
Method
Participants
We analyzed the most updated Taiwan WISC-V standardization responses from 1,034 children (539 boys and 495 girls). This nationally representative sample was divided into 11 age groups from ages 6 to 16 years, with 94 children in each age group. To investigate age invariance we grouped the 1,034 children into four age bands: 6–8 years (n = 282), 9–11 years (n = 282), 12–14 years (n = 282), and 15–16 years (n = 188). This age grouping was chosen to reflect both the developmental stages and the appropriate sample size needed for reliable confirmatory factor analysis. In Taiwan children aged 6–11 years are in elementary school; aged 12–14 years are in junior high school; and aged 15–16 years are in senior high school. The standardization sample was carefully selected to match the 2017 census of Taiwan on geographic region, gender, and parental education level. A detailed description of this sample is provided in the Taiwan WISC-V manual (Wechsler, 2018b).
Instrumentation
The Taiwan WISC-V has the same 10 primary subtests and six secondary subtests for IQ assessment as the U.S. version. The 10 primary subtests are Similarities (SI), Vocabulary (VC), Block Design (BD), Visual Puzzles (VP), Matrix Reasoning (MR), Figure Weights (FW), Digit Span (DS), Picture Span (PS), Coding (CD), and Symbol Search (SS). The six secondary subtests are Information (IN), Comprehension (CO), Picture Concepts (PC), Arithmetic (AR), Letter–Number Sequencing (LN), and Cancelation (CA). All composites and subtests have demonstrated suitable reliability, with average internal consistency reliability estimates ranging from .85 to .96 for composites and .72 to .91 for subtests (Wechsler, 2018b, p. 93). We employed all 16 subtests to investigate latent abilities (Keith et al., 2016), and for proper comparison with previous findings.
Data Analysis
The tests employed to measure invariance were based on the analysis of covariance structure models using LInear Structural RELationships (LISREL, version 8.8; Jöreskog & Sörbom, 2006). The normality assumption of each subtest was verified. In all four age bands, univariate skewness ranged from −.73 to .52 and kurtosis ranged from −.69 to 1.76. In boys and girls, skewness ranged from −.42 to .03 and kurtosis ranged from −.53 to .61. Maximum likelihood estimation was used for model estimation (Hu & Bentler, 1998; West et al., 1995).
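The normality screening described above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the function names are hypothetical, the moment-based formulas are one common definition of skewness and excess kurtosis, and the cutoffs (|skewness| < 2, |kurtosis| < 7) are an assumed rule of thumb often cited alongside West et al. (1995). The source does not state which cutoffs were applied.

```python
def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness (0 for a symmetric distribution)."""
    m2 = central_moment(values, 2)
    return central_moment(values, 3) / m2 ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3 (0 for a normal distribution)."""
    m2 = central_moment(values, 2)
    return central_moment(values, 4) / m2 ** 2 - 3.0


def roughly_normal(skew, kurt, max_abs_skew=2.0, max_abs_kurt=7.0):
    """Assumed rule of thumb; the cutoffs are illustrative, not from the source."""
    return abs(skew) < max_abs_skew and abs(kurt) < max_abs_kurt


# Most extreme values reported in the text across age bands and genders.
worst_skew = max(abs(s) for s in (-0.73, 0.52, -0.42, 0.03))
worst_kurt = max(abs(k) for k in (-0.69, 1.76, -0.53, 0.61))
print(roughly_normal(worst_skew, worst_kurt))
```

Under these assumed cutoffs, the reported extremes fall well inside the range usually treated as compatible with maximum likelihood estimation.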
Before invariance analysis, we tested the corresponding five-factor baseline model for each age and each gender group. The five-factor structure reported in the WISC-V Technical and Interpretive Manual (Wechsler, 2014b, p. 83) was used as the hypothesized baseline model. This baseline model specified a higher order g and five first-order factors. The Arithmetic subtest was allowed to be cross-loaded on the FRI, WMI, and VCI factors. We recognize that other models are possible and have been supported in the literature (e.g., Canivez et al., 2017; Reynolds & Keith, 2017). The model used here is the one that guided the scoring of the Taiwan WISC-V, however, and thus seems an important model for invariance testing. Future research may compare this with other plausible models. This five-factor structure is illustrated in Figure 1.
Figure 1. Final standardized estimations on the 16 subtests.
Note. Chi-square = 318.28, df = 97, p value = .00000, RMSEA = 0.047. VCI = Verbal Comprehension Index; SI = Similarities; VC = Vocabulary; IN = Information; CO = Comprehension; VSI = Visual Spatial Index; BD = Block Design; VP = Visual Puzzles; FRI = Fluid Reasoning Index; MR = Matrix Reasoning; FW = Figure Weights; PC = Picture Concepts; AR = Arithmetic; WMI = Working Memory Index; DS = Digit Span; PS = Picture Span; LN = Letter-Number Sequencing; PSI = Processing Speed Index; CD = Coding; SS = Symbol Search; CA = Cancelation; RMSEA = root mean square error of approximation.
Factorial invariance was examined by testing six levels of nested models (Keith, 2019; Meredith, 1993; Vandenberg, 2002; Wicherts & Dolan, 2010). Different authors suggest different orders for testing invariance; here we have used a variation of the steps suggested by Keith (2019) to test the construct validity of the studied WISC-V scoring model.
The initial level was configural invariance, which assumed the same number of factors and overall factor patterns across groups. The second level was first-order factor-loading invariance (or metric/weak invariance). When factor loadings are equal, the scales of the latent variables are the same and the unit of measurement is identical. For each unit change in the latent variable, the scores on the subtests increase by the same amount for all groups. The third level was intercept invariance (or scalar/strong invariance). For this level, it is assumed that subtests have the same intercepts across groups, while allowing differences in latent factor means. If achieved, this level of invariance shows that the scales start at the same point. The subtests within factors are no more difficult for one group versus another. Instead, all subtest mean differences across groups are the result of differences in the factor means. The fourth level was residual (or strict) invariance. Residuals are a combination of subtest-specific unique variance and measurement error. If strict invariance is achieved, it can be assumed that all group differences in the measured variables are completely explained by group differences in the latent factors. The fifth level was second-order factor-loading invariance. For this level, the first-order latent factors are assumed to demonstrate the same amount of change in each group for the same increase in g. If achieved, the second-order factor, g, has the same meaning across groups. Finally, we tested the invariance of disturbances (factor unique variances) of the first-order factors. If achieved, this level of invariance shows that the unique aspects of the first-order factors (that is, the variance not explained by g) are the same across groups. For all analyses, we identified the scale of latent factors by fixing one factor loading for each factor to a value of one.
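The six levels described above can be summarized in a compact form. The notation below is assumed for illustration and does not appear in the source: k indexes the groups (age bands or genders), x is the vector of 16 subtest scores, η the five first-order factors, and g the second-order factor.

```latex
% Assumed notation; a sketch of a higher order multigroup model.
\begin{align}
  \mathbf{x}^{(k)} &= \boldsymbol{\tau}^{(k)} + \boldsymbol{\Lambda}^{(k)} \boldsymbol{\eta}^{(k)} + \boldsymbol{\varepsilon}^{(k)},
    & \operatorname{Cov}\bigl(\boldsymbol{\varepsilon}^{(k)}\bigr) &= \boldsymbol{\Theta}^{(k)}, \\
  \boldsymbol{\eta}^{(k)} &= \boldsymbol{\alpha}^{(k)} + \boldsymbol{\gamma}^{(k)} g^{(k)} + \boldsymbol{\zeta}^{(k)},
    & \operatorname{Cov}\bigl(\boldsymbol{\zeta}^{(k)}\bigr) &= \boldsymbol{\Psi}^{(k)}.
\end{align}

% Constraints are added cumulatively, one level at a time:
\begin{align*}
  \text{1. Configural:} &\quad \text{same pattern of fixed and free elements in } \boldsymbol{\Lambda}^{(k)} \text{ and } \boldsymbol{\gamma}^{(k)} \\
  \text{2. First-order loadings (metric):} &\quad \boldsymbol{\Lambda}^{(k)} = \boldsymbol{\Lambda} \\
  \text{3. Intercepts (scalar):} &\quad \boldsymbol{\tau}^{(k)} = \boldsymbol{\tau} \\
  \text{4. Residuals (strict):} &\quad \boldsymbol{\Theta}^{(k)} = \boldsymbol{\Theta} \\
  \text{5. Second-order loadings:} &\quad \boldsymbol{\gamma}^{(k)} = \boldsymbol{\gamma} \\
  \text{6. Disturbances:} &\quad \boldsymbol{\Psi}^{(k)} = \boldsymbol{\Psi}
\end{align*}
```

In this sketch, the factor means (the α terms) remain free to differ across groups from Level 3 onward, which is what allows group differences in subtest means to be attributed to the latent factors rather than to the subtests themselves.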
Furthermore, because all factor models are derived from variance–covariance matrices, these matrices were also tested for equivalence across groups. If the matrices are the same, the instrument measures the same construct across groups (Keith, 2019).
Multiple indices of model fit were used to evaluate and compare the models (Bentler & Bonett, 1980; Hoyle & Panter, 1995; Hu & Bentler, 1998, 1999; Kline, 2016; Marsh et al., 1988; McDonald & Ho, 2002). Single models were evaluated using two indexes, the comparative fit index (CFI) and the root mean square error of approximation (RMSEA). CFI values close to or >.95 and RMSEA values close to or <.05 indicated a good fit. RMSEA values close to or <.08 were considered acceptable. Changes in the chi-square (Δχ2) value were used to evaluate competing nested models (Bentler & Bonett, 1980). The Akaike information criterion (AIC) and the more parsimonious sample-size-adjusted Bayesian information criterion (aBIC) were used to compare competing nested and non-nested models (Kaplan, 2000; Loehlin, 2004), with lower values indicating a superior fit. There is little consensus regarding the most appropriate criterion for determining evidence of invariance. As recommended by Keith (2015, 2019), both traditional perspectives (Δχ2) and practical perspectives (differences in the comparative fit index; ΔCFI) were jointly evaluated. The Δχ2 test is oversensitive to sample size and deviations from normality (Kline, 2016; West et al., 1995). Cheung and Rensvold (2002) recommended ΔCFI as superior to Δχ2 because of its independence from model complexity, sample size, and overall fit measures. An absolute ΔCFI value (|ΔCFI|) >.01 was proposed as an indicator of a meaningful decrease in fit. This criterion is commonly used in research. Given the large sample size, the large number of modeled variables, and the number of comparisons being made in this study, the criterion for rejecting the null hypothesis of invariance was set as p < .001 in the Δχ2 test and |ΔCFI| >.01.
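The fit indices and the decision rule described above can be illustrated with a short sketch. The function names are hypothetical, the formulas are the standard textbook definitions of RMSEA (using the N − 1 convention, which may differ slightly from what a given program reports) and CFI, and the decision function reads the stated criterion as requiring both conditions to hold before invariance is rejected. The only study values used are the chi-square, degrees of freedom, and sample size reported for the full-sample model in the note to Figure 1.

```python
import math


def rmsea(chi2, df, n):
    """RMSEA point estimate, assuming the N - 1 convention."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_null, df_null):
    """Comparative fit index from model and null (independence) chi-squares."""
    d_model = max(chi2_model - df_model, 0.0)
    d_null = max(chi2_null - df_null, d_model, 0.0)
    return 1.0 if d_null == 0 else 1.0 - d_model / d_null


def invariance_rejected(p_delta_chi2, delta_cfi, alpha=0.001, cfi_cut=0.01):
    """Reject only when the chi-square difference test AND |delta CFI| agree."""
    return p_delta_chi2 < alpha and abs(delta_cfi) > cfi_cut


# Full-sample model from the note to Figure 1: chi-square = 318.28, df = 97, N = 1,034.
print(round(rmsea(318.28, 97, 1034), 3))  # 0.047, matching the reported RMSEA

# Hypothetical comparison: a significant chi-square change but no change in CFI.
print(invariance_rejected(p_delta_chi2=0.0001, delta_cfi=0.0))  # False
```

The second call mirrors the situation in which Δχ2 signals a loss of fit while ΔCFI does not; under the joint criterion, invariance is retained.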
Results
Across-Age Factorial Invariance
Fit of Factorial Invariance Models Across Four Age Bands.
Note. CFI = comparative fit index; RMSEA = root mean square error of approximation; CI = confidence interval; AIC = Akaike information criterion; aBIC = adjusted Bayesian information criterion.
First, the configural model (Model 2) was used to test the nested models and demonstrated a good fit to the data. Children of different ages shared the same WISC-V higher order five-factor patterns, and corresponding subtests loaded on the same factors. With the factor pattern established, we imposed cross-group constraints on the first-order factor loadings (Model 3). The addition of the first-order factor loading constraints reduced the fit according to Δχ2. However, according to ΔCFI the model continued to fit as well as the configural model. The ΔCFI value was 0, suggesting that the first-order factor loadings were equal across ages. We then constrained the subtest intercepts to be equal, while allowing the factor means to vary (Model 4). There was no deterioration of fit with these constraints according to ΔCFI, suggesting that the subtests had the same intercepts across ages. This result is expected to some degree given that all subtests are age standardized. Subsequently constraining the subtest residuals to be equal across groups (Model 5) slightly reduced the fit, but the ΔCFI value did not exceed .01, suggesting an acceptable fit. When structural parameters (second-order loadings and first-order unique variances) were further constrained to be equal between groups (Models 6 and 7), no practical deterioration of fit was observed in either Δχ2 or ΔCFI. Taken together, these sequential tests revealed that the WISC-V measures the same constructs across ages.
Across-Gender Factorial Invariance
Fit of Factorial Invariance Models Across Genders.
Note. CFI = comparative fit index; RMSEA = root mean square error of approximation; CI = confidence interval; AIC = Akaike information criterion; aBIC = adjusted Bayesian information criterion.
There is one more point that deserves attention. For the cross-gender invariance testing, the intercept invariance step constrained factor means to zero for one group only. Correspondingly, two sets of meaningful information could be identified at this level of testing. The major finding was a lack of bias for the WISC-V subtests as measures of the five underlying constructs for Taiwanese children, which supported construct validity. The minor finding was that when the means of the five latent factors in boys were fixed to zero, the nonstandardized latent means (with corresponding standard errors [SEs]) for the VCI, VSI, FRI, WMI, and PSI factors in girls were estimated freely as −.41 (.16), −.19 (.17), −.50 (.15), .08 (.17), and .46 (.16), respectively. The three significant differences show that boys tended to perform slightly higher on VCI and FRI, whereas girls performed slightly higher on PSI (cf. Keith et al., 2008). Yet the differences were small, approximately one-sixth of a standard deviation (given the scaling of the subtests used as indicators of the factors).
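The statement that three of the five latent mean differences are significant can be checked from the reported estimates and standard errors. The sketch below makes two assumptions that are not stated in the source: that significance was judged with a two-tailed z ratio at the .05 level (critical value 1.96), and that the "one-sixth of a standard deviation" figure comes from dividing by the subtest scaled-score standard deviation of 3. The second step is only a rough approximation, because a latent factor does not have exactly the variance of its reference subtest.

```python
# Girls' latent means (boys fixed to zero) with standard errors, as reported.
latent_means = {
    "VCI": (-0.41, 0.16),
    "VSI": (-0.19, 0.17),
    "FRI": (-0.50, 0.15),
    "WMI": (0.08, 0.17),
    "PSI": (0.46, 0.16),
}

CRITICAL_Z = 1.96        # assumed two-tailed .05 criterion
SUBTEST_SD = 3.0         # assumed scaled-score SD used for rough standardization

# z ratio = estimate / standard error
z_ratios = {name: est / se for name, (est, se) in latent_means.items()}
significant = [name for name, z in z_ratios.items() if abs(z) > CRITICAL_Z]

# Rough effect size in subtest standard deviation units
sd_units = {name: est / SUBTEST_SD for name, (est, _) in latent_means.items()}

print(significant)  # ['VCI', 'FRI', 'PSI']
for name in significant:
    print(name, round(z_ratios[name], 2), round(sd_units[name], 2))
```

Under these assumptions the largest difference, on FRI, is about .17 of a subtest standard deviation, which is consistent with the "approximately one-sixth" description.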
The Standardized Estimates and g-Loadings Based on the Entire Sample
Results showed that the WISC-V hierarchical five-factor structure is invariant across age and gender, and thus it is valid to analyze the data across these grouping variables. Standardized estimates based on the entire norm sample (N = 1,034) are displayed in Figure 1. All 16 subtests loaded strongly on the corresponding factors. Arithmetic was a mixed measure of the FRI, WMI, and VCI factors as expected (factor loadings were .35, .24, and .22, respectively). Across all five first-order factors, FRI had the highest g-loading (.98). All parameter estimates were theoretically reasonable.
Loadings of WISC-V Subtests on the Second-Order g Factor.
Discussion
We conducted this study to determine the invariance of the WISC-V constructs across a large sample of Taiwanese children of different ages and genders. The first and most critical finding was that the higher order five-factor scoring model fit the data from different ages and gender well, and demonstrated construct validity with full factorial invariance. For Taiwan children of different ages and genders, the WISC-V subtests demonstrated the same underlying theoretical latent constructs, strength of relationships among factors and subtests, validity of each first-order factor, and communalities. Invariant results provide evidence that WISC-V index scores and subtests have the same meaning across age and gender groups. Therefore, WISC-V results for boys and girls and for children of different ages can be interpreted in the same manner, and comparisons between ages and genders can be considered meaningful.
Second, several studies have reported the mixed loadings of the Arithmetic subtest (Chen et al., 2009; Weiss et al., 2013a, 2013b). Our findings further demonstrated that these cross loadings exist across ages and genders. For Taiwanese children, performance on the Arithmetic subtest was influenced primarily by FRI, with a small additional influence by VCI and WMI abilities.
Third, among all five factors, FRI loaded the highest on g (.98), which resulted in a nonstatistically significant unique factor variance (t = 1.35) as expected (Bickley et al., 1995; Gustafsson, 1984; Keith et al., 2006). Among all 16 subtests, Arithmetic had the highest g-loading (.73), followed by Digit Span (.71), Letter–Number Sequencing (.69), Visual Puzzles (.69), and Figure Weights (.69). The Cancelation subtest had the lowest g-loading (.35). These findings are similar to previous reports in the United States (Reynolds & Keith, 2017; Sattler et al., 2016; Wechsler, 2014b, p. 84). The new WISC-V subtests, Visual Puzzles and Figure Weights, displayed high g-loadings of .69 in Taiwan, suggesting that these new subtests are valid indicators of general intelligence.
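The link between the FRI g-loading and its small unique variance follows from the standardized solution, in which a first-order factor's variance is split between g and its disturbance. As a brief worked step (the symbols are assumed notation, not the source's):

```latex
% Standardized solution: Var(FRI) = 1 = \gamma_{\mathrm{FRI}}^{2} + \psi_{\mathrm{FRI}}
\begin{equation*}
  \psi_{\mathrm{FRI}} = 1 - \gamma_{\mathrm{FRI}}^{2} = 1 - (0.98)^{2} = 0.0396
\end{equation*}
```

Only about 4% of the FRI factor variance is left unexplained by g, so it is unsurprising that this disturbance does not differ significantly from zero.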
Interestingly, studies based on U.S. WISC-V data generally indicate that Arithmetic and Vocabulary are the top measures of g (Chen et al., 2015; Reynolds & Keith, 2017). For the Taiwan sample, Arithmetic had the highest g-loading as expected; however, g-loadings for various verbal subtests (.63–.67), although still high, were not among the highest of the 16 subtests. When considering the rank orderings, five subtests had higher g-loadings than any verbal subtest (Arithmetic, Digit Span, Letter–Number Sequencing, Visual Puzzles, and Figure Weights). Although the magnitude of the differences in these g-loadings was small, observation of such discrepant rank order patterns between children from Taiwan and the United States suggests possible cultural differences in the relation between g and other abilities (Kvist & Gustafsson, 2008). Future research should investigate such cross-cultural comparisons.
Finally, it is useful to compare the results of this gender invariance study with those from the United States and Germany. Although incongruent with the results of partial gender invariance from Germany (Pauls et al., 2019), our results supported the findings in the United States (Chen et al., 2015). Both the Taiwan and U.S. standardization data indicated full factorial invariance between boys and girls on the WISC-V higher order five-factor structure. In all three cultures, Arithmetic was cross-loaded on more than one factor; however, slightly different weightings were identified. In Taiwan, Arithmetic primarily loaded on FRI, whereas it loaded primarily on WMI in Germany. Interestingly, in an alternative model this subtest was determined to be primarily loaded on g in the United States (Reynolds & Keith, 2017), a finding not inconsistent with it loading on multiple first-order factors. Arithmetic was identified as the most cognitively complex task in a WISC-V multidimensional scaling analysis conducted by Meyer and Reynolds (2017). Keith and Reynolds (2010) also suggested that people may utilize different strategies when answering Arithmetic questions. The current results indicate that there may also be cultural differences on this issue, which deserves further exploration. Results from Taiwan and the United States revealed a similar pattern of gender factor differences (these results were not reported in the German study). Generally, boys scored slightly higher on verbal comprehension tasks, whereas girls scored higher on processing speed. Comparatively, the degree of group discrepancies on VCI and FRI seems larger in Taiwan, whereas the degree of gender differences on PSI was more salient in the United States.
This research is not without limitations. In particular, we tested only one possible factor structure, which is most closely aligned with the intended structure and the scoring of the test. It is also noteworthy that we compared variance–covariance matrices across groups, as well, and these also showed invariance. Because any factor structure is derived from such matrices, these findings also suggest invariance exists independent of the chosen factor model. Nevertheless, future research should compare alternative structures, as well. Given invariance as shown here, research comparing alternative structures can use combined age and gender data.
In conclusion, our findings support full factorial invariance of the WISC-V across ages and genders. The meaning of each WISC-V subtest and factor-based composite was generally identical for children of different ages and genders. Therefore, WISC-V scores can be meaningfully interpreted in the same manner. More invariance reports from different cultures and subpopulations (e.g., clinical groups) are encouraged.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
