Abstract
The present study investigated measurement invariance across gender on the German Wechsler Intelligence Scale for Children–Fifth Edition (WISC-V). The higher order model that was preferred by the test publishers was tested on a population-representative German sample of 1,411 children and adolescents aged between 6 and 16 years. Confirmatory factor analyses were conducted to test for measurement invariance. Once partial scalar invariance was established by freeing nonequivalent subtest intercepts, results demonstrated that 11 out of 15 subtest scores have the same meaning for male and female children. These findings support interpretable comparisons of the WISC-V test scores between males and females, but only in due consideration of partial scalar invariance and with respect to the underlying factor structure. However, results did not support the overall structural validity of the higher order model. Thus, replacing the former Perceptual Reasoning factor by Fluid Reasoning and Visual Spatial may be considered inappropriate due to the redundancy of the Fluid Reasoning Index (FRI) as a separate factor. Results also indicated that the WISC-V provides stronger measurement of general intelligence (Full Scale IQ) than measurements of cognitive subdomains (WISC-V indexes). Interpretative emphasis should thus be placed on the Full Scale IQ rather than the WISC-V indexes.
Keywords
Intelligence is well known as one of the best investigated psychological constructs and the Wechsler Intelligence Scales are among the most commonly used diagnostic instruments due to their psychometric properties and clinical relevance (Archer, Buffington-Vollum, Stredny, & Handel, 2006; Groth-Marnat, 2009). Like other prominent intelligence tests, such as the Kaufman Assessment Battery for Children–Second Edition (KABC-II; Kaufman & Kaufman, 2004), Wechsler Intelligence Scales claim to reflect conceptualizations of intellectual measurement as described in the Cattell–Horn–Carroll theory of intelligence (McGrew, 2005; Schneider & McGrew, 2012). As part of the long tradition of Wechsler scales, the fourth edition of the Wechsler Intelligence Scale for Children (WISC-IV; Wechsler, 2003) introduced a four-factor structure (Verbal Comprehension, Perceptual Reasoning, Working Memory, and Processing Speed), which has long been regarded as more in line with research on patterns of intellectual behavior (e.g., Donders & Warschausky, 1996; Konold, Kush, & Canivez, 1997). This four-factor structure has not only been supported by the test publishers in the standardization sample of the WISC-IV (Wechsler, 2003) but has also been demonstrated by independent researchers in samples of healthy referred (e.g., Nakano & Watkins, 2013; Watkins, Wilson, Kotz, Carbone, & Babula, 2006), clinical (e.g., Devena, Gay, & Watkins, 2013), and special subpopulations (e.g., Styck & Watkins, 2014). Additionally, the WISC-IV factor structure was found to be invariant across gender (H. Chen & Zhu, 2008), different ages (Keith, Fine, Taub, Reynolds, & Kanzler, 2006), and clinical versus nonclinical groups (H. Chen & Zhu, 2012).
The recent Wechsler Intelligence Scale for Children–Fifth Edition (WISC-V; Wechsler, 2014a) represents a major revision of the WISC-IV as it incorporates many significant changes. Most significantly, the WISC-V redefines the four factors of the WISC-IV into a new five-factor scoring framework comprising the following indexes: Verbal Comprehension (VCI), Visual Spatial (VSI), Fluid Reasoning (FRI), Working Memory (WMI), and Processing Speed (PSI). Previous studies have already attempted to support the validity of the five-factor model structure both in normative (e.g., Keith et al., 2006) and clinical samples (e.g., Weiss, Keith, Zhu, & Chen, 2013). As part of the standardization procedure, the WISC-V internal structure has also been examined in light of the accumulated factor-analytical evidence for the WISC-IV and the inclusion of new subtests (Wechsler, 2017b). In this regard, different factor structures were hypothesized and tested to understand the complex nature and number of factors necessary to explain how the WISC-V subtests interrelate. The development of the WISC-V was predicated on the test publishers’ theoretical assumption that the overall scale provides an estimate of general cognitive ability that manifests itself in five cognitive subdomains. This corresponds to a model characterized by five first-order factors (primary WISC-V indexes) and one second-order factor, the Full Scale IQ (FSIQ). To examine this assumption empirically, the test publishers conducted confirmatory factor analyses using maximum likelihood estimation to compare alternative model solutions and to identify the factor structure that best accounted for the normative data. Each of the five alternative models was specified in terms of the number of factors included, its hierarchical structure, and its subtest intercorrelations as well as subtest loadings on the hypothesized factors. 
The models were then analyzed in sequence from simple (e.g., including two first-order factors) to complex (e.g., including five factors) so that the improvement in fit obtained from the increase in complexity could be evaluated statistically. The confirmatory factor analyses provided in the test manual indicate that most of the five-factor model solutions had a significantly better fit to the data than comparable four-factor model solutions. Finally, the second-order five-factor model was preferred and selected by the test publishers on the basis of goodness-of-fit statistics to best represent the WISC-V test structure (see Wechsler, 2017b, for a detailed description).
However, it is worth noting that there is still an ongoing controversial debate on whether a second-order five-factor or a four-factor model structure should be used to best describe performances on the WISC-V subtests. Thus, the overall WISC-V model structure proposed in both the German and the U.S. version of the WISC-V Technical and Interpretive Manual (Wechsler, 2014b, 2017b) is still not unanimously regarded as the best model solution (e.g., Canivez, Watkins, & Dombrowski, 2016, 2017; Dombrowski, Canivez, & Watkins, 2018). Consequently, while the second-order five-factor model proposed by the test publishers may represent the scores derived on the WISC-V, it is not yet clear whether the five primary indexes represent actual attributes that exist outside of the WISC-V scores. In this regard, it must be pointed out that the second-order five-factor structure was chosen as the baseline model not because it is suggested to be the best model solution, but rather because the present study was mostly based on the original standardization data set that has already been applied by the test publishers. Although the present study does not aim to compare different models in order to find the model structure that best fits the empirical data, some issues concerning the underlying WISC-V model are addressed in the discussion section later on.
Besides the examination of hypothesized model structures, another crucial aspect of a test’s validity is the measurement invariance of any diagnostic instrument that allows for comparisons between individuals from different subpopulations. More specifically, this means that interpretable comparisons of statistics such as means and regression coefficients can only be made if the relevant measures are comparable across different groups. In the case of the WISC-V, measurement invariance implies that the underlying measures of intelligence are grounded on the same theoretical structure for the subgroups under examination. Whenever diagnosticians use Wechsler Intelligence Scales for research purposes or in clinical practice, test scores are implicitly assumed to have the same meaning for individuals from various subgroups. However, this crucial assumption first has to be justified on a statistical basis, which makes testing for measurement invariance of the WISC-V indispensable. Measurement invariance is also a major prerequisite for test fairness, as invariant scores ensure that interindividual differences in test performance reflect reliable differences in the measured intellectual dimensions (F. F. Chen, Sousa, & West, 2005). Otherwise, a lack of evidence for a specific invariant WISC-V dimension would substantially compromise the comparability in meaning of the corresponding index across different groups (see American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014, for an overview; F. F. Chen et al., 2005; Millsap & Kwok, 2004).
Among various possible subgroup comparisons, invariant scores across gender are for the most part recognized as fundamental for numerous domain-specific measurements (e.g., Atienza, Balaguer, & Garcia-Merita, 2003; Richardson, Huan, Ege, Suh, & Rice, 2014; Rusticus & Hubley, 2006). This is because gender-invariant measures are an essential issue whenever empirically derived data from both genders are combined and treated as if they were collected from a single population. Even though gender invariance has already been shown for the U.S. version of the WISC-V (H. Chen, Zhang, Raiford, Zhu, & Weiss, 2015), evidence is still needed to prove that the recent German adaptation of the WISC-V is not biased against gender as well. If this is true, any given gender difference based on the German WISC-V may in general be considered authentic. Apart from collecting and analyzing data from the German population, it is worth pointing out that the German adaptation also included further significant changes compared with the U.S. version of the WISC-V (WISC-V USA). These changes included the translation of the entire set of verbal items, the exclusion of complementary scores, as well as several modifications to verbal contents (see Wechsler, 2017a, for a detailed description).
The most frequently used technique for testing measurement invariance across different groups is a multigroup confirmatory factor analysis (MGCFA). In an MGCFA, a theoretical model is compared with the observed structures in more than one independent sample. Before addressing a typical sequence of model testing, one crucial distinction has to be made between measurement invariance models and structural invariance models. Measurement invariance needs to be tested to ensure meaningful between-group comparisons, and measurement invariance models assess invariance of constructs, factor loadings, item intercepts, and error variances. In contrast to measurement invariance, structural invariance models that assess invariance of variances, covariances, and means of the latent constructs should be tested only if considered theoretically meaningful (see Vandenberg & Lance, 2000, for an overview). To test for measurement invariance, nested models are sequentially analyzed with an increasing number of constraints in each subsequent model (see Jöreskog, 1993; Pauls, Petermann, & Lepach, 2013b, for an overview). Given that the five WISC-V primary indexes (first-order factors) are considered to be substantially correlated with each other, the test publishers suggest a hierarchical second-order five-factor model to likely represent the structure of the WISC-V (e.g., F. F. Chen et al., 2005). The FSIQ is hypothesized to account for the correlations among these WISC-V primary indexes, thus representing a second-order factor.
Even though conducting an MGCFA on a first-order factor model is a more common technique, analyzing second-order factor models has several potential advantages (F. F. Chen et al., 2005). Since it allows testing whether a hypothesized second-order factor actually accounts for the pattern of relations among first-order factors, a second-order factor model may explain the covariance in a more parsimonious way with fewer parameters (Gustafsson & Balke, 1993; Rindskopf & Rose, 1988). Compared with first-order factor models with correlated factors, second-order factor models may at first sight provide a useful simplification of complex measurement structures as implied for the WISC-V (e.g., Eid, Lischetzke, Nussbeck, & Trierweiler, 2003). However, Rijmen (2010) points out that there are also some methodological limitations to second-order factor models in general, such as a restricted predictability of first-order factors for external criteria. Moreover, second-order factor models may easily be flawed by overfactoring (Frazier & Youngstrom, 2007), thus including empirically redundant factors and abandoning the parsimony of simple structures (Le, Schmidt, Harter, & Lauver, 2010). Since bifactor models have already fit data well from other Wechsler scales (Canivez, 2014; Gignac & Watkins, 2013; Nelson, Canivez, & Watkins, 2013; Watkins & Beaujean, 2014), various researchers have suggested a bifactor model to offer a better representation of the structure of intelligence than a higher order factor model (e.g., Alexandre, Morin, Arens, Antoine, & Hervé, 2015; Beaujean, 2015; Brunner, Nagy, & Wilhelm, 2012; Canivez, 2016; Gignac, 2005, 2006, 2008). A higher order factor model describes general intelligence as a hierarchical construct fully mediated by the lower order factors and only indirectly influencing the subtest indicators, whereas a bifactor model portrays general intelligence as a broad factor with direct effects on the subtest indicators. 
Consequently, bifactor modeling enables the estimation not only of model-based reliability but also of the proportion of variance due to a general factor such as the FSIQ and lower order factors such as the primary WISC-V indexes. Another useful technique to facilitate the interpretation of model-based reliability and variance proportions in higher order factor models is the Schmid and Leiman procedure (SL procedure; Schmid & Leiman, 1957). The SL procedure can be defined as a reparameterization of a higher order factor model as it orthogonalizes all first- and second-order factors included to allow for the interpretation of their relative impact on each indicator. Because subtest scores reflect both the first- and second-order variance, the second-order factor variance has to be extracted first in order to residualize the variance from the first-order factors, thus leaving the first-order factors orthogonal to the second-order factor (Carroll, 2003; Dombrowski et al., 2018). Estimating the unique proportions of variance according to an orthogonalized higher order factor model then makes it possible to determine how much interpretative emphasis should be placed on the second-order factor and the first-order factors.
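The orthogonalization described above can be sketched in a few lines. The following is a minimal illustration, assuming standardized loadings; the function name `schmid_leiman` and all loading values are hypothetical and are not taken from the WISC-V data or from the programs used in this study.

```python
import math

def schmid_leiman(first_order, second_order):
    """Orthogonalize a second-order factor model (illustrative sketch).

    first_order  : rows = subtests, columns = first-order factors
                   (standardized loadings).
    second_order : standardized loadings of each first-order factor
                   on the second-order factor.
    Returns (general, residual): loadings of each subtest on the
    general factor and on the residualized first-order factors.
    """
    general, residual = [], []
    for row in first_order:
        # Indirect effect of the second-order factor on the subtest.
        general.append(sum(l * g for l, g in zip(row, second_order)))
        # First-order loading after removing second-order variance.
        residual.append(
            [l * math.sqrt(1.0 - g * g) for l, g in zip(row, second_order)]
        )
    return general, residual

# Hypothetical example: four subtests, two first-order factors.
first_order = [[0.8, 0.0], [0.7, 0.0], [0.0, 0.6], [0.0, 0.5]]
second_order = [0.9, 0.8]
general, residual = schmid_leiman(first_order, second_order)
```

For subtests that load on a single first-order factor, the squared general and residualized loadings add up to the original squared loading, so the procedure redistributes the explained variance without changing it.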
In summary, the current study aims to clarify whether the second-order five-factor model that has been hypothesized by the test publishers is transferable across both genders. According to measurement invariance, results should then help evaluate whether the subtests of the German WISC-V measure the suggested latent dimensions in the same manner for both males and females. A second goal is to provide estimates of model-based reliability and construct replicability using the SL procedure and to disclose how much common subtest variance is due to the FSIQ and how much is due to the WISC-V indexes. This should then help evaluate the adequacy of the indexes based on how much unique variance they explain when adjusted for the effects of the FSIQ.
Method
Sample
To conduct the single-group analyses and the MGCFA in the present study, WISC-V data of 1,411 children and adolescents (males, n = 711; females, n = 700) were selected from the extended data set of the German standardization sample. The nationally representative sample was divided into a total of 11 age groups ranging from 6 to 16 years of age, with an almost balanced number of cases and slightly varying gender distributions within the age groups. This standardization sample was carefully selected to match the recent German census for major demographics including gender, age, parental education, type of school, region, and migration background. The extent to which matching procedures could be achieved was determined by Mann–Whitney and Kruskal–Wallis test statistics for nonparametric data. These test statistics indicated no gender differences in age, U = 246782.00, z = −.271, p = .786; parental education, χ2(1, N = 1,411) = 2.081, p = .149; type of school, χ2(1, N = 1,411) = 0.227, p = .634; or migration background, χ2(1, N = 1,411) = 1.253, p = .263. A detailed description of the standardization sample is provided in the corresponding German WISC-V manual (Wechsler, 2017b).
Instrumentation
The German WISC-V includes 10 primary subtests and five secondary subtests. Among the 10 primary subtests are Block Design (BD), Similarities (SI), Matrix Reasoning (MR), Digit Span (DS), Coding (CD), Vocabulary (VC), Figure Weights (FW), Visual Puzzles (VP), Picture Span (PS), and Symbol Search (SS). Concerning reliability, all primary subtests have demonstrated good to excellent internal consistency coefficients with Cronbach’s alpha ranging from .81 to .93. The five secondary subtests are Information (IN), Letter–Number Sequencing (LN), Cancellation (CA), Comprehension (CO), and Arithmetic (AR). Compared with the primary subtests, secondary subtests revealed slightly lower but still good internal consistency coefficients with Cronbach’s alpha ranging from .80 to .87. The secondary subtest Picture Concepts (PC), which is included in the WISC-V USA, has been neither intended for nor included in the European version of the WISC-V in general, which served as a standardization kit for the German adaptation. Therefore, all analyses in the present study were based on 15 subtests in total.
Although alternative measures of internal consistency are neither reported for the United States nor the German standardization samples of the WISC-V, omega-hierarchical and omega-hierarchical subscale coefficients have been recommended to replace Cronbach’s alpha for hierarchical model structures (Brunner et al., 2012; Reise, 2012; Sijtsma, 2009; Yang & Green, 2011). Thus, model-based reliability estimates are additionally described in detail later on and reported in the results section.
Analytical Procedures
Since the data set of the present study has already been subjected to model comparisons from which the test publishers have concluded that their confirmatory approach on the internal WISC-V test structure indicated a satisfactory model solution (see Wechsler, 2017b, for a detailed description), the second-order five-factor structure was used as the baseline model for all subsequent analyses. Prior to invariance analyses, the second-order five-factor model was tested for males and females separately in order to check for its overall fit in both subgroups. For this purpose, the five-factor structure and the underlying formal scoring procedure reported in the WISC-V Technical and Interpretive Manual (Wechsler, 2014b) were used to specify the baseline model. When taking the entire set of 15 WISC-V subtests into account, the hypothesized baseline model includes five first-order factors and one second-order factor. First-order factors are represented by the five primary WISC-V indexes, which were designed to reflect specific latent abilities: VCI indicated by four subtests (SI, VC, IN, and CO), VSI derived using two subtests (BD and VP), FRI indicated by three subtests (MR, FW, and AR), WMI composed of three subtests (DS, PS, and LN), and PSI derived using three subtests (CD, SS, and CA). In addition to FRI, the subtest AR was permitted to cross-load on the first-order factors VCI and WMI, as indicated in the WISC-V Technical and Interpretive Manual (Wechsler, 2014b), to consider further possible indicators (H. Chen, Keith, Chen, & Chang, 2009; Weiss et al., 2013). The second-order factor FSIQ was specified to account for the intercorrelations among the five first-order factors. However, correlations were neither hypothesized among disturbances of the first-order factors nor specified among residual variances of the observed variables included.
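As a compact restatement of this specification, the loading pattern can be written down as a simple data structure. This is only an illustrative sketch; the variable names are hypothetical and the structure is not AMOS input.

```python
# First-order factors and the subtests allowed to load on them,
# including the cross-loadings of Arithmetic (AR) on VCI and WMI.
BASELINE_MODEL = {
    "VCI": ["SI", "VC", "IN", "CO", "AR"],
    "VSI": ["BD", "VP"],
    "FRI": ["MR", "FW", "AR"],
    "WMI": ["DS", "PS", "LN", "AR"],
    "PSI": ["CD", "SS", "CA"],
}
SECOND_ORDER = {"FSIQ": list(BASELINE_MODEL)}  # FSIQ -> five indexes

# Unique subtests in order of first appearance.
subtests = []
for indicators in BASELINE_MODEL.values():
    for s in indicators:
        if s not in subtests:
            subtests.append(s)

# Pattern matrix: 1 = loading is part of the model, 0 = fixed to zero.
pattern = {
    s: {f: int(s in ind) for f, ind in BASELINE_MODEL.items()}
    for s in subtests
}

n_loadings = sum(sum(row.values()) for row in pattern.values())
cross_loading = [s for s, row in pattern.items() if sum(row.values()) > 1]
```

Under this pattern, the 15 subtests carry 17 first-order loadings in total, with AR as the only subtest that loads on more than one factor.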
Statistical procedures to test for measurement invariance across gender were based on the analysis of mean and covariance structure models using AMOS 25 (Arbuckle, 2017). For all confirmatory factor analyses conducted, the scale of latent variables was identified by fixing one factor loading of each factor to one (Keith, 2015). The subtest scaled scores were used for all required analyses and each subtest was initially checked for normality. In the female group, skewness for the data on the 15 WISC-V subtests ranged from −.34 to .09 and kurtosis ranged from −.26 to .71. In the male group, skewness ranged from −.39 to .12 and kurtosis ranged from −.35 to .60. The maximum likelihood procedure was chosen for model estimation since it is one of the most robust estimation procedures available (Hu & Bentler, 1998) and the examination of skewness and kurtosis did not reveal excessive deviation from normality. It has to be noted that maximum likelihood estimation is considered adequate for data with an absolute value less than 2 for skewness and an absolute value of less than 7 for kurtosis (see West, Finch, & Curran, 1995, for an overview).
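The normality screen described above can be illustrated with a short sketch. It assumes moment-based estimators and excess kurtosis (zero for a normal distribution); statistical packages may apply small-sample corrections, so their values can differ slightly. The function names and the sample scores are hypothetical.

```python
def skewness_and_kurtosis(scores):
    """Moment-based skewness and excess kurtosis of a list of scores."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

def ml_adequate(skew, kurt, max_skew=2.0, max_kurt=7.0):
    """Rule of thumb cited in the text: |skewness| < 2 and |kurtosis| < 7."""
    return abs(skew) < max_skew and abs(kurt) < max_kurt

# Hypothetical scaled scores for one subtest.
example_scores = [7, 8, 9, 10, 10, 10, 11, 12, 13]
skew, kurt = skewness_and_kurtosis(example_scores)
example_ok = ml_adequate(skew, kurt)

# The most extreme values reported for the two groups pass the screen.
reported_ok = ml_adequate(-0.39, 0.71)
```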
After indicating a reasonable fit of the hypothesized second-order five-factor model to the data of the male and female groups separately, the degree of measurement invariance across gender was examined by testing six different levels of nested models (e.g., Keith, 2015; Wicherts & Dolan, 2010). With respect to the hierarchical structure of testing for invariance, each of these levels was specified according to the number of parameters being constrained. Therefore, all nested models included were sequentially analyzed with decreasing numbers of parameters to be estimated due to the inclusion of parameter constraints one at a time (Jöreskog, 1993). Given that each subsequent model and its corresponding parameter constraints were nested in the previous model, invariance models became increasingly more restrictive. The initial and weakest level of invariance was tested at the configural level. At this first level, invariance required the overall baseline model structure (e.g., the number and pattern of factors) to be equal across gender. Provided that configural invariance could be established (M1), the model at the second level tested whether males and females responded to the test items in the same way. For this first-order metric invariance model (M2), all loadings of subtests (observed variables) on first-order factors (latent variables) were constrained to be equal across gender. At the third level, defined as the second-order metric invariance model (M3), all second-order factor loadings were additionally constrained to be equal across gender so that the scales of the latent factors as well as the unit of measurement could be considered invariant. For the scalar invariance model (M4) at the fourth level, all subtest intercepts were additionally constrained to be equal across gender. 
Establishing scalar invariance would then indicate that examinees with the same score on a certain latent variable should as well obtain the same score on the observed variable regardless of their gender. At the fifth level, the equivalence of variances in measurement errors was examined by constraining all error terms of the observed variables equal across gender. This way testing for residual invariance (M5) should clarify whether all gender-related differences on the observed variables were attributable to gender-related differences on the corresponding latent variables. For the sixth and final level of measurement invariance (M6), the disturbances of first-order factors were additionally constrained to be equal across gender. Although this level of invariance is rather optional, it may provide crucial information on whether unique first-order factor variances that are not shared by the second-order factor may be considered invariant across gender.
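The nesting of the six invariance levels can be summarized as a cumulative list of equality constraints. The sketch below only restates the sequence described above; the labels follow the text, while the data structure and function name are illustrative.

```python
# (label, level of invariance, constraint added at this level)
INVARIANCE_LEVELS = [
    ("M1", "configural", None),
    ("M2", "first-order metric", "first-order factor loadings"),
    ("M3", "second-order metric", "second-order factor loadings"),
    ("M4", "scalar", "subtest intercepts"),
    ("M5", "residual", "subtest error variances"),
    ("M6", "disturbance", "first-order factor disturbances"),
]

def cumulative_constraints(levels):
    """Return the full set of cross-group equality constraints per model."""
    constraints, accumulated = {}, []
    for label, _name, added in levels:
        if added is not None:
            accumulated.append(added)
        constraints[label] = list(accumulated)
    return constraints

constraints = cumulative_constraints(INVARIANCE_LEVELS)
```

Because each model keeps all constraints of its predecessor, every model is nested in the one before it, which is what permits the stepwise comparison of model fit.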
For evaluating and comparing the latent factor structures of the invariance models, a set of different indexes of model fit was analyzed to overcome the limitations of each single index (see Bentler & Bonett, 1980; Kline, 2010; Marsh, Balla, & Hau, 1996; McDonald & Ho, 2002; Thompson, 2000, for an overview). Each invariance model was jointly evaluated by examining the likelihood ratio chi-square statistic (Satorra & Saris, 1985), the standardized root mean square residual (SRMR), and the root mean square error of approximation (RMSEA) as absolute fit indexes. If χ2 is not significant, the null hypothesis that the observed covariance matrix equals the covariance matrix implied by the model cannot be rejected so that the model fit can be considered acceptable. An SRMR value of zero represents a perfect fit and values less than .08 are considered a good fit (Hu & Bentler, 1999). An RMSEA value of less than or equal to .01 indicates an excellent fit, a value of .05 corresponds to a good fit, and an RMSEA value of .10 should be used as a cutoff for poor-fitting models (MacCallum, Browne, & Sugawara, 1996). As recommended by F. F. Chen (2007), a change in RMSEA values greater than .015 should also be inspected as a measure of relative fit when comparing the fit of two structural models. The corresponding 90% confidence intervals for all RMSEA values are reported for the sake of completeness. Additional parsimonious fit indexes used were the chi-square to degrees of freedom ratio (χ2/df; Wheaton, Muthén, Alwin, & Summers, 1977), with a ratio of 5:1 or less indicating an acceptable model fit (Schumacker & Lomax, 2004), and the comparative fit index (CFI; Bentler, 1990), with values above .95 corresponding to a good fit (Hu & Bentler, 1999). 
Finally, the Akaike information criterion (Akaike, 1987) was used as an information-theoretic fit index to compare nested and nonnested models, with lower values indicating a better fit (Kaplan, 2000).
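To make the fit indexes more concrete, the sketch below computes several of them from chi-square values using common textbook definitions for a single group. These formulas are assumptions for illustration: software packages differ in details (e.g., N versus N − 1 in the RMSEA, or adjustments for multiple groups), and all input values are hypothetical rather than results of the present study.

```python
import math

def chi2_df_ratio(chi2, df):
    """Chi-square to degrees of freedom ratio."""
    return chi2 / df

def rmsea(chi2, df, n):
    """Root mean square error of approximation (single-group form)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_model, df_model, chi2_null, df_null):
    """Comparative fit index relative to the independence (null) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_null = max(chi2_null - df_null, d_model)
    return 1.0 if d_null == 0 else 1.0 - d_model / d_null

def aic(chi2, n_free_parameters):
    """AIC in the chi-square-based form (chi-square + 2q)."""
    return chi2 + 2 * n_free_parameters

# Hypothetical values for one model.
ratio = chi2_df_ratio(250.0, 80)
rmsea_value = rmsea(250.0, 80, 700)
cfi_value = cfi(250.0, 80, 5000.0, 105)
aic_value = aic(250.0, 40)
```

With these hypothetical inputs, the ratio falls below 5:1, the RMSEA lies near .055, and the CFI exceeds .95, so the model would be judged acceptable by the cutoffs listed above.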
Since there is no clear consensus concerning the most appropriate criterion to be used for determining evidence of measurement invariance (Byrne & Stewart, 2006), invariance models were evaluated from a traditional and a practical perspective (Keith, 2015). For the traditional perspective, differences between χ2 values of successive models (Δχ2) were used to test the hypothesis of whether the more restrictive invariance model is significantly different from the less restrictive one. The absence of a significant change in χ2 would then imply that both invariance models fit the data equally well. In addition, Cheung and Rensvold (2002) recommended using ΔCFI as a test of invariance that is independent of model complexity, sample size, and overall fit measures. As proposed by the authors, an absolute ΔCFI value above .01 was chosen as an indicator of a meaningful drop in model fit. For the overall evaluation of each level of invariance, Δχ2 and ΔCFI tests were thus jointly evaluated to reach a meaningful and unbiased decision. In cases where both fit indexes showed contrary results, decision making was mainly based on the more liberal ΔCFI (F. F. Chen et al., 2005) due to the oversensitivity of Δχ2 tests to large sample sizes (Kline, 2010). Moreover, moderate discrepancies from normality in the data might easily lead Δχ2 tests to reject the model. However, since using only a more liberal test may increase the overall risk of making Type II errors, the traditional Δχ2 test was still used in addition to the ΔCFI test as recommended by Byrne and Stewart (2006). If full invariance had to be rejected, suggestions regarding partial invariance were carefully considered (e.g., Byrne & Watkins, 2003). In cases where inadequate model fit was detected, it was improved by relaxing nonequivalent parameters, which were identified by an analysis of the critical ratios for the pairwise parameter comparisons provided by AMOS.
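The joint decision rule can be sketched as follows. The chi-square tail probability is computed with a simple series expansion so that the example needs only the standard library; the function names, the .05 significance level, and all numeric inputs are assumptions for illustration and do not reproduce the study's results.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate (series expansion)."""
    if x <= 0:
        return 1.0
    a, h = df / 2.0, x / 2.0
    term, total, n = 1.0, 1.0, 1
    while term > 1e-16 * total and n < 100000:
        term *= h / (a + n)
        total += term
        n += 1
    lower = math.exp(-h + a * math.log(h) - math.lgamma(a + 1.0)) * total
    return max(0.0, 1.0 - lower)

def compare_nested(chi2_restricted, df_restricted, cfi_restricted,
                   chi2_free, df_free, cfi_free,
                   alpha=0.05, cfi_cutoff=0.01):
    """Evaluate one invariance step with the chi-square difference and delta-CFI."""
    d_chi2 = chi2_restricted - chi2_free
    d_df = df_restricted - df_free
    p = chi2_sf(d_chi2, d_df)
    d_cfi = cfi_restricted - cfi_free
    chi2_ok = p >= alpha
    cfi_ok = abs(d_cfi) <= cfi_cutoff
    # As described in the text, conflicting results are resolved
    # mainly in favor of the more liberal delta-CFI criterion.
    return {"d_chi2": d_chi2, "d_df": d_df, "p": p, "d_cfi": d_cfi,
            "chi2_ok": chi2_ok, "cfi_ok": cfi_ok, "holds": cfi_ok}

# Hypothetical comparison of a more restrictive with a less restrictive model.
result = compare_nested(520.0, 170, 0.958, 500.0, 160, 0.960)
```

In this hypothetical case the Δχ2 test is significant while the CFI drops by only .002, which is the kind of conflict in which the decision would rest mainly on ΔCFI.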
Since the commonly used reliability coefficient Cronbach’s alpha has been considered to be limited by its unrealistic theoretical assumptions (Dunn, Baguley, & Brunsden, 2014; Sijtsma, 2009; Yang & Green, 2011), the use of model-based reliability estimates has generally been recommended as an alternative measure (Brunner et al., 2012; Reise, 2012; Rodriguez, Reise, & Haviland, 2016). In particular, omega (ω), omega-hierarchical (ωH), and omega-hierarchical subscale (ωHS) coefficients have frequently been reported in previous research as a suitable way to estimate reliability for multidimensional constructs (e.g., Canivez et al., 2016; McDonald, 1999; Watkins, 2013). Reliability estimation using ω is based on the proportion of total systematic variance in each factor attributed to the blend of general and subscale variance. The ωH coefficient represents the higher order factor reliability estimate adjusted for the subscale variance, whereas ωHS estimates the reliability of each lower order factor independent of all other subscale and general factor variances.
Although omega coefficients have been primarily referred to as model-based reliability estimates, they may also be regarded as validity estimates as they allow for conclusions regarding the plausibility of interpreting specific model factors (Dombrowski et al., 2018). In the case of the WISC-V, this means that ωHS controls for the part of reliability attributable to the FSIQ and should thus be useful for judging the utility of each index score (see Reise, 2012, for an overview). A robust ωHS coefficient would indicate that most of the reliable variance of the corresponding index is rather independent of the FSIQ, so that individual cognitive strengths and weaknesses may be meaningfully interpreted on the index level (Brunner et al., 2012). In contrast, low values of ωHS would suggest that most of the reliable variance of the indexes is due to the FSIQ. This would then compromise the interpretability of index scores as unambiguous indicators of specific cognitive domains (Rodriguez et al., 2016). Although there are no absolute standards for evaluating the magnitude of ω, ωH, or ωHS, it has been suggested that values near .75 might be preferred and that values should not be less than .500 (Reise, Bonifay, & Haviland, 2013). In the present study, model-based reliability was estimated using ωH and ωHS coefficients in order to describe how precisely index scores reflect their intended dimensions and to determine whether these scores provide unique information above and beyond the FSIQ (see Rodriguez et al., 2016, for an application). Omega coefficients were supplemented with the construct replicability coefficient H (Hancock & Mueller, 2001) to estimate how adequately the latent variables are represented by their corresponding indicators. Values should not be less than .700 for the H coefficient to ensure a high quality of indicators and replicability of latent variables (Hancock & Mueller, 2001; Rodriguez et al., 2016). 
High H coefficient values indicate that the corresponding latent variables are well-defined by their indicators and will possess more stability across studies, whereas latent variables with low H coefficient values are more likely to change across studies.
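A rough sketch of how these coefficients can be computed from orthogonalized standardized loadings is given below. It assumes that each subtest loads on the general factor and on exactly one group factor (no cross-loadings) and uses the formulas as commonly presented in the literature; the function names and loading values are hypothetical and do not reproduce the Omega program.

```python
def omega_h(general, group, factor_of):
    """Omega-hierarchical for the general factor."""
    unique = [1 - g * g - s * s for g, s in zip(general, group)]
    group_part = sum(
        sum(s for s, f in zip(group, factor_of) if f == k) ** 2
        for k in set(factor_of)
    )
    g_part = sum(general) ** 2
    return g_part / (g_part + group_part + sum(unique))

def omega_hs(general, group, factor_of, k):
    """Omega-hierarchical subscale for group factor k."""
    idx = [i for i, f in enumerate(factor_of) if f == k]
    g_part = sum(general[i] for i in idx) ** 2
    s_part = sum(group[i] for i in idx) ** 2
    unique = sum(1 - general[i] ** 2 - group[i] ** 2 for i in idx)
    return s_part / (g_part + s_part + unique)

def h_coefficient(loadings):
    """Construct replicability H for one factor."""
    ratio = sum(l * l / (1 - l * l) for l in loadings)
    return 1.0 / (1.0 + 1.0 / ratio)

# Hypothetical orthogonalized loadings for four subtests on two factors.
general = [0.72, 0.63, 0.48, 0.40]
group = [0.35, 0.31, 0.36, 0.30]
factor_of = ["A", "A", "B", "B"]

w_h = omega_h(general, group, factor_of)
w_hs_a = omega_hs(general, group, factor_of, "A")
h_general = h_coefficient(general)
```

In this hypothetical example ωH is about .61 while ωHS for factor A is about .14, the pattern that would suggest placing interpretive weight on the general factor rather than on the group factor.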
The Omega program (Watkins, 2013) was used to obtain all omega coefficients, H coefficients, and other sources of variance according to an orthogonalized higher order factor model with five first-order factors. To derive decomposed variance sources from the second-order five-factor model, the SL procedure was initially applied using the MacOrtho program (Watkins, 2004).
Results
Baseline Model Identification
Model identification was based on the WISC-V subtest variance–covariance matrix presented in Table 1. Table 2 and Table 3 show all phases and steps required for the present measurement invariance analyses. As indicated by all goodness-of-fit indexes reported for the single-group confirmatory factor analyses in Phase 1 (see Table 2), the initially hypothesized second-order five-factor model fit both male and female groups equally well. Invariant variance–covariance matrices across gender were thus supported, suggesting that the WISC-V factor structure should be quite similar for males and females. Since almost all goodness-of-fit indexes were found to be within an acceptable range, the second-order five-factor model served as the baseline model for all subsequent invariance models.
Variances, Covariances, and Means for Both Genders on the 15 WISC-V Subtests.
Note. WISC-V = Wechsler Intelligence Scale for Children–Fifth Edition; SI = Similarities; VC = Vocabulary; IN = Information; CO = Comprehension; BD = Block Design; VP = Visual Puzzles; MR = Matrix Reasoning; FW = Figure Weights; AR = Arithmetic; DS = Digit Span; PS = Picture Span; LN = Letter–Number Sequencing; CD = Coding; SS = Symbol Search; CA = Cancellation. Variances, covariances, and means for the female sample are indicated by values in parentheses below the diagonal; variances, covariances, and means for the male sample are indicated by values without parentheses above the diagonal.
WISC-V Single-Sample Goodness-of-Fit Statistics for the Baseline Model Analyses (Phase 1).
Note. df = degrees of freedom; SRMR = standardized root mean square residual; RMSEA = root mean square error of approximation; CI = confidence interval for RMSEA; CFI = comparative fit index; AIC = Akaike’s information criterion.
Multigroup Goodness-of-Fit Indices and Model Comparisons for the WISC-V Second-Order Factor Model.
Note. M1 = unconstrained baseline model; M2 = model with equal loadings on all first-order factors; M3 = M2 with equal loadings on the second-order factor; M4 = M3 with equal subtest intercepts; M4† = M4 with relaxed subtest intercepts (IN, FW, CD, and CA); M5† = M4† with equal error variances on all subtests; M6† = M5† with equal disturbances of first-order factors; df = degrees of freedom; SRMR = standardized root mean square residual; RMSEA = root mean square error of approximation; CI = confidence interval for RMSEA; CFI = comparative fit index; AIC = Akaike’s information criterion; ΔRMSEA = difference in RMSEA between compared models; ΔCFI = difference in CFI between compared models; Δχ2 = chi-square difference between compared models; Δdf = difference in degrees of freedom between compared models.
Multigroup Invariance Analyses
In Phase 2, MGCFA were conducted in sequence by constraining nested invariance models in a stepwise manner. In a first step, configural invariance was tested by comparing all factor patterns of the unconstrained baseline model for both genders simultaneously. As shown in Table 3, the configural invariance model (M1) provided an acceptable fit to the data, suggesting that both genders shared the same WISC-V factor patterns with subtests loading on the same corresponding latent factors.
With the gender-invariant factor pattern established, metric invariance was then tested by constraining all first-order factor loadings to be equal across males and females (M2). When comparing M2 and M1, the Δχ2 test indicated a significant deterioration of fit (Δχ2 = 35.622, Δdf = 12, p < .001). Since the CFI values showed virtually no difference between the compared models (ΔCFI = −.002), and other fit indexes also exhibited good fit for M2, it was concluded that there were no substantial differences in first-order factor loadings accounted for by gender.
To complement the metric invariance approach, cross-group constraints on the second-order factor loadings were additionally imposed in the subsequent model (M3). The constrained model fit the data well as indicated by the nonsignificant difference of χ2 values (Δχ2 = 9.622, Δdf = 4, p = .047) and an almost imperceptible difference of CFI values (ΔCFI = −.001). Consequently, the strengths of the linear relationships between the higher order factor FSIQ and the underlying five first-order factors were considered invariant across gender. In testing for scalar invariance, all subtest intercepts were additionally constrained to be equal (M4). Compared with M3, the inclusion of subtest intercept constraints substantially reduced the overall model fit according to both the χ2 values (Δχ2 = 223.725, Δdf = 15, p < .001) and CFI values (ΔCFI = −.021). For this reason, full scalar invariance had to be rejected and an additional partial scalar invariance model (M4†) was specified by closely examining and comparing all subtest intercepts between both groups. Critical ratios for the pairwise parameter comparisons indicated that the misfit could be attributed to unequal intercepts of IN, FW, CD, and CA. In the female group, subtest intercepts were estimated as 7.71 for IN, 8.63 for FW, 9.61 for CD, and 9.64 for CA, respectively. Estimated subtest intercepts in the male group were slightly higher for IN (8.55) and for FW (9.30), but lower for CD (8.75) and CA (8.92). As soon as these parameters were freed in M4†, model fit improved substantially when compared with M3. Although the χ2 difference still appeared to be significant when relaxing intercept constraints on IN, FW, CD, and CA at once, model comparisons revealed almost no difference in CFI values, thus indicating an acceptable fit for M4† (Δχ2 = 66.251, Δdf = 11, p < .001; ΔCFI = −.005). 
Subtest intercepts of IN, FW, CD, and CA were found not to be invariant across gender, whereas scalar invariance could be established for all other WISC-V subtests. For all subsequent invariance models, parameter restrictions required were based on the partial scalar invariance model (M4†) with relaxed nonequivalent subtest intercepts. In M5†, the constraint of equivalent error variances was further imposed. As indicated by the Δχ2 and ΔCFI tests, the model fit the data well (Δχ2 = 23.201, Δdf = 15, p = .080; ΔCFI = −.002). Therefore, error variances were considered not to vary with gender. In the final step and in addition to all previous constraints, disturbances of all first-order factors were set to be equal across gender (M6†). The Δχ2 test between M5† and M6† turned out to be significant (Δχ2 = 23.926, Δdf = 5, p < .001). Once again, there was no substantial difference in CFI values (ΔCFI = −.002) and all other fit indexes were within acceptable range, so that the hypothesis of gender-invariant disturbances of first-order factors was tenable. Thus, unique first-order factor variances that are not shared by the common second-order factor turned out to be equal across males and females.
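The p values of the nested model comparisons reported above can be checked directly from the Δχ2 and Δdf values. The following sketch uses only the Python standard library. The helper name `chi2_sf` is hypothetical, and the recurrence for the upper tail of the chi-square distribution is a standard identity rather than the procedure used by the authors' software.

```python
from math import erfc, exp, gamma, sqrt


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Closed-form starting points: df = 2 (shape 1) or df = 1 (shape 1/2).
    if df % 2 == 0:
        shape, q = 1.0, exp(-z)
    else:
        shape, q = 0.5, erfc(sqrt(z))
    # Recurrence Q(a + 1, z) = Q(a, z) + z^a * exp(-z) / Gamma(a + 1).
    while shape < df / 2.0:
        q += z ** shape * exp(-z) / gamma(shape + 1.0)
        shape += 1.0
    return q


# Delta chi-square and delta df values as reported in the text.
comparisons = {
    "M2 vs. M1": (35.622, 12),
    "M3 vs. M2": (9.622, 4),
    "M4 vs. M3": (223.725, 15),
    "M4† vs. M3": (66.251, 11),
    "M5† vs. M4†": (23.201, 15),
    "M6† vs. M5†": (23.926, 5),
}

for label, (delta_chi2, delta_df) in comparisons.items():
    p = chi2_sf(delta_chi2, delta_df)
    print(f"{label}: delta chi2 = {delta_chi2}, delta df = {delta_df}, p = {p:.3f}")
```

Because the Δχ2 test is sensitive to sample size, these p values are read alongside ΔCFI in the analyses above rather than on their own.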
Although invariance could not be fully established at all steps of the analyses, it was concluded, in due consideration of the complexity of the model and the strictness of the test, that the five primary indexes of the German WISC-V feature at least acceptable levels of invariance across the male and female groups. Accordingly, differences in subtest scores are, at least for the most part, attributable to differences in the underlying latent dimensions, and 11 out of 15 WISC-V subtest scores are not biased by gender.
Model Parameter Estimations and Model-Based Reliability
Standardized parameter coefficients based on the most restrictive invariance model (M6†) are presented in Figure 1. All estimated values included were theoretically sound and, due to metric invariance, equivalent across gender. Consistent with the literature (e.g., H. Chen et al., 2015; Weiss et al., 2013), AR could be confirmed as a mixed measure loading on VCI (.14), FRI (.27), and WMI (.40). Among all five first-order factors, FRI had the highest loading (.98) and PSI had the lowest loading (.54) on the FSIQ.

Second-order five-factor model including standardized estimations for both genders on the 15 WISC-V subtests (M6† in Table 3).
The second-order five-factor model in the present study postulates that the lower order factors fully mediate the relationships between the higher order factor and all indicators included (see Cattell, 1965; Lee & Cadogan, 2013, for an overview). The SL procedure was additionally conducted on the second-order five-factor model to derive direct relationships between the second-order factor and the indicators. This technique enabled the evaluation of model-based reliability, construct replicability, and other sources of variance.
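The core of the SL transformation can be sketched in a few lines. In a second-order model, a subtest's loading on the general factor is the product of its first-order loading and the second-order loading of its factor, and its residualized group-factor loading is the first-order loading scaled by the square root of the factor's disturbance variance. The second-order loadings of .98 (FRI) and .54 (PSI) are taken from the text; the first-order loading of .70 and the function name are hypothetical, and the sketch ignores cross-loadings such as those of AR.

```python
from math import sqrt


def schmid_leiman(first_order_loading, second_order_loading):
    """Return (general loading, residualized group loading) for one subtest."""
    general = first_order_loading * second_order_loading
    # Only the part of the first-order factor not explained by the
    # second-order factor remains as a group factor.
    residual = first_order_loading * sqrt(1.0 - second_order_loading ** 2)
    return general, residual


hypothetical_loading = 0.70  # illustrative value, not an estimate from Figure 1

fri_general, fri_group = schmid_leiman(hypothetical_loading, 0.98)
psi_general, psi_group = schmid_leiman(hypothetical_loading, 0.54)

print(f"FRI: general = {fri_general:.3f}, group = {fri_group:.3f}")
print(f"PSI: general = {psi_general:.3f}, group = {psi_group:.3f}")
```

With a second-order loading of .98, only about 4% of the first-order factor variance (1 − .98²) is left for the group factor, so an identical first-order loading yields a much smaller residualized loading for FRI than for PSI.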
According to the SL orthogonalized higher order factor model, Table 4 shows that almost all of the 15 WISC-V subtest indicators featured reasonable loadings on the second-order factor and on their corresponding first-order factors. Decomposed variance estimates further indicated that the explained common variance (ECV) uniquely associated with the first-order factors was rather small, ranging from .008 (FRI) to .130 (PSI). By comparison, considerably more common variance could be explained by the second-order factor (.648). This means that the greatest portion of the common variance in the subtest indicators, about 65%, was uniquely associated with the FSIQ. Values for the ω coefficient ranged from .554 (FRI) to .906 (FSIQ), indicating that the common variances in most of the individual unit-weighted composite scores were attributable to the second-order factor and their underlying first-order factors. Consistent with the ECV estimates, however, the analysis revealed rather small ωHS coefficient values for all first-order factors, ranging from .043 (FRI) to .456 (PSI), when compared with the ωH coefficient value of .798 for the second-order factor. While the second-order factor turned out to be precisely measured and not excessively influenced by variability in other factors, all ωHS coefficient values for the first-order factors were below the required minimum criterion of .500 (Reise et al., 2013). The H coefficient values, which are also presented in Table 4, provide the correlations between the latent factors and optimally weighted composite scores (Rodriguez et al., 2016). While the H coefficient for the second-order factor (.896) indicated that the FSIQ was well defined by the 15 subtest indicators, all H coefficient values for the first-order factors, ranging from .064 (FRI) to .625 (PSI), failed to meet the required minimum criterion of .700 (Hancock & Mueller, 2001; Rodriguez et al., 2016). 
Thus, the primary WISC-V indexes appeared not to be adequately defined by their subtest indicators. Consequently, the 15 WISC-V subtests cannot be suggested to produce stable scores on their related indexes across studies.
Sources of Variance in the WISC-V 15 Primary Subtests for the German Standardization Sample (N = 1,411) According to a SL Orthogonalized Higher Order Factor Model With Five First-Order Factors.
Note. WISC-V = Wechsler Intelligence Scale for Children–Fifth Edition; SL = Schmid and Leiman procedure; VCI = Verbal Comprehension index (first-order factor); VSI = Visual Spatial index (first-order factor); FRI = Fluid Reasoning index (first-order factor); WMI = Working Memory index (first-order factor); PSI = Processing Speed index (first-order factor); FSIQ = Full Scale IQ (second-order factor); SI = Similarities; VC = Vocabulary; IN = Information; CO = Comprehension; BD = Block Design; VP = Visual Puzzles; MR = Matrix Reasoning; FW = Figure Weights; AR = Arithmetic; DS = Digit Span; PS = Picture Span; LN = Letter–Number Sequencing; CD = Coding; SS = Symbol Search; CA = Cancellation; b = loading of subtest on factor; S2 = variance explained; h2 = communality; u2 = unique variance; ECV = explained common variance; ω = omega coefficient; ωH = omega-hierarchical coefficient (general factor); ωHS = omega-hierarchical coefficient (group factors); H = replicability index (construct reliability); PUC = percentage of uncontaminated correlations.
Discussion
The major aim of the present study was to determine measurement invariance of the WISC-V latent structure, which was proposed by the test publishers, across large samples of male and female children and adolescents. The sources of variance (e.g., model-based reliability and construct replicability) in the 15 WISC-V subtests were additionally analyzed using the SL procedure to evaluate the adequacy of the FSIQ and the five primary WISC-V indexes.
The first and most crucial finding is that the hypothesized second-order five-factor model could be shown to fit the data from both genders equally well. In particular, configural and metric invariance are often suggested as the most important steps in invariance testing (Keith & Reynolds, 2012). First- and second-order factor loadings were found to be gender-invariant regardless of statistical criterion (ΔCFI or Δχ2). This means that scales of the WISC-V latent variables were the same and the unit of measurement was identical. For the German WISC-V, it could be demonstrated that the subtests under examination measured the same underlying theoretical constructs across gender, that the strength of relationships among factors and subtests did not differ by gender, and, for the most part, that each first-order factor featured the same validity across gender. Therefore, comparisons of WISC-V index scores between males and females can be regarded as permissible.
Even though full invariance across gender could not be established in all steps of the analyses, results strongly support the assumption of measurement invariance as long as specific subtest intercepts are allowed to differ across gender. Scalar invariance is said to be a precondition for meaningful mean score comparisons across groups. In practice, however, partial scalar invariance is not uncommon (e.g., Immekus & Maller, 2010) and full invariance of subtest intercepts is hard to achieve in general (Keith & Reynolds, 2012). It is therefore not unusual that the traditional approach applied to test for full measurement invariance leads to the rejection of full scalar invariance. As a consequence, all comparisons including observed and latent means could then be biased (see Millsap, 2011, for an overview). Steinmetz (2013) demonstrated that unequal subtest intercepts may indeed substantially affect differences in factor means and the probability of significant differences. On the other hand, Muthén and Asparouhov (2013) recommended testing for approximate rather than full measurement invariance. They suggested that traditional full measurement invariance tests are often too strict and that allowing for partial scalar invariance may still permit the conclusion that the underlying latent abilities are comparable. In line with the conception of Muthén and Asparouhov, small between-group differences in subtest intercepts that were found for IN, FW, CD, and CA should not preclude the usefulness of these subtests in measuring the underlying latent abilities. One explanation for gender differences in the subtest intercepts could be that males showed slightly higher scores on IN and FW, but slightly lower scores on CD and CA than females. This means that IN and FW might be slightly harder for females, whereas CD and CA might be slightly harder for males than would be expected for a given score on the underlying latent factor. 
Such differences can be a result of the critical subtests measuring different narrow abilities than the other subtests loading on the same latent variable. Previous literature on gender differences in a variety of cognitive abilities already discussed the roles of gender-specific modality preferences, psychosocial and biological factors, strategies in information processing, and genetic effects (Daseking, Petermann, & Waldmann, 2017; Goldbeck, Daseking, Hellwig-Brida, Waldmann, & Petermann, 2010; Lepach, Reimers, Pauls, Petermann, & Daseking, 2015; Lynn, 1994; Pauls, Petermann, & Lepach, 2013a). Following the recommendations by Byrne, Shavelson, and Muthén (1989) that full scalar invariance is not an indispensable prerequisite for further tests of invariance, subsequent invariance analyses could be conducted based on the partial scalar invariance model. Finally, at least 11 out of 15 WISC-V subtests can be suggested to produce index scores that are fully invariant across gender.
Since some WISC-V subtests may require more than one cognitive ability, previous studies already reported mixed loadings for AR on different latent factors regardless of gender (e.g., H. Chen et al., 2009; H. Chen et al., 2015; Weiss et al., 2013). The present study also provides evidence that such cross-loadings exist for both males and females. Although slightly deteriorating the parsimony of simple structures, partial contributions of Fluid Reasoning, Working Memory, and Verbal Comprehension should be considered when interpreting performances on the AR subtest regardless of gender.
In line with the literature, the current findings also suggest a very strong relationship between FRI and the second-order factor FSIQ. Since fluid abilities are known to be essential for overall human cognition, there are studies reporting considerably high loadings on the higher order factor (e.g., Keith et al., 2006). According to the SL orthogonalized higher order factor model, the highest loadings on the second-order factor were found for SS (.741), followed by CD (.696), AR (.671), VC (.644), and LN (.604). CA had the lowest loading on the second-order factor (.205). These findings are for the most part comparable to the indirect FSIQ-relations reported for the U.S. versions of the WISC-IV (Keith et al., 2006) and the WISC-V (H. Chen et al., 2015). Since VP (.570) and FW (.457) were found to feature sufficiently strong associations with the second-order factor, these new subtests can be regarded as relatively reasonable indicators of the FSIQ. Although all subtest loadings on the second-order factor appeared to be significantly different from zero, CA (.205) featured the lowest loading of all subtests under examination. This finding is not surprising, given that this secondary subtest was already not used to derive the FSIQ in the WISC-IV. In practice, secondary subtests of the WISC-V can be additionally administered to provide a broader sampling of intellectual functioning and to yield more information for clinical issues (see Wechsler, 2014a, for a detailed description).
Based on the given second-order five-factor model structure, the present study supported gender invariance, at least for most of the German WISC-V subtests. When comparing the male and female groups, 11 out of 15 WISC-V subtests showed an almost identical latent construct structure, nearly the same strength of relationships among latent factors and subtests, the same validity for each of the five first-order factors, and almost similar subtest intercepts. Four subtest intercepts were identified to slightly violate the assumption of invariance between the male and female group. Therefore, scores on IN, FW, CD, and CA cannot yet be considered comparable across genders. Even though gender invariance could be established for the most part, some limitations of the present study also highlight that further research on the overall WISC-V factor structure is needed to provide more evidence for structural validity of the recently published German WISC-V test battery.
Despite the insights obtained, the most crucial limitations of the present study still deserve attention. First, single-group CFA and MGCFA based on classical test theory were conducted to test for measurement invariance. However, it should be noted that other approaches based on item response theory could provide different results (e.g., Kim, Kim, & Kamphaus, 2010; Raju, Laffitte, & Byrne, 2002). Comparing results based on both methodologies would thus be meaningful. Additionally, the factor structure of the German WISC-V that has been preferred by the test publishers has not yet been compared with a sufficient number of alternative models, such as models with supplementary cross-loadings or different factor structures (e.g., a first-order five-factor model with intercorrelations between latent factors). Although the present study supported measurement invariance across gender on a specific WISC-V factor structure presented in the WISC-V technical manual, no conclusions can be drawn with respect to gender invariance on bifactor models or models with fewer factors.
Moreover, those model specifications and comparisons presented in the technical manuals of both the WISC-V USA and the German WISC-V do not provide any information about the decomposed sources of variances between the FSIQ and the first-order factors and the publishers failed to report how scales were set for all latent variables included. Since crucial model characteristics such as the df do not entirely correspond to what would have been expected, some parameters might have been fixed in certain models prior to estimation and comparison, but were not reported in the technical manual of the WISC-V USA. Finally, Beaujean (2016) pointed out that other modelling procedures such as the modified effects coding used by the WISC-V publishers could likely lead to a rejection of alternative model solutions that also fit the data well.
Since the primary WISC-IV index Perceptual Reasoning has been replaced by Visual Spatial and Fluid Reasoning in the WISC-V, there is an ongoing discussion on this subject (see Canivez & Kush, 2013, for an overview). In fact, EFA results of recent studies demonstrated that the WISC-V factor structure might be best described by four first-order factors (Canivez et al., 2016). A five-factor structure, however, has not yet been clearly identified in confirmatory factor analyses (Canivez et al., 2017; Dombrowski et al., 2018). It should also be noted that the second-order five-factor model proposed by the test publishers could not be replicated in analyses of the Canadian (Watkins, Dombrowski, & Canivez, 2017), French (Lecerf & Canivez, 2017), or Spanish versions of the WISC-V (Fenollar-Cortés & Watkins, 2019). And although the second-order five-factor model that was claimed in the WISC-V manual tends to sufficiently fit the data in the present study (see Table 2), the high loading of FRI on the FSIQ (.98) might at least be an indication for the redundancy of a separate FRI factor. The inadequacy of FRI as a separate factor was also supported by the present analyses of variance sources according to the SL orthogonalized higher order factor model. Results showed that the FRI explained by far the smallest portion of common subtest variance (ECV = .008) and featured the weakest model-based reliability (ωHS = .043) as well as the lowest degree of construct replicability (H = .064) of all first-order factors included. This does not support a separation of single factors and index scores for VSI and FRI as promoted in the WISC-V manual. It is worth noting that David Wechsler originally viewed intelligence as an effect rather than a cause, and asserted that nonintellectual factors, such as personality, contribute to the development of each person’s intelligence. 
Consequently, the Wechsler intelligence scales have not been developed based on a particular theory of intelligence, but rather as single test batteries that combine certain sets of popular subtests. Reasonable post hoc descriptions of the resulting combined scores were usually provided afterward. The subsequent revisions and extensions have followed this approach closely. In view of this, it is even conceivable that the FRI might just comprise some multidimensional cognitive abilities rather than representing unidimensional ones. Accordingly, the weak SL reparameterized subtest loadings on the FRI may suggest that the inclusion of MR (.147), FW (.149), and AR (.151) as indicators for Fluid Reasoning could be driven more by pragmatic reasons than being theoretically justified and empirically supported. A more precise measurement of Fluid Reasoning as distinct from the FSIQ may therefore require the creation and inclusion of more unique indicators.
Canivez and Kush (2013) addressed some problems with the AR subtest loading on more than one primary index. Although the authors stated that this subtest might be considered a measure of quantitative reasoning and serves as a good indicator for the FSIQ, they recommend excluding AR from future versions of the Wechsler scales (Weiss et al., 2013). One major issue with AR is possibly that there are still no similar measures of quantitative reasoning included in the recent version of the WISC, so that either additional quantitative reasoning tasks should be created or AR should be abandoned altogether (Canivez et al., 2016; Canivez & Kush, 2013; Watkins & Ravert, 2013). From a psychometric perspective, however, an exclusion of AR without any appropriate substitution would lead to a decrease in the total number of valid indicators. Without any additional parameter constraints based on theoretical considerations, this could then significantly worsen the overall model fit. In summary, the issues associated with the inclusion of a separate Fluid Reasoning factor as well as maintaining AR as a cross-loading subtest clearly demonstrate that more research utilizing both CFA and EFA procedures is needed to provide a profound insight into the nature of the German WISC-V model structure.
The assessment of variance sources according to a SL orthogonalized higher order factor model also revealed the dominance of the FSIQ and the limited unique measurement of the five primary WISC-V indexes. Thus, the second-order factor (ECV = .648) accounted for between 5 and 81 times as much common subtest variance as any individual WISC-V first-order factor and nearly twice as much common subtest variance as all five WISC-V first-order factors combined. These results are in line with previous studies, which have consistently reported that the greatest portions of common variance are associated with the second-order factor and smaller portions of common variance are apportioned to the first-order factors. This has been documented in studies of the WISC-IV (Bodin, Pardini, Burns, & Stevens, 2009; Canivez, 2014; Keith, 2005; Nakano & Watkins, 2013; Styck & Watkins, 2014; Watkins et al., 2006), as well as in studies of the WISC-V (Canivez et al., 2016, 2017; Fenollar-Cortés & Watkins, 2019; Lecerf & Canivez, 2017; Watkins et al., 2017). When analyzing model-based reliability estimates, the ωHS coefficients for the five first-order factors appeared to be rather low in value, ranging from .043 (FRI) to .456 (PSI), compared with the ωH coefficient for the second-order factor (.798). Construct replicability estimates (H) were also found to be unacceptably low for each of the five group factors, ranging from .064 (FRI) to .625 (PSI), thus indicating that the WISC-V indexes might be extremely limited for measuring unique cognitive constructs (Brunner et al., 2012; Hancock & Mueller, 2001; Reise, 2012; Rodriguez et al., 2016).
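The variance ratios cited above follow from simple arithmetic on the reported ECV values. The short check below uses the three ECV values given in the text and assumes, as the definition of ECV implies, that the shares of the second-order factor and of all first-order factors sum to 1. The variable names are illustrative only.

```python
# ECV values as reported in the text.
ecv_general = 0.648          # second-order factor (FSIQ)
ecv_smallest_group = 0.008   # FRI
ecv_largest_group = 0.130    # PSI

# Ratio of general-factor ECV to the largest and smallest group-factor ECV.
ratio_vs_largest = ecv_general / ecv_largest_group
ratio_vs_smallest = ecv_general / ecv_smallest_group

# Assumption: the remaining common variance belongs to the first-order factors.
ecv_groups_combined = 1.0 - ecv_general
ratio_vs_combined = ecv_general / ecv_groups_combined

print(f"vs. largest group factor:  {ratio_vs_largest:.2f}")
print(f"vs. smallest group factor: {ratio_vs_smallest:.2f}")
print(f"vs. all group factors:     {ratio_vs_combined:.2f}")
```

The first two ratios correspond to the range of roughly 5 to 81 times, and the third to the statement that the second-order factor explains nearly twice as much common variance as the first-order factors together.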
It has been concluded that the strong reliability estimates for the FSIQ permit individual interpretation on the superordinate factor level, whereas reliability estimates for each single index were deemed too weak for individual interpretation on the subscale level (Reise, 2012; Reise et al., 2013). Despite measurement invariance, the implication of the present reliability estimation is that primary interpretive weight should be placed on the FSIQ and individual index scores should be at least evaluated with caution, if at all. Although the primary WISC-V indexes may produce scores that are for the most part comparable across gender, it is still indeterminate whether these indexes represent actual attributes that exist outside the Wechsler scales. Since replacing the Perceptual Reasoning factor by separate Visual Spatial and Fluid Reasoning factors has already been shown to be inadequate for several versions of the WISC-V, interpreting WISC-V index scores beyond the FSIQ at least bears the risk of overinterpreting or misinterpreting the true levels of cognitive functioning.
Footnotes
Author’s Note
Monika Daseking is now affiliated with Helmut-Schmidt-University/University of the Federal Armed Forces Hamburg, Hamburg, Germany.
Declaration of Conflicting Interests
The author(s) declared potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Franz Petermann is the editor of the German WISC-V adaptation. The other authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
