Abstract
Measurement invariance of the Wechsler Intelligence Scale for Children, Fifth Edition (WISC-V) 10-subtest primary battery was evaluated across sex, age (6–8, 9–11, 12–14, and 15–16 year-olds), and three diagnostic groups (attention-deficit/hyperactivity disorder, anxiety, and encephalopathy) within a large clinical sample (N = 5359) referred to a children’s specialty hospital. Competing models were tested using confirmatory factor analysis (CFA), and a five-factor oblique model corresponding to the publisher’s hypothesized first-order measurement model (i.e., verbal comprehension, fluid reasoning, visual-spatial, working memory, and processing speed) was found to have the best model fit. Multigroup CFA was subsequently used to evaluate progressively more restrictive constraints on the measurement model. Results indicated that full metric invariance was attained across all three grouping variables. Full scalar invariance was attained for sex and diagnostic group, and partial scalar invariance was attained for age group. The results of this study provide support for the first-order scoring structure of the five WISC-V factors in the 10-subtest primary battery with this large clinical sample.
Measurement invariance is an important but often underutilized aspect of construct validation for cognitive ability instruments. It assists in determining whether the measurement model holds across different groups nested within a broader target population being studied and, ultimately, whether resulting index/composite scores can be confidently interpreted across groups in the same way, if at all (Meredith, 1993). Measurement invariance may be conceptualized as a more technical extension of structural validity. Instead of focusing on a single group as occurs in traditional exploratory and confirmatory factor analysis (CFA) (see, e.g., Dombrowski et al., 2018a, 2018b), it works by constraining selected parameters to equality, permitting a “stress test” on various elements of an instrument’s structure across the groups studied with the ultimate goal of determining whether an instrument’s scores may be compared among those groups. The most common way to assess measurement invariance is through multigroup confirmatory factor analysis (MGCFA; Davidov, Meuleman, Cieciuch, Schmidt, & Billiet, 2014). In MGCFA, models are fitted to data using different sets of constraints corresponding to different levels or types of invariance. Researchers typically differentiate among three levels of measurement invariance that are sufficient for conducting most comparative data analyses: configural, metric, and scalar invariance (Vandenberg & Lance, 2000; see Meredith, 1993 for additional, more restrictive levels that are not commonly applied).
After a plausible baseline measurement model is identified, the first step in MGCFA is to determine whether the pattern of the loadings is the same across groups (configural invariance). If the same pattern is specified in each group and no meaningful deterioration in representative fit statistics is observed, it can be reasonably concluded that the test has equal form (or configuration) across groups. Once equal form is established, the equivalence of the magnitude of the loadings is evaluated (metric invariance). Within the assessment literature, metric invariance is frequently referred to as a form of weak invariance (Putnick & Bornstein, 2016). Finally, the subtest intercepts are evaluated to determine whether they are equivalent across groups (scalar invariance). If so, then an instrument is thought to have attained strong invariance (Putnick & Bornstein, 2016), and the factor scores may be confidently compared across groups.
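The three levels can be written compactly. The notation below is a conventional sketch and is not taken from the article: for child i in group g, x is the vector of the 10 subtest scores, τ the vector of subtest intercepts, Λ the loading matrix, ξ the vector of the five latent factors, and δ the vector of residuals.

```latex
\begin{align*}
\text{Model in group } g:\quad & \mathbf{x}_{ig} = \boldsymbol{\tau}_g + \boldsymbol{\Lambda}_g\,\boldsymbol{\xi}_{ig} + \boldsymbol{\delta}_{ig} \\
\text{Configural:}\quad & \boldsymbol{\Lambda}_g \text{ has the same pattern of free and zero loadings for every } g \\
\text{Metric (weak):}\quad & \boldsymbol{\Lambda}_1 = \boldsymbol{\Lambda}_2 = \cdots = \boldsymbol{\Lambda}_G \\
\text{Scalar (strong):}\quad & \boldsymbol{\Lambda}_1 = \cdots = \boldsymbol{\Lambda}_G \ \text{ and } \ \boldsymbol{\tau}_1 = \boldsymbol{\tau}_2 = \cdots = \boldsymbol{\tau}_G
\end{align*}
```

Each level keeps the constraints of the level before it, which is why the models are nested and can be compared with difference tests.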
Invariance analyses have been conducted on the Wechsler Intelligence Scale for Children, Fifth Edition (WISC-V; Wechsler, 2014a) using the 16 primary and secondary subtest normative sample data to determine equivalence across age and sex (reported as gender) in both US standardization samples (e.g., Reynolds & Keith, 2017; Scheiber, 2016) and international samples (e.g., Chen, Zhu, Liao, & Keith, 2020; Pauls, Daseking, & Petermann, 2019). However, only one invariance study has been conducted on the 10-primary subtest battery. Specifically, Graves, Smith, and Nichols (2020) investigated the invariance of the 10-primary subtest battery in a predominantly African American sample and found that the structure failed to attain full metric invariance. The lack of analyses on the 10-primary subtest battery is noteworthy considering that it contains the subtests most frequently administered by practitioners (Benson et al., 2019). Additionally, analyses on referred clinical samples (e.g., attention-deficit/hyperactivity disorder (ADHD), anxiety, and brain injury) are important but rarely conducted (Chen et al., 2020). Children referred for the evaluation of suspected disability are the ones most frequently administered tests of cognitive ability, and there are frequent calls for analyses with clinical samples (Chen et al., 2020; Graves et al., 2020), yet such analyses remain scarce in the literature. Accordingly, the purpose of this study was to investigate the measurement invariance of the 10 primary WISC-V subtests across sex (male/female), diagnostic group (ADHD, anxiety, and encephalopathy), and four age groups (6–8, 9–11, 12–14, and 15–16 year-olds) with a large clinical sample.
Method and Data Analyses
Demographic Characteristics of the Clinical Sample.
Diagnostic Categories of the Clinical Sample.
Note. ICD = international classification of diseases, tenth edition; ADD = attention deficit disorder; ADHD = attention-deficit/hyperactivity disorder
Descriptive Statistics for the WISC-V Clinical Sample.
Note. VCI = verbal comprehension index; VSI = visual spatial index; FRI = fluid reasoning index; WMI = working memory index; PSI = processing speed index; WISC-V = Wechsler intelligence scale for children, fifth edition; FSIQ = full scale intelligence quotient.
Model Fit of Competing Models and Separate Groups.
Note. Best index fit in bold. ∆ in comparison to best fit in total sample. Fit of higher order and bifactor five-factor models identical due to constraints imposed to identify bifactor model. The small number of indicators per factor makes it necessary to constrain loadings which, in turn, makes invariance tests of bifactor models not of the actual data but of a constrained version of the data. Bifactor 5-factor score = actual implied scoring structure of WISC-V where 7 subtests load g and 10 load the respective 5 factors. Bifactor 4 score = same as BF 5 score except fluid reasoning and visual-spatial subtests fused to form perceptual reasoning factor. Enceph. = encephalopathy; WISC-V = Wechsler intelligence scale for children, fifth edition. CFI = comparative fit index; TLI = Tucker–Lewis index; SRMR = standardized root mean square residual; RMSEA = root mean square error of approximation; AIC = Akaike information criterion; BIC = Bayesian information criterion.
Results
The comparison of baseline models (Table 4) suggested that an oblique five-factor first-order model, corresponding to the first-order WISC-V measurement structure (see Figure 1), provided the best statistical fit (S-Bχ² (25) = 222.2, p < .05; comparative fit index (CFI) = .993) to these data. In addition to having the best model fit among the competing models, the oblique five-factor model was selected because it is the model that guides how the test publisher recommends the instrument be scored and interpreted and, subsequently, how the majority of psychologists actually interpret the test in practice (Benson et al., 2019). One further point is essential. Although the publisher-preferred theoretical model is a conventional higher order model where the influence of general intelligence on the measured variables is fully mediated through the first-order group factors, the scoring structure of the WISC-V primary subtests does not reflect this model: only seven of 10 subtests load the general factor, while all 10 subtests contribute to their respective group factors. Because a higher order model requires the general intelligence factor to be fully mediated through the group factors, this scoring structure cannot be readily evaluated using higher order factor analysis. Thus, the test publisher never fully examined its promoted scoring structure for the WISC-V 10-primary subtest battery, nor had it been evaluated in the extant literature until this study. A bifactor approach can model this structure, albeit in a slightly different form than the bifactor models that have been previously investigated in the literature (see Table 4, “bifactor 5 score” for fit indices results).
Wechsler Intelligence Scale for Children, Fifth Edition baseline measurement model identified by the present clinical sample.
Invariance of the Five-Factor Oblique WISC-V Measurement Model: Gender, Age, and Diagnostic Group Samples.
Note. * Scalar versus metric S-B∆χ² (5) = 99.1, p < .0001. ** Scalar versus metric S-B∆χ² (15) = 279.2, p < .0001. *** Scalar versus metric S-B∆χ² (10) = 78.42, p < .0001. WISC-V = Wechsler intelligence scale for children, fifth edition. CFI = comparative fit index; TLI = Tucker–Lewis index; SRMR = standardized root mean square residual; RMSEA = root mean square error of approximation; AIC = Akaike information criterion; BIC = Bayesian information criterion.
Figure Weights intercept freely estimated.
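The degrees of freedom reported in this section can be checked by counting parameters. The sketch below assumes a conventional identification that the text does not spell out: one marker loading per factor is fixed to 1, and latent means are fixed to zero in a reference group and freed in the remaining groups once intercepts are equated. The function names are illustrative only. Under those assumptions the counts reproduce the 25 df of the five-factor oblique baseline model and the 5, 15, and 10 df of the three scalar-versus-metric comparisons, which is what two, four, and three groups would imply.

```python
from math import comb

N_SUBTESTS = 10  # primary battery subtests (from the text)
N_FACTORS = 5    # first-order factors (from the text)


def baseline_df(p: int = N_SUBTESTS, m: int = N_FACTORS) -> int:
    """df of a single-group oblique CFA with simple structure."""
    moments = p * (p + 1) // 2        # unique variances and covariances
    loadings = p - m                  # assumed: one marker loading per factor fixed to 1
    factor_var_cov = m + comb(m, 2)   # factor variances plus factor covariances
    residuals = p                     # one residual variance per subtest
    return moments - (loadings + factor_var_cov + residuals)


def scalar_vs_metric_ddf(groups: int, p: int = N_SUBTESTS,
                         m: int = N_FACTORS, freed_intercepts: int = 0) -> int:
    """Change in df when subtest intercepts are equated across groups.

    Assumed: latent means are zero in the reference group and freely
    estimated in the other groups once intercepts are constrained.
    `freed_intercepts` counts intercepts left free in every group
    (a partial scalar model).
    """
    constrained = (groups - 1) * (p - freed_intercepts)  # equality constraints added
    freed_means = (groups - 1) * m                       # latent means released
    return constrained - freed_means


if __name__ == "__main__":
    print("baseline df:", baseline_df())
    for label, g in [("sex", 2), ("age", 4), ("diagnostic group", 3)]:
        print(label, "scalar vs metric df:", scalar_vs_metric_ddf(g))
```

A different identification choice (for example, fixing factor variances instead of marker loadings) leaves these df unchanged, because each choice removes the same number of free parameters.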
Discussion
The investigation of measurement invariance requires consideration of whether an instrument’s factor structure, factor loadings, and intercepts are equivalent across groups when subjected to increasingly restrictive parameter constraints. The evaluation of invariance provides an opportunity to impose a psychometric stress test on the structure of an instrument (thereby further substantiating its structural validity beyond single-group analyses, if attained). With scalar invariance specification, where constraints are placed on intercepts, invariance determines whether an instrument’s scores (in this case, the WISC-V index scores) may be compared and whether resulting scores are not confounded by an artifact of the measurement structure.
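Why intercept constraints matter for score comparison can be shown in one line. The notation is again a conventional sketch and not the article's own: for group g, τ is the vector of subtest intercepts, Λ the loading matrix, and κ the vector of latent factor means, with residuals assumed to have mean zero.

```latex
\begin{align*}
E(\mathbf{x}_{g}) &= \boldsymbol{\tau}_g + \boldsymbol{\Lambda}_g\,\boldsymbol{\kappa}_g \\
E(\mathbf{x}_{1}) - E(\mathbf{x}_{2}) &= \boldsymbol{\Lambda}\,(\boldsymbol{\kappa}_1 - \boldsymbol{\kappa}_2)
  \quad \text{when } \boldsymbol{\Lambda}_1 = \boldsymbol{\Lambda}_2 = \boldsymbol{\Lambda} \text{ and } \boldsymbol{\tau}_1 = \boldsymbol{\tau}_2
\end{align*}
```

Only when both loadings and intercepts are equal does a difference in observed subtest means reduce to a difference in latent means. If an intercept differs between groups, part of the observed difference belongs to the instrument and not to the construct.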
The consideration of invariance is important to aid in better understanding the WISC-V 10-subtest primary battery. It has been argued that the WISC-V primary battery theoretical structure was essentially extrapolated from the 16-subtest battery (Dombrowski, Canivez, & Watkins, 2017). Thus, important information regarding the structure of the 10-subtest WISC-V is less available.
The present study evaluated numerous competing models and found that the five-factor oblique model, which also reflects the instrument’s scoring structure, had the best model fit (see Table 4) with this clinical sample. Invariance testing proceeded using this model as baseline. The results suggested that the WISC-V evidenced weak (metric) invariance across all groups (sex [male/female], age [6–8, 9–11, 12–14, and 15–16], and diagnostic group [ADHD, anxiety, and encephalopathy]) investigated. An evaluation of the strong (scalar) specification suggested that both sex and diagnostic groups attained full scalar invariance, while age groups attained partial scalar invariance. The finding of full or partial scalar invariance was consistent with previous research findings with the extended WISC-V 16 primary and secondary subtest battery (e.g., Chen et al., 2020; Pauls et al., 2019; Reynolds & Keith, 2017; Scheiber, 2016) and other intelligence tests including the Kaufman Assessment Battery for Children, Second Edition (Reynolds, Scheiber, Hajovsky, Schwartz, & Kaufman, 2015; Scheiber, 2017), Woodcock-Johnson (Edwards & Oakland, 2006; Keith, 1999), and Differential Ability Scales (Keith, Quirk, Schartzer, & Elliott, 1999). However, this is the first study to investigate invariance of the WISC-V 10-primary subtest battery across several groups (i.e., sex, age, and clinical diagnosis) with a referred sample more than double the size of the normative sample. The present study’s conclusions are limited by a lack of comparison to the standardization sample. Such a comparison would have offered another vantage point from which to assess invariance.
Conclusion and Implications for Practice
The present results have implications for interpretation of the broader measurement model in clinical practice and suggest that individuals from different sex, age, and clinical groups may have their index scores confidently compared to one another (Rudnev, Lytkina, Davidov, Schmidt, & Zick, 2018). This conclusion is particularly important for the clinical group comparison. There have been recent calls for structural validity and invariance analyses within clinical groups (Chen et al., 2020; Graves et al., 2020), but data sets such as the one in the present study are rarely available. In this case, although three distinctly different clinical groups were examined (an externalizing disorder [ADHD], an internalizing disorder [anxiety], and a neurologically based disorder [encephalopathy]), the scoring structure was either fully or partially invariant and functioned the same way regardless of sex, age, or clinical group. Stated another way, users of the WISC-V 10-primary subtest battery can be more confident that a score obtained on one of the WISC-V indices reflects the performance of a group member and is unrelated to statistical distortions in the measurement instrument due to age, sex, or clinical condition. In sum, the attainment of configural, metric, and full or partial scalar invariance across the three grouping variables with this clinical sample lends evidentiary support for the viability of the WISC-V first-order factors in clinical practice and suggests that the 10-subtest WISC-V primary battery measures the intended first-order constructs with this sample in the way proposed by the publisher.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
