Abstract
Measurement invariance of the Wechsler Intelligence Scale for Children–Fourth Edition (WISC-IV) was investigated with a group of 352 students eligible for psychoeducational evaluations tested, on average, 2.8 years apart. Configural, metric, and scalar invariance were found. However, the error variance of the Coding subtest was not constant across time, allowing only partial strict invariance. These results indicate that (a) the WISC-IV was measuring similar constructs at both test occasions, (b) the constructs had the same meaning across time, (c) scores that changed across time can be attributed to change in the constructs being measured and not to changes in the structure of the test itself, and (d) the WISC-IV measured the same constructs equally well across time, with the possible exception of Processing Speed due to the noninvariance of the Coding subtest’s residual variance. This investigation provided support for intelligence as an enduring trait and for the validity of the WISC-IV.
Of all psychological tests, standardized intelligence tests are some of the most widely used by psychologists (Wilson & Reschly, 1996). School psychologists in particular often use standardized intelligence tests as one component of a psychoeducational evaluation for the determination of special education eligibility (Suzuki & Valencia, 1997). Among the available standardized intelligence tests, the Wechsler Intelligence Scale for Children–Fourth Edition (WISC-IV; Wechsler, 2003a) is the most widely used (Strauss, Sherman, & Spreen, 2009). Given that special education eligibility decisions result in relatively long-term placements and may not be beneficial to some children (Morgan, Frisco, Farkas, & Hibel, 2010), strong construct validity evidence is especially important.
As intelligence is thought to be an enduring trait (Hunt, 2011), the WISC-IV should evince similar factor structures over time to ensure that the same traits are being measured with equal accuracy across time (Dimitrov, 2010). Unfortunately, there have been only four longitudinal factor analyses of WISC scores during the past 45 years. In the first, the WISC factor structure was investigated with a sample of 153 preschool-age children who were administered the WISC and followed up 1 year later with another administration of the WISC (Osborne, 1965). An exploratory factor analysis (EFA) with varimax rotation indicated that the factor structure changed from preschool to first grade. Specifically, there were 8 factors at the first administration and 10 factors at the second administration. However, this study included children who were not of appropriate age for the WISC. In addition, the methodology of this study is problematic because the subtests were split into two, three, or four parts to create additional variables and the EFA methods were suboptimal (Gorsuch, 2003). Because of these limitations, the results of this study should be regarded with caution. Similar techniques and results were reported by Osborne, Anderson, and Bashaw (1967) for the WISC, with the same fatal limitations.
In the third study, the WISC-R factor structure was examined using a longitudinal design with a sample of children (N = 322) eligible for special education services across a span of approximately 3 years (Juliano, Haddad, & Carroll, 1988). This study enrolled children who were identified as either White or Black; other ethnicities were not included. Results indicated that for students who were administered the Digit Span subtest at test and retest (n = 229), a three-factor solution (Verbal, Perceptual, and Freedom from Distractibility) was identified for all groups. Coefficients of congruence were used to quantify similarity between groups, and indicated that the three-factor solution remained stable for children with learning disabilities across the 3-year time span regardless of sex or ethnicity.
The fourth longitudinal factor analysis investigated the factor structure of the WISC-III with 177 students classified as having a specific learning disability (SLD), a serious emotional disability (SED), mental retardation (MR), or other disabilities (Watkins & Canivez, 2001). These students were twice administered the WISC-III approximately 3 years apart. Four models were initially evaluated using confirmatory factor analysis (CFA), and the first-order, four-factor model was accepted as the best fitting model for both test and retest occasions. Test and retest data were also analyzed for invariance of the factor structure across time. Initially, all factor loadings, factor variances, factor covariances, and subtest error variances were constrained to be equal; however, this model had inferior fit in comparison with a baseline model. It was determined that this misfit was likely due to the error variances for three subtests (Vocabulary, Coding, and Arithmetic). Upon releasing those constraints, the model fit was significantly improved. These results indicated that the WISC-III measured the same constructs across time and that the constructs were manifested in the same way across groups.
There have been no investigations of the longitudinal factorial invariance of the WISC-IV. Cross-sectional analysis of the WISC-IV has supported the assumption of longitudinal invariance (Keith, Fine, Taub, Reynolds, & Kranzler, 2006), but cross-sectional analyses may not be adequate for detecting change over time (Willett, Singer, & Martin, 1998). Thus, there is no evidence regarding the factorial invariance of the WISC-IV across time for the same individuals. If longitudinal factorial invariance exists, differences in obtained WISC-IV test–retest scores can be unequivocally attributed to respondents changing on the underlying constructs being measured. In the absence of longitudinal factorial invariance, WISC-IV-obtained test–retest scores cannot be compared because changes in test scores could be due to a myriad of reasons other than changes in the respondents’ standing on the underlying constructs (Dimitrov, 2010). In that situation, the use of WISC-IV scores for identification of children with disabilities would be suspect. Therefore, the current study will use CFA techniques to examine the temporal stability of the factor structure of the WISC-IV in a clinically referred sample.
Method
Participants
Three hundred fifty-two students (66% males) who were twice administered the WISC-IV, with all 10 core subtests administered at each test session, served as participants in the current study. Participant ages ranged from 6.1 to 14.11 years at first testing and 7.5 to 16.6 years at second testing with an average test–retest interval of 2.84 years. Reported ethnic breakdown of the sample was 79% White, 11% Hispanic, 6% Black, and 4% Other. Special education placement was determined by local multidisciplinary evaluation teams following state regulations. Special education diagnosis on initial evaluation included 66% SLD, 9% other health impairment (OHI; attention-deficit/hyperactivity disorder [ADHD]), 8% SED, 5% nonhandicapped, 4% autism, 2% MR, 3% OHI (non-ADHD), and 3% other. To preserve respondents’ privacy, no other information was collected.
Instrument
The WISC-IV is an individually administered intelligence test for children between the ages of 6 and 16 years. The WISC-IV consists of 15 subtests, 10 core and 5 supplemental, each with a mean of 10 and a standard deviation of 3. The 10 core subtests are used to form a Full Scale Intelligence Quotient (FSIQ) score as well as four index scores: Verbal Comprehension Index (VCI; Similarities, Vocabulary, and Comprehension), Perceptual Reasoning Index (PRI; Block Design, Matrix Reasoning, and Picture Concepts), Working Memory Index (WMI; Digit Span and Letter-Number Sequencing), and Processing Speed Index (PSI; Coding and Symbol Search). The FSIQ and index scores have a mean of 100 and a standard deviation of 15.
There has been some debate about the factor structure of the WISC-IV. The technical manual reported that a first-order, four-factor oblique structure fit the core subtests the best (Wechsler, 2003b), mapping onto the VCI, WMI, PSI, and PRI index scores. Other studies have found that a higher order (Keith et al., 2006) or bifactor (Watkins, 2006) general intelligence factor (g) should also be considered, as it explained more of the subtest covariance than any first-order factor (Bodin, Pardini, Burns, & Stevens, 2009; Watkins, 2006, 2010; Watkins, Wilson, Kotz, Carbone, & Babula, 2006).
Procedure
Following Institutional Review Board (IRB) and school district approval, special education files in two participating Southwestern school districts were reviewed and relevant WISC-IV scores were extracted. In total, there were 457 students who were twice administered the WISC-IV. However, only 352 students had complete subtest scores at both test and retest. School district demographics were collected from information provided by the National Center for Educational Statistics (2012). The first district comprised approximately 84% non-Hispanic or Latino students, with 6% of its students identified as English Language Learners. The second district comprised approximately 88% non-Hispanic or Latino students, with 4% of its students identified as English Language Learners.
Analyses
Model specification
CFA will allow a robust examination of the invariance of the WISC-IV structure across time (Byrne & Stewart, 2006; Millsap & Cham, 2012). When examining factorial invariance, the first step is to determine the baseline factor structure within each testing occasion (van de Schoot, Lugtig, & Hox, 2012). For this study, we tested three models: (a) four oblique first-order factors representing the VCI, PRI, WMI, and PSI; (b) a higher order factor model with one second-order factor and four first-order factors; and (c) a bifactor model with one general factor and four orthogonal domain-specific factors. For bifactor model identification, we constrained the loadings for the WMI subtests to be equal and the loadings for the PSI subtests to be equal.
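As a rough illustration of how the three baseline models differ in parsimony, the following sketch counts free covariance-structure parameters and degrees of freedom for 10 subtests. The identification choices are assumptions made for the sketch (factor variances fixed to one in the oblique and bifactor models; one marker loading per first-order factor in the higher order model) and are not necessarily those used in the study, so the counts should not be read as the values reported in Table 4.

```python
# Sketch: degrees of freedom for three candidate models of 10 subtests.
# Identification choices below are assumptions, not taken from the article.

N_SUBTESTS = 10
# Unique variances and covariances among the 10 subtests.
N_MOMENTS = N_SUBTESTS * (N_SUBTESTS + 1) // 2


def degrees_of_freedom(free_parameters: dict) -> int:
    """Return moments minus the total number of free parameters."""
    return N_MOMENTS - sum(free_parameters.values())


# (a) Four oblique first-order factors, factor variances fixed to 1.
oblique = {
    "loadings": 10,
    "residual_variances": 10,
    "factor_covariances": 6,
}

# (b) Higher order model: one marker loading fixed per first-order factor,
#     variance of the second-order factor fixed to 1.
higher_order = {
    "first_order_loadings": 6,
    "second_order_loadings": 4,
    "disturbance_variances": 4,
    "residual_variances": 10,
}

# (c) Bifactor model: orthogonal factors with variances fixed to 1; the two
#     WMI loadings share one value and the two PSI loadings share one value.
bifactor = {
    "general_loadings": 10,
    "vci_loadings": 3,
    "pri_loadings": 3,
    "wmi_shared_loading": 1,
    "psi_shared_loading": 1,
    "residual_variances": 10,
}

if __name__ == "__main__":
    for name, params in [("oblique", oblique),
                         ("higher order", higher_order),
                         ("bifactor", bifactor)]:
        print(name, degrees_of_freedom(params))
```

Under these assumptions the higher order model is the most restrictive and the bifactor model the least, which is why the equality constraints on the two-indicator factors matter for identifying the bifactor model.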
Testing invariance
Testing invariance across time is similar to testing invariance across groups, except the covariances between like indicator variables’ uniquenesses and common factors across measurement occasions are sometimes included in the model due to domain-specific covariance not accounted for by the factor model (McArdle, 2009). Consequently, Vandenberg and Lance (2000) noted that there are two ways to assess measurement invariance with longitudinal data. The first is to treat the data at the different occasions as if they came from two separate groups and conduct invariance assessment as a typical multigroup model. Although this model is the more parsimonious of the two, it cannot account for correlated residuals or factors across time.
The second approach is to treat the data as if they come from a single sample, similar to traditional repeated measures ANOVAs. This way of assessing invariance posits as many factor models as there are time points, and allows across-occasion covariances for each indicator’s residual variance and each common factor. A disadvantage of this approach is that the input covariance matrix is made up of both the within-occasion and between-occasion covariances, which sometimes results in poor model fit and improper solutions. However, the same levels of invariance are investigated for either the single-sample or multiple-group approach (see Table 1).
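The across-occasion covariances that distinguish the single-sample approach can be enumerated mechanically. The sketch below generates them as lavaan-style `~~` statements; the variable names (subtest abbreviations with hypothetical `_t1` and `_t2` suffixes, and the factor labels) are assumptions chosen for illustration, not the names used in the original analysis.

```python
# Sketch: list the across-occasion covariances freed in the single-sample
# approach. Names and suffixes are hypothetical.

SUBTESTS = ["SI", "VC", "CO", "BD", "PCn", "MR", "DS", "LN", "CD", "SS"]
FACTORS = ["g", "VCI", "PRI", "WMI", "PSI"]


def across_occasion_covariances(names, first="_t1", second="_t2"):
    """Pair each variable at occasion 1 with the same variable at occasion 2."""
    return [f"{name}{first} ~~ {name}{second}" for name in names]


# Residual covariances between like indicators across time.
residual_covariances = across_occasion_covariances(SUBTESTS)
# Covariances between like common factors across time.
factor_covariances = across_occasion_covariances(FACTORS)

if __name__ == "__main__":
    print("\n".join(residual_covariances + factor_covariances))
```

With 10 subtests and 5 factors this adds 15 parameters that the multiple-group approach cannot represent, which is the trade-off against parsimony described above.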
Table 1. Levels of Measurement Invariance.
Determining model fit
Researchers (e.g., Byrne & Stewart, 2006) have suggested two sets of criteria for testing factorial invariance. The traditional perspective examines the change in χ2 (Δχ2) across nested models. If, as the models grow more restrictive, the χ2 values do not significantly change (using a given α level), this is evidence that the more restrictive model fits the data as well as the less restrictive model; thus, the more restrictive (i.e., more parsimonious) model should be favored over the less restrictive one.
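The traditional nested-model comparison can be sketched as follows. This is a minimal illustration using only the Python standard library, with the chi-square upper-tail probability computed from the recurrence for the regularized incomplete gamma function; it uses unscaled χ2 values, whereas the study reports scaled statistics. The function names and the example numbers are hypothetical.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)                # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))     # Q(1/2, y)
    while a < df / 2.0:
        # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def chi2_difference_test(chi2_restricted, df_restricted,
                         chi2_free, df_free, alpha=0.01):
    """Compare a more restrictive model with the less restrictive one.

    Returns (delta_chi2, delta_df, p_value, retain_restricted).
    """
    delta_df = df_restricted - df_free
    if delta_df <= 0:
        raise ValueError("the restricted model must have more df")
    delta_chi2 = chi2_restricted - chi2_free
    p_value = chi2_sf(delta_chi2, delta_df)
    return delta_chi2, delta_df, p_value, p_value >= alpha


if __name__ == "__main__":
    # Hypothetical values, not taken from the article's tables.
    print(chi2_difference_test(310.0, 160, 300.0, 150))
```

A nonsignificant difference (p at or above α) favors the more restrictive model; the default of α = .01 is an assumption that mirrors the level stated in the note to Table 4.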
The use of χ2 values has been criticized because of their sensitivity to sample size (Byrne & Stewart, 2006). Cheung and Rensvold (2002) and Meade, Johnson, and Braddy (2008) argued that some alternative fit indices (AFIs) are not as susceptible to this problem. Specifically, they found that the comparative fit index (CFI) and McDonald’s (1989) noncentrality index (Mc) were more robust. Thus, the second set of evaluation criteria takes a practical perspective and recommends that invariance be based on two criteria: (a) the multigroup factor model exhibits an adequate fit to the data, and (b) the change in values for AFIs (e.g., ΔCFI, ΔMc) is negligible.
Based on Byrne and Stewart’s (2006) recommendations, this study used two sets of fit indices: one to assess overall model fit and the other to assess change in model fit between two models. As Hu and Bentler (1999) suggested, we used multiple fit indices for both. For this study’s criteria of overall model-data fit, we used the following: (a) root mean square error of approximation (RMSEA) ≤ .08; (b) standardized root mean square residual (SRMR) ≤ .08, and (c) CFI ≥ .96 (Hu & Bentler, 1999; Yu, 2002). To test the change in fit between nested models, we used the ΔCFI and ΔMc (Meade et al., 2008). Cheung and Rensvold (2002) suggested .01 as the threshold for ΔCFI and .02 as the threshold for ΔMc.
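For readers who want to see how these indices relate to the χ2 statistic, the sketch below computes RMSEA, CFI, and Mc from their common textbook formulas and checks the overall-fit cutoffs listed above. It assumes a single-sample model and uses N − 1 in the denominators; software packages differ on this detail (some use N), so values may not match lavaan output exactly. SRMR is taken as an input because it requires the observed and model-implied matrices. All function names are hypothetical.

```python
import math


def rmsea(chi2: float, df: int, n: int) -> float:
    """Root mean square error of approximation (single-sample form)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model: float, df_model: int,
        chi2_baseline: float, df_baseline: int) -> float:
    """Comparative fit index relative to the independence (baseline) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_baseline = max(chi2_baseline - df_baseline, d_model)
    return 1.0 if d_baseline == 0 else 1.0 - d_model / d_baseline


def mcdonald_nci(chi2: float, df: int, n: int) -> float:
    """McDonald's noncentrality index (Mc)."""
    return math.exp(-0.5 * (chi2 - df) / (n - 1))


def meets_overall_criteria(rmsea_value: float, srmr_value: float,
                           cfi_value: float) -> dict:
    """Apply the cutoffs stated in the text: RMSEA and SRMR <= .08, CFI >= .96."""
    return {
        "RMSEA": rmsea_value <= 0.08,
        "SRMR": srmr_value <= 0.08,
        "CFI": cfi_value >= 0.96,
    }


if __name__ == "__main__":
    # Hypothetical chi-square values; only N = 352 comes from the article.
    r = rmsea(54.0, 27, 352)
    c = cfi(54.0, 27, 1027.0, 45)
    print(r, c, mcdonald_nci(54.0, 27, 352))
    print(meets_overall_criteria(r, 0.03, c))
```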
For both overall model fit as well as change in model fit, we looked for patterns in the fit statistics and judged acceptance/rejection of the specific model based on the majority of the indices. All analyses were done in R (R Development Core Team, 2012) using the lavaan (Rosseel, 2012) and psych (Revelle, 2012) statistical packages.
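One way to make the majority rule for change in fit concrete is sketched below. It is an interpretation rather than the authors' exact procedure: each of the three change statistics casts one vote, a drop in CFI greater than .01 or in Mc greater than .02 counts as worse fit, and the significance level of .01 for Δχ2 is an assumption. The function and argument names are hypothetical.

```python
def change_in_fit(cfi_restricted: float, cfi_free: float,
                  mc_restricted: float, mc_free: float,
                  p_value_delta_chi2: float,
                  alpha: float = 0.01,
                  cfi_cut: float = 0.01,
                  mc_cut: float = 0.02):
    """Flag which change statistics suggest the restricted model fits worse.

    Returns (flags, fits_worse), where fits_worse is True when more than
    half of the indices flag a decline in fit.
    """
    flags = {
        "delta_chi2": p_value_delta_chi2 < alpha,
        "delta_cfi": (cfi_free - cfi_restricted) > cfi_cut,
        "delta_mc": (mc_free - mc_restricted) > mc_cut,
    }
    fits_worse = sum(flags.values()) > len(flags) / 2
    return flags, fits_worse


if __name__ == "__main__":
    # Hypothetical values for two nested invariance models.
    print(change_in_fit(0.975, 0.978, 0.950, 0.955, 0.20))
    print(change_in_fit(0.950, 0.975, 0.900, 0.950, 0.001))
```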
Results
Data Inspection
Descriptive statistics for WISC-IV subtest, factor, and IQ scores at test and retest for this referred sample are reported in Table 2 and correlations between subtests at test and retest are provided in Table 3. These results indicate that the current sample exhibited slightly lower and more variable scores than the normative sample of the WISC-IV (Wechsler, 2003b). Similar score patterns have been observed in other clinical samples (Watkins et al., 2006). The univariate score distributions from the current sample appear to be relatively normal across both test administrations (West, Finch, & Curran, 1995). In addition, examination of each variable’s associated histogram indicated that the sample appears to generally follow the shape of a normal distribution. Nevertheless, we used maximum likelihood parameter estimators with standard errors and a mean-adjusted chi-square test statistic that are robust to nonnormality (Satorra & Bentler, 2001).
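Because a difference between two mean-adjusted χ2 values is not itself χ2 distributed, nested comparisons with robust statistics rely on the scaled difference of Satorra and Bentler (2001), which the note to Table 6 indicates was used here. A minimal sketch of that computation follows; it takes the unscaled maximum likelihood χ2 values and the scaling correction factors as inputs, and the function name and example numbers are hypothetical.

```python
def scaled_chi2_difference(ml_restricted: float, df_restricted: int,
                           c_restricted: float,
                           ml_free: float, df_free: int, c_free: float):
    """Satorra-Bentler (2001) scaled chi-square difference.

    ml_*: unscaled ML chi-square values (scaled value times its correction).
    c_*:  scaling correction factors reported with the robust statistic.
    Returns (scaled_difference, delta_df).
    """
    delta_df = df_restricted - df_free
    if delta_df <= 0:
        raise ValueError("the restricted model must have more df")
    # Scaling correction for the difference test.
    cd = (df_restricted * c_restricted - df_free * c_free) / delta_df
    if cd <= 0:
        raise ValueError("non-positive difference scaling correction")
    return (ml_restricted - ml_free) / cd, delta_df


if __name__ == "__main__":
    # Hypothetical inputs, not taken from the article's tables.
    print(scaled_chi2_difference(120.0, 30, 1.2, 100.0, 25, 1.1))
```

The resulting statistic is referred to a χ2 distribution with the difference in degrees of freedom. In some samples the correction term can be zero or negative, in which case this form of the test is not usable; the sketch raises an error rather than returning a misleading value.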
Table 2. Descriptive Statistics for Wechsler Intelligence Scale for Children–Fourth Edition (WISC-IV) Scores of 352 Students Twice-Tested for Special Education Eligibility.
Note. IQ = intelligence quotient.
Table 3. Correlations of WISC-IV Subtests at Test and Retest.
Note. Test correlations are in the upper triangle, Retest correlations are in the lower triangle, and test–retest correlations are on the diagonal. WISC-IV = Wechsler Intelligence Scale for Children–Fourth Edition; VC = Vocabulary; SI = Similarities; CO = Comprehension; BD = Block Design; PCn = Picture Concepts; MR = Matrix Reasoning; DS = Digit Span; LN = Letter-Number Sequencing; CD = Coding; SS = Symbol Search.
Factor Models
Table 4 contains the fit statistics for the three alternative models within each testing occasion. Not unexpectedly, the models fit relatively similarly at both time points (Murray & Johnson, 2013). Chen, West, and Sousa (2006) suggested that when examining invariance, the bifactor model is better than a second-order factor model because the bifactor model allows for tests of invariance of the domain-specific factors as well as the general factor. In contrast, a second-order model only allows for direct tests of invariance for the second-order factor, as the first-order factors are represented by disturbances. Consequently, we chose the bifactor model to use for the invariance assessment. Figure 1 displays the bifactor model. This choice was corroborated by an EFA as per Carroll (1993), which produced similar orthogonal structures.
Table 4. Fit Statistics for Alternative Baseline Models at Test and Retest.
Note. All statistics based on scaled χ2 statistic and significant using α = .01. CFI = comparative fit index; RMSEA = root mean square error of approximation; SRMR = standardized root mean square residual; Mc = McDonald’s noncentrality index.

Figure 1. Bifactor model with orthogonal domain-specific group factors.
Invariance
The first step in testing the measurement invariance hierarchy was to assess configural invariance using both the single-sample and multiple-group approaches (van de Schoot et al., 2012). Initially, we tested for configural invariance using the multiple-group approach, which does not allow residual or factor variances to covary across time (see Model 1a in Table 5). Although the χ2 value was statistically significant, the AFIs indicated that the model fit the data relatively well. Subsequently, we examined configural invariance using the single-sample approach, allowing the common factors and residual variances from the same indicators to covary across the two time points (Model 1b). For all fit indices except Mc, the single-sample model showed a better fit to the data than the multiple-group model. Consequently, we used the single-sample model as our baseline for subsequent tests of invariance.
Table 5. Fit Statistics for Invariance Models.
Note. All statistics based on scaled χ2 statistic. CFI = comparative fit index; RMSEA = root mean square error of approximation; SRMR = standardized root mean square residual; Mc = McDonald’s noncentrality index.
We next examined metric/weak invariance (van de Schoot et al., 2012), which constrains the factor loadings to be equal between groups (Model 2). This analysis allows factor variances between groups to vary, so we constrained the following loadings for identification: (a) Vocabulary and Coding were constrained to one for the domain-specific factors; (b) Similarities was constrained to one for the general factor; and (c) both loadings for the WM factor and the PS factor were constrained to one because each factor comprised only two subtests. The values for the CFI, RMSEA, and SRMR indices indicated that this model fit the data relatively well. Moreover, the Δχ2, ΔCFI, and ΔMc values indicated that the model did not fit worse than Model 1b using the Cheung and Rensvold (2002) criteria (see Table 6). Substantiation of metric invariance was also obtained from the EFA (Horn & McArdle, 1992; Lorenzo-Seva & ten Berge, 2006), with congruence coefficients that ranged from good (.97) to excellent (.99) according to the guidelines provided by MacCallum, Widaman, Zhang, and Hong (1999).
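The congruence coefficients mentioned above compare two columns of factor loadings, here the same factor at test and at retest. A minimal sketch of Tucker's congruence coefficient follows; the loading values in the example are hypothetical and are not the study's estimates.

```python
import math


def congruence(loadings_a, loadings_b) -> float:
    """Tucker's coefficient of congruence between two loading vectors."""
    if len(loadings_a) != len(loadings_b):
        raise ValueError("loading vectors must have the same length")
    cross = sum(a * b for a, b in zip(loadings_a, loadings_b))
    norm = math.sqrt(sum(a * a for a in loadings_a) *
                     sum(b * b for b in loadings_b))
    if norm == 0:
        raise ValueError("a loading vector is all zeros")
    return cross / norm


if __name__ == "__main__":
    # Hypothetical loadings of 10 subtests on one factor at two occasions.
    test = [0.72, 0.75, 0.66, 0.58, 0.55, 0.61, 0.52, 0.60, 0.35, 0.44]
    retest = [0.74, 0.78, 0.69, 0.60, 0.51, 0.63, 0.50, 0.62, 0.30, 0.46]
    print(round(congruence(test, retest), 3))
```

The coefficient equals 1 when the two vectors are proportional, so it is sensitive to the pattern of loadings rather than to their overall size.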
Table 6. Change in Fit Statistics for Invariance Models.
Note. See Table 1 for invariance model descriptions. Δχ2 based on scaled difference (Satorra & Bentler, 2001). CFI = comparative fit index; Mc = McDonald’s noncentrality index.
A number of measurement researchers agree that achieving both configural and metric factorial invariance is enough evidence to determine that a measure is invariant across time (Bentler, 2005; Widaman & Reise, 1997) and that further invariance testing is discretionary (Vandenberg & Lance, 2000; Wu, Li, & Zumbo, 2007) or unwarranted (Selig, Card, & Little, 2008). Others believe that strict invariance is required, especially when tests are used for individual decisions (Meredith & Teresi, 2006). Accordingly, this study continued to evaluate measurement invariance by addressing both strong/scalar and strict levels of invariance.
We next examined scalar/strong invariance (van de Schoot et al., 2012), which constrains the manifest variables’ intercepts between groups but allows the latent variables’ means to differ between groups (Model 3). All the AFIs indicated that the model fit the data relatively well. Moreover, the Δχ2, ΔCFI, and ΔMc values all indicated that the model fit no worse than the metric invariance model.
Next, we tested the latent variables’ variances across test and retest (van de Schoot et al., 2012). We constrained all the latent variables’ variances to be one and allowed the loadings for the WMI and PSI factors to take values other than one. All the model fit indices indicated that the model fit the data relatively well. The Δχ2, ΔCFI, and ΔMc values all indicated that the model fit no worse than the scalar invariance model.
The strict invariance model, which constrains the residual variances across groups (van de Schoot et al., 2012), was tested next (Model 5). The SRMR and RMSEA indicated that the model fit the data relatively well, but the Δχ2, ΔCFI, and ΔMc values indicated that the model fit worse than the previous model. Thus, the model does not appear to have complete strict invariance across time. An examination of the residual variances found the Coding subtest to be most disparate. We removed the equality constraints for the Coding subtest and refit the model (Model 5a). All the model fit indices indicated that the revised model fit the data relatively well. The Δχ2, ΔCFI, and ΔMc values all indicated that the model fit no worse than the prior test of latent variances. Thus, the model exhibited partial strict invariance, indicating that any differences between the means and variances of the WISC-IV subtests were due solely to differences in the constructs that they measure. Thus, with the exception of the Coding subtest, “all group differences on the measured variables are captured by, and attributable to, group differences on the common factors” (Widaman & Reise, 1997, p. 296). The final model is illustrated in Figure 1.
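The sequence of constraints tested above can be summarized in conventional CFA notation. The notation is an assumption introduced here for illustration and is not the authors' own: for occasion t, x_t is the vector of the 10 subtest scores, τ_t the intercepts, Λ_t the loadings, η_t the common factors with covariance matrix Ψ_t, and ε_t the residuals with diagonal covariance matrix Θ_t.

```latex
\begin{align*}
\mathbf{x}_{t} &= \boldsymbol{\tau}_{t} + \boldsymbol{\Lambda}_{t}\,\boldsymbol{\eta}_{t}
                 + \boldsymbol{\varepsilon}_{t}, \qquad t = 1, 2 \\
\operatorname{Cov}(\boldsymbol{\eta}_{t}) &= \boldsymbol{\Psi}_{t}, \qquad
\operatorname{Cov}(\boldsymbol{\varepsilon}_{t}) = \boldsymbol{\Theta}_{t} \\[4pt]
\text{Configural:}\quad & \boldsymbol{\Lambda}_{1} \text{ and } \boldsymbol{\Lambda}_{2}
                          \text{ have the same pattern of fixed and free loadings} \\
\text{Metric (weak):}\quad & \boldsymbol{\Lambda}_{1} = \boldsymbol{\Lambda}_{2} \\
\text{Scalar (strong):}\quad & \boldsymbol{\Lambda}_{1} = \boldsymbol{\Lambda}_{2},\;
                               \boldsymbol{\tau}_{1} = \boldsymbol{\tau}_{2} \\
\text{Latent variances:}\quad & \text{additionally }
                               \operatorname{diag}(\boldsymbol{\Psi}_{1}) = \operatorname{diag}(\boldsymbol{\Psi}_{2}) \\
\text{Strict:}\quad & \text{additionally } \boldsymbol{\Theta}_{1} = \boldsymbol{\Theta}_{2} \\
\text{Partial strict:}\quad & \theta_{jj,1} = \theta_{jj,2} \text{ for every subtest } j
                              \text{ except Coding}
\end{align*}
```

Each level adds its constraints to those of the level before it, so the partial strict model retains equal loadings, intercepts, and latent variances while freeing only the Coding residual variance.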
Discussion
The goal of the current study was to investigate measurement invariance of the WISC-IV for a group of 352 students eligible for psychoeducational evaluations tested, on average, 2.8 years apart. Using CFA methods, the bifactor model exhibited partial strict invariance across time, with the error variance of the Coding subtest being the only residual variance that differed across time.
Verification of configural invariance indicates that the same factor structure was maintained across time. Thus, there was the same number of latent variables, indicator variables, and pattern of fixed and estimated parameters at both test and retest. This indicates that the WISC-IV was measuring similar constructs at both test and retest occasions. Configural invariance is considered to be the least restrictive test of similarity of factors across time (Dimitrov, 2010).
The achievement of metric invariance means that corresponding factor loadings (i.e., pattern coefficients) were equivalent across time. That is, each subtest loaded equivalently on its respective factors at both test and retest occasions. Thus, the constructs being measured were equivalent at both test and retest. This provides evidence that the observed WISC-IV scores (e.g., FSIQ, VCI, PRI, etc.) were assessing factors of the WISC-IV (e.g., g, VC, PR, etc.) in the same way at both test and retest (Horn & McArdle, 1992; Wu et al., 2007).
Attaining scalar/strong invariance indicates that factor means and variances can be compared across time (Dimitrov, 2010). Therefore, any change in observed WISC-IV test scores (e.g., FSIQ, VCI, PRI, etc.) across time can be attributed to change in the constructs being measured (e.g., g, VC, PR, etc.) and not to changes in the structure of the test itself. Thus, students with the same ability at either test occasion achieved the same manifest scores on the WISC-IV, allowing valid comparisons of mean scores and correlations across groups (Horn & McArdle, 1992).
A model with partial strict invariance indicates that the latent variables the WISC-IV is measuring, with the possible exception of Processing Speed due to the noninvariance of the Coding subtest’s residual variance, were measured with equal precision at both test occasions. The error variance of the WISC-III Coding subtest was also found to lack longitudinal invariance (Watkins & Canivez, 2001). Thus, differences in WISC-IV obtained scores across time (with the exception of Coding) were due to differences in their latent means (Dimitrov, 2010). This supports the hypothesis that the WISC-IV measures the same constructs equally well across time.
Limitations
As with all research, there are a number of limitations in the current study. The greatest of these limitations is the sample. Although a sample of 352 students is typically considered to be large, this is a relatively small sample for factorial invariance testing of complex structures. Ideally, a larger sample is desired when completing these types of analyses (Byrne, 2012). In addition, the sample used in this study was from two school districts in one Southwestern state, and thus the results may not generalize to other regions. Furthermore, the sample consisted solely of students twice referred for a psychoeducational evaluation for special education eligibility. The characteristics that resulted in two WISC-IV administrations may have been unique. A final limitation of this study is the method of data collection. Because the data were collected from archived special education records, the administration and recording accuracy of the individual psychologists who administered the WISC-IV had to be assumed.
Conclusion
Although the longitudinal structural stability of the WISC-IV has not previously been investigated, cross-sectional measurement has found it to be consistent across ages 6 to 16 years (Keith et al., 2006). Likewise, the temporal stability of other cognitive ability test scores has been demonstrated with children (Watkins & Canivez, 2001) as well as adults (Reeve & Bonaccio, 2011). The current study demonstrated that changes in WISC-IV scores across time can be attributed to change in the constructs being measured and not to change in the structure of the test itself. These results provide support for intelligence as an enduring trait (Hunt, 2011) and for the validity of the WISC-IV. However, obtained factor index scores are not pure measures of their underlying constructs because each obtained index score is influenced by g as well as error. For example, about 60% of the variance in the VCI score is due to g (Schneider, 2013). This complex relationship between latent and obtained scores should be considered when interpreting WISC-IV subtest, index, and full scale scores (DeMars, 2013).
Acknowledgements
The contributions of Dr. John Balles and Dr. Christa Lynch are gratefully acknowledged.
Authors’ Note
This study is based on the dissertation of the first author. Dr. Richerson is now with the Scottsdale Unified School District, Scottsdale, Arizona.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
