Abstract
The fourth edition of the Wechsler Adult Intelligence Scale (WAIS-IV) is a revised and substantially updated version of its predecessor. The purposes of this research were to determine the constructs measured by the test and the consistency of measurement across large normative and clinical samples. Competing higher order WAIS-IV four- and five-factor models were analyzed using the WAIS-IV’s sample of 1,800 normative adults and 411 clinical adults. When all 15 WAIS-IV subtests were considered, both four- and five-factor models were suitable, but the five-factor model provided a better fit. The WAIS-IV PRI differentiated into two composites as follows: POI(Gv) consisting of Block Design, Visual Puzzles, and Picture Completion; and FRI(Gf) consisting of Matrix Reasoning, Arithmetic, and Figure Weights. The five-factor solution included Quantitative Reasoning (RQ), consisting of Arithmetic and Figure Weights, as a narrow ability subsumed under FRI(Gf). Arithmetic, Vocabulary, and Figure Weights subtests had the highest g loadings. Cancellation had the lowest g loading. The WAIS-IV generally demonstrated full factor invariance between clinical and nonclinical samples.
Wechsler tests are by far the most popular tests of intelligence in the world (Camara, Nathan, & Puente, 2000; Georgas, van de Vijver, Weiss, & Saklofske, 2003; Lichtenberger & Kaufman, 2009). The fourth edition of the Wechsler Adult Intelligence Scale (WAIS-IV; Wechsler, 2008a, 2008b) is a revised and updated version of its predecessor. It provides a general intelligence composite score and four scores corresponding to first-order factors: Verbal Comprehension (VCI), Perceptual Reasoning (PRI), Working Memory (WMI), and Processing Speed (PSI).
Seventy years ago, Wechsler (1939) included aspects from two primary theories of intelligence, the views of Spearman and Thorndike, and defined intelligence in practical terms as
the aggregate or global capacity of the individual to act purposefully, to think rationally, and to deal effectively with his [or her] environment. It is global because it characterizes the individual’s behavior as a whole; it is an aggregate because it is composed of elements or abilities which, though not entirely independent, are qualitatively differentiable. (Wechsler, 1939; 1944, p. 3)
Wechsler clearly acknowledged the existence of a global intelligence and that global intelligence is composed of qualitatively different abilities. Wechsler’s foresight is indeed relevant and practical in light of contemporary views on intelligence, which generally support a hierarchical model of cognitive abilities. General intelligence (g) tends to emerge whenever a sufficient number of cognitively complex variables are analyzed (Carroll, 1993). During Wechsler’s lifetime, he expressed the concern that measuring abilities that are too narrow could be impractical, and instead embraced the use of broader domain or composite scores that are supported by both factor-analytic research and clinical utility (Cohen, 1957, 1959; Kaufman, 1975; Wechsler, 1958). The various tasks of cognitive ability that he selected, including measures of verbal comprehension, perceptual organization, working memory, and processing speed—although not known by these names at the time—continue to align with current research into the structure of cognitive abilities (see, for example, Kane, Hambrick, & Conway, 2005) and recent theories of intelligence (Carroll, 1993; Sternberg, 2000).
Among the various cognitive theories, the Cattell-Horn-Carroll model (CHC; Carroll, 1993, 2005; Schneider & McGrew, in press) is often considered a suitable framework for exploring and comparing the nature of cognitive instruments (Keith & Reynolds, 2010). Resembling Wechsler’s hierarchical view but in more detail, the CHC model places cognitive abilities at three structural levels. At the top of the CHC hierarchy is g. In the middle level are 7 to 10 broad abilities [i.e., crystallized intelligence (Gc), fluid intelligence (Gf), quantitative knowledge (Gq), short-term memory (Gsm), long-term retrieval (Glr), visual processing (Gv), auditory processing (Ga), processing speed (Gs), reading and writing ability (Grw), and decision-reaction time-speed (Gt)]. There are more than 70 narrow abilities on the base level, and extension of the model is ongoing (McGrew, 2005; McGrew & Flanagan, 1998; Schneider & McGrew, in press).
Wechsler’s tests have gone through progressive revisions to update the theoretical foundations based on cumulative research findings in neuropsychology, executive functions, and working memory (Coalson, Raiford, Saklofske, & Weiss, 2010; Weiss, Saklofske, Coalson, & Raiford, 2010a). In particular, the most notable change involves the evolution of the composites/index scores. The emergence of the four-index structure (verbal comprehension, perceptual organization, working memory, and processing speed) can be traced to the WISC-III (Wechsler, 1991). Since fluid reasoning has been emphasized as a key aspect of cognitive functioning (Carroll, 1997; Cattell, 1943, 1963; Cattell & Horn, 1978; Sternberg, 1995, 2000), one of the continuous efforts made has been to incorporate new subtests for enhancing the scale’s measurement capabilities of fluid reasoning, including new subtests such as Matrix Reasoning (Wechsler, 1997), Picture Concepts (Wechsler, 2002, 2003), and Figure Weights (Wechsler, 2008b). The increased emphasis on fluid reasoning was also represented by renaming of the Perceptual Organization Index as the Perceptual Reasoning Index in the WISC-IV (Wechsler, 2003) and WAIS-IV (Wechsler, 2008a). The PRI is currently a measure of fluid reasoning, spatial processing, attentiveness to detail, and visual–motor integration (Wechsler, 2008a). As the Wechsler tests continue to evolve, further changes to PRI seem relevant. That is, is there a need to further differentiate the components inside PRI? Would it be clinically advantageous to regroup PRI into two separate factors: fluid reasoning and perceptual organization?
Several studies have explored the WAIS-IV factor model. Benson, Hulac, and Kranzler (2010) argued for the clear superiority of a CHC-derived five-factor model in which Matrix Reasoning, Figure Weights, and Arithmetic loaded on a fifth factor they termed fluid reasoning. Ward, Bergman, and Herbert (2011) proposed a modification of the WAIS-IV four-factor model in which residual factors were formed for visual-spatial organization (Block Design, Visual Puzzles, and Picture Completion) and quantitative reasoning (Figure Weights and Arithmetic). Ward and colleagues argued that the modified model was theoretically consistent with the original WAIS-IV model and that there was no compelling statistical reason to prefer the Benson et al. five-factor model over a modified four-factor model. Conceptually, such a model is equivalent to one with correlated first-order visual-spatial and quantitative reasoning factors.
When comparing models, another important validity issue is measurement invariance. Invariance is a fundamental property for any measure (Drasgow, 1984, 1987; Horn & McArdle, 1992; Vandenberg & Lance, 2000). It assumes that the test measures the same constructs in different groups. Meaningful comparisons can only be made if the measures are comparable (Chen, Sousa, & West, 2005). In empirical settings, the Wechsler Intelligence Scales are often used as part of a diagnostic assessment (Rabin, Barr, & Burton, 2005; Weiss, Saklofske, Coalson, & Raiford, 2010b). Implicit in this common practice is the assumption that WAIS-IV index scores and subtest results have the same meaning for adults in both normative and clinical populations. That is, equivalence is assumed to hold for underlying theoretical structures, factor patterns, magnitudes of factor loadings, and subtest intercepts given the same latent means for the underlying factors. Measurement equivalence of the WAIS-IV across large normative and clinical samples has never been reported.
This research employed a large sample with a large degree of variation to investigate construct validity of the WAIS-IV. The purpose of this study was threefold. First, the constructs underlying the WAIS-IV were investigated by comparing its current four-factor structure to a hypothesized five-factor structure. Second, abilities measured by subtests and possible cross-loadings were tested and verified for both structures. Finally, measurement invariance was evaluated to test whether the 15 subtests in WAIS-IV measure latent abilities in the same way for adults in the normative sample and those in the clinical sample. We used WAIS-IV data from the normative and clinical samples to examine invariance of these models with a multigroup higher order confirmatory analysis of mean and covariance structures (MG-MACS). This analysis will subsequently be referred to as the “clinical invariance” analyses.
Method
Participants
We used the WAIS-IV standardization sample as our nonclinical sample. It includes the responses of 1,800 normal adults 16 to 69 years old. Standardization cases from age 70 to 90 were excluded because they did not take all 15 WAIS-IV subtests. This normative sample was carefully selected to match the 2005 U.S. census on demographic variables such as region, gender, education, and ethnicity. The overall mean Full Scale IQ (FSIQ) was 100 (SD = 15). For all 15 subtests, means ranged from 9.97 to 10.09; standard deviations, 2.88 to 3.12; skewness, −0.27 to 0.56; and kurtosis, −0.53 to 1.94. Mean age was 37.2 (SD = 17.3). A more detailed description of this normative sample is in the WAIS-IV technical manual (Wechsler, 2008b).
We also used the WAIS-IV clinical sample. This heterogeneous group included 411 adults with various clinical diagnoses. These adults had mild or moderate intellectual disability (25.3%); borderline intellectual functioning (6.6%); reading or math learning disabilities (18.2%); ADHD (10.7%); traumatic brain injury (5.4%); autistic or Asperger’s Disorder (13.6%); major depressive disorder (9.0%); mild cognitive impairment (4.1%); probable mild dementia or Alzheimer’s (1.9%); and other disabilities (5.2%). The mean FSIQ in the overall clinical sample was 80.7 (SD = 20.4). For all 15 subtests, means ranged from 6.39 to 7.75; standard deviations, 3.28 to 3.87; skewness, 0.10 to 0.58; and kurtosis, −0.98 to 0.30. The mean age (31.3, SD = 16.5) was similar to that of the normative sample. A detailed description of this clinical sample is in the WAIS-IV technical manual (Wechsler, 2008b).
Instrumentation
The WAIS-IV has 10 core subtests (Similarities [SI], Vocabulary [VC], Information [IN], Block Design [BD], Matrix Reasoning [MR], Visual Puzzles [VP], Digit Span [DS], Arithmetic [AR], Coding [CD], Symbol Search [SS]), and five supplemental subtests (Comprehension [CO], Figure Weights [FW], Picture Completion [PC], Letter-Number Sequencing [LN], and Cancellation [CA]).
Analysis
We analyzed all subtests of the WAIS-IV. Tests for the higher order confirmatory factor structure were based on analysis of mean and covariance structure models using LISREL 8.8 (Jöreskog & Sörbom, 2006). Both four- and five-factor models with hypothesized cross loadings were tested individually.
In addition to testing a higher order g, the initial four-factor structure specified four verbal comprehension subtests (SI, VC, CO, IN) on the first factor, five perceptual reasoning subtests (BD, MR, VP, FW, PC) on the second factor, three working memory subtests (DS, LN, AR) on the third factor, and three processing speed subtests (CD, SS, CA) on the fourth factor. This second-order model was defined as the initial four-factor model (Model A1 in Table 1).
Hypotheses Testing the Four-Factor Structure.
Note. ᵃCompared to Model A1, unless otherwise noted. ᵇCompared to Model C1.
Source: Data and table copyright Pearson 2013.
The initial five-factor model specified a higher order g and five first-order factors. This initial model was based on evaluation of subtest content and previous research (Benson, Hulac, & Kranzler, 2010; Keith et al., 2006; Lichtenberger & Kaufman, 2009). This model specified four subtests (SI, VC, CO, IN) on the verbal comprehension factor (VCI/Gc), three subtests (BD, VP, PC) on the perceptual organization factor (POI/Gv), three subtests (MR, FW, AR) on the fluid reasoning factor (FRI/Gf), two subtests (DS, LN) on the working memory factor (WMI/Gsm), and three subtests (CD, SS, CA) on the processing speed factor (PSI/Gs). The five-factor model includes both Wechsler-related and CHC-related names to allow comparability. Compared to the four-factor model, this five-factor model split the five perceptual reasoning subtests into fluid reasoning and perceptual organization factors and specified the Arithmetic subtest on the fluid reasoning factor (Model A1 in Table 2).
Hypotheses Testing the Five-Factor Structure.
Note. ᵃCompared to Model A1, unless otherwise noted. ᵇCompared to Model C1. ᶜVCI(Gc); FRI(Gf); POI(Gv); WMI(Gsm); PSI(Gs).
Source: Data and table copyright Pearson 2013.
Alternative competing models of both structures were also tested. Possible cross-loadings and correlated errors were investigated for selected subtests thought to measure multiple abilities or that showed split loadings in previous studies. Parameters statistically nonsignificant or having values too trivial to be practically meaningful were deleted. For all analyses, a calibration-validation approach was used where two thirds of the normative sample (n = 1,200) was randomly selected as the calibration sample to test hypotheses and the remaining third (n = 600) was used to cross-validate the results of calibration analyses. Any statistically significant factor loading greater than .10 was explored in the calibration phase, and the cut point was set at .20 in the cross-validation phase to reduce the loadings to nontrivial ones only. Once a best-fitting solution from each of the four- and the five-factor models was calibrated and validated, final parameters were retested using the entire sample (n = 1,800).
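The calibration-validation split described above can be sketched as follows. This is only an illustration of a random two-thirds/one-third partition: the case identifiers, the function name, and the seed are hypothetical and are not taken from the study's data or software.

```python
import random


def calibration_validation_split(case_ids, n_calibration, seed=0):
    """Randomly partition case IDs into calibration and validation subsets."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_calibration], shuffled[n_calibration:]


# Hypothetical identifiers standing in for the 1,800 normative adults.
normative_ids = range(1, 1801)
calibration, validation = calibration_validation_split(normative_ids, n_calibration=1200)
print(len(calibration), len(validation))
```

Hypotheses would be explored on the calibration cases only, and the retained parameters re-estimated on the validation cases before the final model is fit to all 1,800 cases.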
Clinical invariance of each final validated model was examined. Six levels of nested models were tested to investigate the degree of invariance. Each level had more constraints than the previous level (Byrne & Stewart, 2006; Chen et al., 2005; Keith & Reynolds, in press; Meredith, 1993). The initial and weakest level was configural invariance. It assumed the same number of factors and the same overall factor pattern across groups. The second level was first-order factor loading invariance, also called metric (or weak factorial) invariance. Loadings of subtests on factors were constrained so that factor loadings were equal across groups. When the factor loadings are equal, scales of latent variables are the same for both groups and the unit of measurement is identical. That is, for each unit change in a latent variable, scores on subtests change by the same amount in both groups. The third level was intercept invariance, also known as scalar (or strong factorial) invariance. At this level of invariance, any group differences in subtest means are a result of true mean differences in latent factors. Subtests have the same intercepts across groups given the same latent means for an underlying factor. At the fourth level, to examine whether “all group differences on the measured variables are captured by, and attributable to, group differences on the common factors” (Widaman & Reise, 1997, p. 296), we tested invariance of residuals, also called strict factorial invariance. These residuals are a combination of subtest-specific unique variance and random measurement errors. The fifth level was second-order factor loading invariance. This level assumed first-order latent factors show the same amount of change in each group for the same amount of increase in g. Finally, at the sixth level, we tested for invariance of disturbances (factor unique variance) of first-order factors.
Although residual/disturbance invariance is not fundamentally crucial for measurement invariance, it provides substantial information about human cognitive abilities across groups. The scale of latent factors was identified by fixing a factor loading of each factor to one.
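For readers who prefer symbols, the six nested levels can be written against a generic second-order mean and covariance structure model. The notation below is an assumption introduced for illustration (the superscript k indexes the normative and clinical groups) and is not taken from the article or from LISREL output.

```latex
% Second-order model in group k (k = normative, clinical); notation is illustrative.
\begin{align*}
  \mathbf{x}^{(k)}    &= \boldsymbol{\tau}^{(k)} + \boldsymbol{\Lambda}^{(k)} \boldsymbol{\eta}^{(k)} + \boldsymbol{\varepsilon}^{(k)},
      & \operatorname{Cov}\bigl(\boldsymbol{\varepsilon}^{(k)}\bigr) &= \boldsymbol{\Theta}^{(k)} \\
  \boldsymbol{\eta}^{(k)} &= \boldsymbol{\alpha}^{(k)} + \boldsymbol{\gamma}^{(k)} g^{(k)} + \boldsymbol{\zeta}^{(k)},
      & \operatorname{Cov}\bigl(\boldsymbol{\zeta}^{(k)}\bigr) &= \boldsymbol{\Psi}^{(k)}
\end{align*}
% Each level keeps the constraints of the levels before it.
\begin{enumerate}
  \item Configural: same pattern of free and fixed elements in $\boldsymbol{\Lambda}$ and $\boldsymbol{\gamma}$ across groups.
  \item Metric (weak): $\boldsymbol{\Lambda}^{(1)} = \boldsymbol{\Lambda}^{(2)}$.
  \item Scalar (strong): $\boldsymbol{\tau}^{(1)} = \boldsymbol{\tau}^{(2)}$.
  \item Strict (residuals): $\boldsymbol{\Theta}^{(1)} = \boldsymbol{\Theta}^{(2)}$.
  \item Second-order loadings: $\boldsymbol{\gamma}^{(1)} = \boldsymbol{\gamma}^{(2)}$.
  \item Disturbances: $\boldsymbol{\Psi}^{(1)} = \boldsymbol{\Psi}^{(2)}$.
\end{enumerate}
```

Here x holds the 15 subtest scores, η the first-order factors, and g the second-order general factor; τ, Λ, and Θ are the subtest intercepts, first-order loadings, and residual variances, while γ and Ψ are the second-order loadings and first-order disturbances.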
Multiple indices of model fit were used to evaluate and compare the various models in this study (Bentler & Bonett, 1980; Hu & Bentler, 1998, 1999; Kline, 2005; Marsh, Balla, & McDonald, 1988). Single models were evaluated using the comparative fit index (CFI), root mean square error of approximation (RMSEA), and standardized root mean square residual (SRMR). An RMSEA less than .05 corresponded to a good fit, and .08 was considered an acceptable fit (McDonald & Ho, 2002). For completeness, we included the 90% confidence interval for RMSEA. SRMR values less than .08 were considered acceptable (Hu & Bentler, 1999). A value of .95 served as the cutoff for acceptable fit on all indices ranging from zero to 1, with 1 indicating a perfect fit (Hoyle & Panter, 1995; Hu & Bentler, 1999; Kline, 2005). Change in chi-square (Δχ2) was used to evaluate competing, nested models (Bentler & Bonett, 1980). The Akaike information criterion (AIC) and sample size adjusted Bayesian Information Criterion (aBIC) were also used for comparisons of nonnested models (Kaplan, 2000; Loehlin, 2004), with smaller values indicating a better fit. Comparatively, aBIC has a greater reward for parsimony than does the AIC.
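As a rough guide to how two of these indices relate to the model chi-square, the sketch below uses common textbook formulas for the RMSEA point estimate and the CFI. It is an assumption that these match the exact LISREL 8.8 computations (some programs divide by N rather than N − 1), the input values are hypothetical, and the "poor" label is added only for illustration.

```python
import math


def rmsea(chi2, df, n):
    """RMSEA point estimate from a model chi-square, its df, and sample size n."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_baseline, df_baseline):
    """Comparative fit index relative to a baseline (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_baseline = max(chi2_baseline - df_baseline, d_model)
    return 1.0 if d_baseline == 0 else 1.0 - d_model / d_baseline


def describe_rmsea(value):
    """Apply the cutoffs stated in the text: < .05 good, up to .08 acceptable."""
    if value < 0.05:
        return "good"
    if value <= 0.08:
        return "acceptable"
    return "poor"  # label assumed for values above .08


# Hypothetical chi-square values, not results from the study.
example_rmsea = rmsea(chi2=300.0, df=80, n=1200)
example_cfi = cfi(chi2_model=300.0, df_model=80, chi2_baseline=15000.0, df_baseline=105)
print(round(example_rmsea, 3), describe_rmsea(example_rmsea), round(example_cfi, 3))
```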
To determine evidence of invariance, since there is little consensus concerning the most appropriate criterion (Byrne & Stewart, 2006), two perspectives were evaluated for invariance analyses: (a) the traditional perspective based on Δχ2 and (b) the practical perspective based on differences in comparative fit index (ΔCFI). When evaluating the traditional perspective, given the large sample and the number of comparisons being made, we used a stricter criterion for statistical significance of Δχ2 (p < .001). The Δχ2 test is known to be sensitive to sample size and moderate discrepancies from normality (Kline, 2005; West, Finch, & Curran, 1995). Therefore, Cheung and Rensvold (2002) recommended ΔCFI as superior to Δχ2 for tests of invariance because it is independent of both model complexity and sample size and because it is not correlated with the overall fit measures. “A value of ΔCFI smaller than or equal to –.01 indicates that the null hypothesis of invariance should not be rejected” (Cheung & Rensvold, 2002, p. 251).
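The two decision rules can be sketched in a few lines. The chi-square tail probability is computed with a closed-form series for integer degrees of freedom so that the example needs only the standard library. The function names and the example inputs are hypothetical, and the code assumes the usual reading of the ΔCFI rule, namely that a drop in CFI of no more than .01 is treated as consistent with invariance.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2-1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + 2*phi(sqrt(x)) * sum_r x^(r-1/2) / (1*3*...*(2r-1))
    phi = math.exp(-half) / math.sqrt(2.0 * math.pi)
    term, total = math.sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(half)) + 2.0 * phi * total


def compare_nested(delta_chi2, delta_df, cfi_constrained, cfi_unconstrained,
                   alpha=0.001, cfi_cutoff=-0.01):
    """Evaluate an added set of equality constraints under both perspectives."""
    p_value = chi2_sf(delta_chi2, delta_df)
    delta_cfi = cfi_constrained - cfi_unconstrained
    return {
        "p_value": p_value,
        "delta_cfi": delta_cfi,
        "invariant_by_chi2": p_value >= alpha,
        # small tolerance guards against floating-point error at the cutoff
        "invariant_by_cfi": delta_cfi >= cfi_cutoff - 1e-9,
    }


# Hypothetical comparison: 11 added constraints, CFI unchanged.
result = compare_nested(delta_chi2=12.0, delta_df=11,
                        cfi_constrained=0.990, cfi_unconstrained=0.990)
print(result["invariant_by_chi2"], result["invariant_by_cfi"])
```

With a large sample the two rules can disagree, which is why both are reported side by side in the invariance analyses.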
Results
The Most Appropriate Four-Factor Structure
All examinations of the four-factor structure are shown in Table 1. In the calibration phase, goodness-of-fit indexes reported for the initial four-factor model (Model A1) were within the acceptable range, showing that the four-factor structure fit the data well. Each hypothesis was checked individually for cross-loadings and error correlations to determine whether subtests were pure or mixed measures of the four latent factors. Six originally unspecified parameters were found to yield statistically significant improvement in model fit. Arithmetic was found to be a mixed measure of WMI, PRI, and VCI. Loadings of Arithmetic on each of these three factors were .36, .29, and .20, respectively. The Δχ2 between Model A5 and A1 was statistically significant, suggesting that allowing Arithmetic to have cross-loadings on all three factors substantially improved model fit. Three other subtests also showed significant cross-loadings: Similarities loaded mainly on VCI (.74) and slightly on PRI (.10); Figure Weights cross-loaded on PRI (.51) and WMI (.29); Coding mainly measured PSI (.69) but also some WMI (.12). Errors of Arithmetic and Figure Weights showed a trivial, but statistically significant, correlation (.10).
Validation analyses tested all six parameters in Model B1, but only Arithmetic showed statistically and practically meaningful cross-loadings (Table 1). Arithmetic was cross-loaded on WMI, PRI, and VCI (factor loadings were .39, .20, and .25 respectively). PRI showed the highest g loading across four first-order factors.
This validated model with cross-loaded AR was retested using the entire sample of 1,800 adults (Model C2). This validated model had a far superior fit compared to the initial four-factor model (Model C1). Goodness-of-fit indexes for both models were within ideal range.
Standardized estimates for Model C2 are shown in Figure 1. All 15 subtests loaded strongly on corresponding factors. Consistent with literature, Arithmetic was a mixed measure of WMI, PRI, and VCI (factor loadings were .36, .27, and .22 respectively). Across all four first-order factors, PRI had the highest g loading (.91). All parameter estimates were reasonable and theoretically sound.

Standardized estimation for the final validated four-factor model.
The Most Appropriate Five-Factor Structure
Examinations of five-factor models are shown in Table 2. In the calibration phase, the initial five-factor model (Model A1) showed a good fit to the data, suggesting it was an appropriate model for interpreting WAIS-IV results. Each hypothesis of cross-loadings and related errors was individually tested before being cross-validated in combination. Calibration analyses supported separating FRI(Gf) and POI(Gv). Arithmetic was a mixed measure of FRI(Gf), WMI(Gsm), and VCI(Gc) (factor loadings were .39, .30, and .15 respectively). Four other subtests showed significant split-loadings above the .10 cutoff established for the calibration phase: Similarities on VCI(Gc) (.79) and FRI(Gf) (.11); Figure Weights on FRI(Gf) (.58) and POI(Gv) (.21); Matrix Reasoning on FRI(Gf) (.55) and POI(Gv) (.22); and Cancellation on PSI(Gs) (.48) and POI(Gv) (.11). Allowing a narrow ability Quantitative Reasoning (RQ) under FRI improved the model fit as well (cf. Ward et al., 2011).
When validated in combination and allowing a narrow ability RQ under FRI(Gf), only Arithmetic, Figure Weights, and Matrix Reasoning showed robust split loadings. When tested with the entire sample (n = 1,800), goodness-of-fit indexes for Model C2 suggested a good fit to the nonclinical adults (CFI = .99, RMSEA = .049, SRMR = .033). With four extra parameters allowed, Model C2 showed significant improvement compared to the initial five-factor model (Model C1), in which no cross-loadings were specified, Δχ2(4) = 118.34, p < .0001.
Standardized estimates for Model C2 are shown in Figure 2. All 15 subtests loaded strongly on corresponding factors. Three subtests showed salient split loadings. Arithmetic was, as expected, a complex measure of Quantitative Reasoning (RQ) and WMI(Gsm) (factor loadings were .62 and .23, respectively). Figure Weights measured both RQ and POI(Gv) (factor loadings were .53 and .29, respectively). Matrix Reasoning mainly measured FRI(Gf) and some POI(Gv) (factor loadings were .48 and .27, respectively). The highest g loading was on FRI(Gf) (.99).

Standardized estimation for the final validated five-factor model.
Which Fit Better: Four- or Five-Factor Model?
Because fit index values for Models C1 and C2 in both Table 1 and Table 2 were within the ideal range, both four-factor and five-factor models provide a good fit to the data. Therefore, either of these models could serve as a basis for WAIS-IV score interpretation. A side-by-side comparison of the two structures showed that the five-factor model fit better. When comparing initial models (Model C1) from both sets, the aBIC values for four- and five-factor C1 models were 975.29 and 700.83, respectively. When cross-loadings were freed, the five-factor Model C2 still showed an overall better fit than the four-factor Model C2 (aBIC values were 793.76 and 599.78, respectively).
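The aBIC comparison reduces to picking the smaller value within each pair of models. The short sketch below uses the four aBIC values reported above; the dictionary layout and function name are illustrative only.

```python
# aBIC values reported in the text for models fit to the full normative sample.
abic = {
    ("four-factor", "C1"): 975.29,
    ("five-factor", "C1"): 700.83,
    ("four-factor", "C2"): 793.76,
    ("five-factor", "C2"): 599.78,
}


def preferred(model_label):
    """Return the structure with the smaller aBIC and the size of the gap."""
    four = abic[("four-factor", model_label)]
    five = abic[("five-factor", model_label)]
    winner = "five-factor" if five < four else "four-factor"
    return winner, round(abs(four - five), 2)


for label in ("C1", "C2"):
    print(label, preferred(label))
```

The five-factor structure has the smaller aBIC in both comparisons, by 274.46 for the initial models and by 193.98 once cross-loadings are freed.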
For these two C2 Models with cross-loadings, subtest loadings on g are shown in Table 3. Estimated values from both models were similar. Arithmetic, Vocabulary, and Figure Weights had the highest loadings on g. The Cancellation subtest had the lowest loading on g.
Loading of WAIS-IV Subtests on the Second-Order g Factor.
Source: Data and table copyright Pearson 2013.
Invariance Between Normative and Clinical Samples in Validated Four-Factor Model
The relatively large variation in sample sizes between the normative and clinical samples would result in excessive power in the standardization group. To prevent findings from being so heavily weighted toward the normative sample, we compared the clinical group (n = 411) with a randomly selected representative sample (n = 411) from the normative group.
Clinical invariance analyses for the final cross-validated four-factor model (Model C2 in Table 1) are reported in Table 4. Variance-covariance matrices were first constrained to be equal across normative and clinical samples (Model 1). This constrained model fit the data adequately (CFI = .99; RMSEA = .057), suggesting fairly invariant WAIS-IV subtest covariance patterns in adults. Since any factor structure is derived from these variance-covariance matrices, if the WAIS-IV measures the same constructs, factor structure between the normative and clinical samples should be similar.
Invariance Analyses of Four-Factor Model (Model C2 in Table 1) of Normative and Clinical Adults.
Note. Compare model fit with previous model, unless noted otherwise.
Source: Data and table copyright Pearson 2013.
First, the configural model (Model 2) provided an acceptable fit to the data.¹ Normative and clinical adults shared the same WAIS-IV first- and second-order four-factor patterns, and corresponding subtests loaded onto the same factors. With the factor pattern established, we imposed cross-group constraints on first-order factor loadings. The fit was not substantially reduced (Model 3). The ΔCFI was 0. Using a .001 criterion, the Δχ2 test indicated that this model did not fit significantly worse than the configural model. This means that the subtests measure the same latent factors in both groups. Next, we constrained subtest intercepts to be equal. To properly identify this model, the means of first-order factors in the normative group were fixed to zero, but those in the clinical group were freed. Thus factor means for the clinical group represent mean differences. All corresponding first-order intercepts were constrained to be equal. The addition of intercept constraints significantly reduced fit according to the Δχ2 but not according to ΔCFI and other indices. Sources of misfit were the intercepts of the Symbol Search and Block Design subtests. When these two parameters were freed, model fit improved substantially and no longer differed significantly from that of the first-order factor loading invariant model. Next, subtest residuals and structural parameters (second-order loadings, first-order unique variances) were constrained to be equal between groups in steps. These constraints did not result in deterioration of fit. In fact, aBIC generally improved, suggesting these steps achieved measurement invariance.
The small size of the indicators of noninvariance for Symbol Search and Block Design should be interpreted in light of the complexity of the model and strictness of the test. We conclude that the WAIS-IV shows acceptable levels of invariance among factors between the normative and clinical groups. Differences in subtest scores on the WAIS-IV are generally due to latent constructs, and the test is not biased by clinical state.² Those who take a stricter approach may consider that the group differences in means of the Symbol Search and Block Design subtests are not fully attributable to the mean differences in the model-specified latent factors. These differences could be a result of unmodeled minor factors. Estimated Symbol Search intercepts were 10.10 and 10.95 for normative and clinical adults, respectively. Estimated Block Design intercepts were 10.07 and 10.39, respectively. Given the same levels of Processing Speed and Perceptual Reasoning, the clinical group scored slightly higher on these two intercepts, showing that these tasks are slightly easier for this group than would be expected based on their Processing Speed and Perceptual Reasoning abilities. Again, these differences may be the result of these subtests measuring different narrow abilities compared to other subtests on these factors.
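The size of these intercept differences can be made concrete with a short worked equation. The notation is an assumption introduced for illustration: for a subtest j that loads on a single first-order factor with an invariant loading, the expected score at a given factor level depends on group only through the intercept.

```latex
% Expected score on subtest j in group k at first-order factor level \eta
% (loading \lambda_j held equal across groups); notation is illustrative.
\begin{align*}
  \mathrm{E}\bigl[x_j^{(k)} \mid \eta\bigr] &= \tau_j^{(k)} + \lambda_j \eta \\
  \mathrm{E}\bigl[x_j^{(\mathrm{clin})} \mid \eta\bigr] - \mathrm{E}\bigl[x_j^{(\mathrm{norm})} \mid \eta\bigr]
      &= \tau_j^{(\mathrm{clin})} - \tau_j^{(\mathrm{norm})} \\
  \text{Symbol Search:}\quad 10.95 - 10.10 &= 0.85 \\
  \text{Block Design:}\quad 10.39 - 10.07 &= 0.32
\end{align*}
```

Both differences are under one scaled-score point, which is small relative to the subtest standard deviations of roughly 3 reported for the normative sample.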
Invariance Between Normative and Clinical Samples in the Validated Five-Factor Model
Following a similar procedure, factorial invariance of the final validated five-factor model (Model C2 in Table 2) was assessed. Fits of the various models are shown in Table 5. Evaluation of various indices, such as ΔCFI and aBIC, generally supported full measurement invariance. The five-factor model demonstrated good levels of factorial invariance between normative and clinical adults.³ When the stricter Δχ2 criterion was considered, only one parameter was revealed as noninvariant: the Symbol Search intercept was freely estimated as 9.99 and 10.62 for the normative and clinical samples, respectively. Again, the clinical group was slightly higher on this intercept. The group difference on the Symbol Search subtest might not be fully attributable to the mean differences in the model-specified latent factors.
Invariance Analyses of Five-Factor Model (Model C2 in Table 2) of Normative and Clinical Adults.
Note. ᵃCompare model fit with previous model, unless noted otherwise. ᵇThe model could not be estimated without an additional constraint. We fixed the latent mean of the PSI factor to the value found in the previous model.
Source: Data and table copyright Pearson 2013.
Primary and secondary interpretations of each WAIS-IV subtest, based on the pattern of factor loadings observed, are shown in Table 6.
Primary and Secondary Abilities Measured by WAIS-IV Subtests.
Source: Data and table copyright Pearson 2013.
Discussion
This study is important because it is the first study to evaluate the clinical validity of an alternative five-factor measurement model for the WAIS-IV. The four-factor interpretive approach is the model published with the test’s technical manual (Wechsler, 2008b). The five-factor model reorganizes the 15 WAIS-IV subtests into five factors, with the additional factor measuring fluid reasoning.
The first major set of findings in this study is that both four- and five-factor models fit the data well and provide meaningful strategies for interpreting WAIS-IV scores. Fit statistics were within the ideal range for both models. Comparatively, the five-factor model fit better, providing psychometric support for separating the WAIS-IV PRI into two composites as follows: POI(Gv) consisting of Block Design, Visual Puzzles, and Picture Completion; and FRI(Gf) consisting of Matrix Reasoning, Arithmetic, and Figure Weights. Our five-factor solution included Quantitative Reasoning (RQ), consisting of Arithmetic and Figure Weights, as a narrow ability subsumed under FRI(Gf).
The second and most important set of findings is that both the four- and five-factor models derived from the normative data provide a good fit to the clinical data. Invariance analyses generally supported invariance of both models between normative and clinical groups. The 15 WAIS-IV subtests demonstrated the same underlying theoretical latent constructs, strength of relations among factors and subtests, validity of each of the first-order factors, and the same subtest intercepts and communalities regardless of clinical status. Thus each model demonstrated nearly full factorial invariance between clinical and nonclinical samples. This means that whichever model is used, the WAIS-IV subtests have the same meaning regardless of clinical status.
The clinical sample scored slightly higher on both the Symbol Search and Block Design subtests in the four-factor model, and slightly higher on Symbol Search in the five-factor model. Although the strict criterion identified these two subtest intercepts as noninvariant in the clinical sample, the small magnitude of the differences identified likely has little clinical meaning given model complexity and the strictness of parameter constraints, and does not jeopardize overall WAIS-IV invariance. One probable reason for the finding is that these subtests also measure some degree of narrow abilities not modeled by the factor structures shown. In practice, however, factor loading invariance is often seen as the most important step in invariance testing, whereas complete intercept invariance is hard to fulfill (Reynolds & Keith, in press) and subtest intercept variation does occur (Cooke, Kosson, & Michie, 2001; Immekus & Maller, 2010; Maller & French, 2004). Thus we believe that this variation does not preclude the usefulness of these subtests for measuring the proposed latent abilities in clinical samples.
A third set of major findings concerns clarification of multiple abilities measured by some subtests, as evidenced by cross loadings on more than one factor, and confirmation of subtest g loadings. As expected, the Arithmetic, Figure Weights, and Matrix Reasoning subtests measured multiple abilities. Arithmetic, Vocabulary, and Figure Weights had the highest g loadings. Cancellation had the lowest g loading.
The Arithmetic subtest likely requires examinees to integrate a complex mix of abilities in both adult (Benson et al., 2010; Bowden, Saklofske, & Weiss, 2011; Bowden, Weiss, Holdnack, & Lloyd, 2006; Lichtenberger & Kaufman, 2009; Ward et al., 2011) and child populations (Chen, Keith, Chen, & Chang, 2009; Keith et al., 2006). In this study, Arithmetic loaded mainly on working memory (.36), with some perceptual reasoning (.27) and verbal comprehension (.22) in the four-factor model. In the five-factor model, Arithmetic mainly measured quantitative reasoning (.62), which is a narrow ability under fluid reasoning, and some working memory (.23). These results are consistent with empirical findings presented by Keith, who found loadings for Arithmetic on Gf, Gsm, and Gc as .34, .31, and .19, respectively (Lichtenberger & Kaufman, 2009, p. 32).
Our findings are also consistent with current research into the theoretical structure of intelligence, which documents considerable shared variance between working memory and fluid reasoning (Conway, Cowan, Bunting, Therriault, & Minkoff, 2002; de Jong & Das-Smaal, 1995; Engle, Tuholski, Laughlin, & Conway, 1999; Fry & Hale, 1996, 2000; Kane et al., 2005). Specifically, the cognitive control mechanisms involved in working memory have been identified as the source of the link between working memory and fluid intelligence (Engel de Abreu, Conway, & Gathercole, 2010). This potentially explains the cross loading of the Arithmetic subtest on FRI(Gf) and WMI(Gsm), and the movement of Arithmetic from the WMI to the FRI in the five-factor solution. The minor loading of Arithmetic on VCI(Gc) in the four-factor model is readily explained by the verbal nature of the Arithmetic word problems.
Two other subtests, Figure Weights and Matrix Reasoning, measured mixed abilities in the five-factor model. Figure Weights loaded on perceptual reasoning in the four-factor model. In the five-factor model, Figure Weights measured mainly quantitative (fluid) reasoning (.53) and some perceptual organization (.29). In this subtest, the examinee is asked to view a scale with missing weights and to select the option that keeps the scale balanced (Wechsler, 2008a). Based on the response processes required by this task, it seems reasonable for this subtest to involve fluid reasoning, working memory, visual processing, and the narrow ability of quantitative reasoning. Matrix Reasoning loaded on the PRI in the four-factor model. In the five-factor model, Matrix Reasoning measured both FRI(Gf) and POI(Gv) (factor loadings were .48 and .27, respectively), a result also observed by Keith et al. (2006).
The finding that some subtests cross load on more than one factor does not necessarily indicate lack of content validity. Cognitive abilities are interrelated in nature. Moreover, data from Figures 1 and 2 show that for every subtest the factor loading for each main ability was clearly higher than its loading for secondary abilities. The point of the discussion regarding cross loadings is to clarify the constructs measured by the subtests relative to the factors so as to inform clinical interpretation of unusual patient profiles. Proposed interpretive hypotheses for each subtest based on its primary and secondary factor loadings are shown in Table 6 above. Given consistent subtest scores within indexes, the column labeled “primary interpretation” is probably the best interpretation of constructs measured by each subtest. When inconsistencies are found, or when examiners wish to test specific hypotheses, the abilities listed in the column “secondary interpretation” may be worth considering.
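As a rough illustration of the primary versus secondary distinction, the sketch below orders the five-factor loadings quoted in this Discussion for the three cross-loading subtests. It is not the procedure used to construct Table 6; the dictionary layout, the factor labels used as keys, and the function name `rank_interpretations` are hypothetical choices made only for this example.

```python
# Five-factor loadings as quoted in the Discussion text above.
# The data layout and key labels are illustrative assumptions.
FIVE_FACTOR_LOADINGS = {
    "Arithmetic": {"RQ (under Gf)": 0.62, "WMI(Gsm)": 0.23},
    "Figure Weights": {"RQ (under Gf)": 0.53, "POI(Gv)": 0.29},
    "Matrix Reasoning": {"FRI(Gf)": 0.48, "POI(Gv)": 0.27},
}


def rank_interpretations(loadings):
    """Return (primary, secondaries), ordered by descending loading.

    Hypothetical helper: the factor with the largest loading is treated as
    the primary interpretation; the remaining factors are secondary.
    """
    ordered = sorted(loadings.items(), key=lambda item: item[1], reverse=True)
    primary = ordered[0][0]
    secondaries = [name for name, _ in ordered[1:]]
    return primary, secondaries


if __name__ == "__main__":
    for subtest, loadings in FIVE_FACTOR_LOADINGS.items():
        primary, secondaries = rank_interpretations(loadings)
        print(f"{subtest}: primary = {primary}; secondary = {', '.join(secondaries)}")
```

Ordering by loading size only mirrors the observation that each main-ability loading exceeded the secondary loadings; clinical interpretation would still rest on the full model and on the consistency of scores within each index.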
Overall, our findings have important applied implications for clinical interpretation of WAIS-IV results. Acceptability of the four- and five-factor models in both the normative and clinical samples suggests that these are useful and complementary models for interpreting WAIS-IV findings. For adults with consistent subtest scores within each of the four WAIS-IV composites, the current four WAIS-IV Index scores constitute an appropriate level of interpretation. For adults with discrepant subtest scores within some of the four composites, the five-factor model suggests a likely interpretive reorganization. For those patients who present subtest scatter within the PRI or WMI composite, our findings suggest that a common pattern may be consistency among the POI(Gv) subtests (i.e., Block Design, Visual Puzzles, and Picture Completion) and among the FRI(Gf) subtests (i.e., Matrix Reasoning, Arithmetic, and Figure Weights), but inconsistency across these two factors. Furthermore, when adults present subtest scatter within the FRI(Gf) factor, our data suggest interpretation of a narrow ability under FRI(Gf) known as Quantitative Reasoning (RQ) (i.e., Arithmetic and Figure Weights). Our data also suggest that this interpretive approach should be equally applicable for clinical patients and examinees from the general population.
While current results provide solid information for understanding the structure behind the 15 WAIS-IV subtests between normative and clinical adults, it is nonetheless important to note that this study focused primarily on the broad factors and generally did not tap the domains of narrow abilities. Joint CFA with other measures may provide a more complete picture of all abilities measured.
The choice of four- or five-factor models is somewhat controversial in the Wechsler literature. For example, based on the U.S. adult data, Keith reported that “The WAIS-IV four-factor structure fits better than the CHC model when Arithmetic is excluded” (Lichtenberger & Kaufman, 2009, p. 32). Chen et al. (2009) also reported that the superiority of the five-factor model over the four-factor model is somewhat less salient in a population of Asian children: “Taiwanese children revealed more strongly correlated fluid reasoning and visual-spatial processing factors than do American children. The correlation between visual-spatial processing and working memory was also higher” (Chen et al., 2009, p. 100). Therefore, many other variables, such as culture and subtest combination, should be considered when studying factor structures.
In conclusion, our results support the model-data fit for both four- and five-factor WAIS-IV models. Both models explained the data well. Thus both are plausible and have psychometric merit. Results also confirmed the finding by Benson et al. (2010) that the five-factor model fit better than did the four-factor model for the 15-subtest set, validating the approach of separating FRI(Gf) and POI(Gv) and of loading Arithmetic primarily on the fluid reasoning factor. Moreover, factor invariance of the WAIS-IV between normative and clinical samples was also supported. The subtests generally appear to measure the same abilities in both normative and clinical adults, supporting meaningful comparisons of WAIS-IV scores between these two samples. As Prifitera, Weiss, Saklofske, and Rolfhus (2005) suggested, while factor analysis is a useful tool for informing the best way to interpret relations among subtests, clinical utility should always be considered when selecting factors. Accumulating and balancing validity evidence from psychometric and clinical perspectives should continue.
Footnotes
Declaration of Conflicting Interests
Drs. Weiss, Zhu, and Chen were involved in the research and development of the WAIS-IV and WISC-IV as employees of Pearson, which is the publisher of numerous psychological tests including the Wechsler scales.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
