Abstract
In surveying the literature on assessment of cognitive abilities in adults and children, it is easy to assume that the proliferation of test batteries and terminology reflects a poverty of unifying models. However, the lack of recognition accorded good models of cognitive abilities may reflect inattention to theoretical development and injudicious use of empirical methods to validate models. In contrast, the studies of Weiss and colleagues in this volume reflect an evaluation of a widely cited model of cognitive abilities, the Cattell-Horn-Carroll (CHC) model, for understanding the construct validity of the Wechsler Intelligence Scales for adults and children. Using the CHC model as the basis for evaluating a test battery provides an excellent example of statistical analysis as theoretical evaluation, an approach long advocated by methodologists. In this commentary several specific aspects of model evaluation and refinement are also discussed.
CFA Is a Powerful Technique to Study Construct Validity
Over the last decade, numerous studies have demonstrated the applicability of a unified model of cognitive abilities. The Cattell-Horn-Carroll model provides an excellent basis to understand the theoretical structure of diverse test batteries used to assess cognitive abilities in adults and children in community samples and numerous special populations (Gladsjo et al., 2004; Jewsbury, Bowden, & Strauss, 2012; McGrew, 2009). The practical value and theoretical significance of a unified model of cognitive abilities should not be underestimated. The field of cognitive ability assessment has moved well beyond the era when models of assessment were criticized as atheoretical and based on arbitrary statistical methods that produced different results depending on the a priori assumptions of researchers (Gould, 1981).
In science, technology and theory often proceed together. Fortunately for the field of psychology in general, and diagnostic assessment in particular, improved methods for testing the theoretical structure underlying a diverse collection of cognitive ability tests have emerged in the form of confirmatory factor analysis (CFA). In contrast to the older methods of exploratory factor analysis (EFA), CFA facilitates explicit hypothesis testing, allowing alternative theoretical models to be evaluated on objective grounds (Brown, 2006; Kline, 2010). Central to the CFA approach is the opportunity to test the hypothesis, using null-hypothesis significance testing, that a specified theoretical or latent structure fits the observed data from a set of test scores. This approach to hypothesis testing is facilitated by the so-called goodness-of-fit criterion by which the fit of a hypothesized model is tested for its ability to provide an accurate representation of observed data. Goodness-of-fit criteria not only permit explicit hypothesis testing but also allow evaluation of the magnitude of the discrepancy between observed data and the hypothesized model. The advantages of CFA are so important that it has been suggested that for any test or test battery, once a test is published or a theoretical structure defined, EFA should be avoided and CFA should be routinely used for theoretical verification and refinement (Floyd & Widaman, 1995; Henson & Roberts, 2006).
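As a rough illustration of how a goodness-of-fit index expresses the magnitude of misfit, the sketch below computes one commonly reported index, the root mean square error of approximation (RMSEA), from a model chi-square, its degrees of freedom, and the sample size. The function name and the numerical values are hypothetical and chosen only for illustration; some software uses N rather than N - 1 in the denominator, so this should be read as one common variant rather than a definitive formula.

```python
import math


def rmsea(chi_square: float, df: int, n: int) -> float:
    """Point estimate of RMSEA from a model chi-square, its df, and sample size n.

    Uses the (n - 1) convention; some programs use n instead (assumption).
    """
    if df <= 0 or n <= 1:
        raise ValueError("RMSEA needs df > 0 and n > 1")
    # Misfit in excess of the degrees of freedom, floored at zero,
    # scaled by df and sample size.
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


# Hypothetical values, not taken from any study discussed here.
example = rmsea(chi_square=100.0, df=50, n=501)
print(round(example, 3))
```

A chi-square no larger than its degrees of freedom yields an RMSEA of zero, which is why the index is read as a measure of the size of the discrepancy rather than as a significance test.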
Unfortunately, the advantages of CFA are still not well appreciated by many clinical researchers and the incautious, even inappropriate use of EFA sometimes impedes theoretical progress in psychological assessment. A primary problem with the use of older EFA techniques is that solutions derived from these techniques are particularly susceptible to sample-specific influences and are less likely to be replicated when applied to another sample. The reason for this lack of replicability is that older EFA techniques, including principal components analysis, do not distinguish between variance attributable to the hypothesized model versus variance attributable to measurement error (Floyd & Widaman, 1995; Henson & Roberts, 2006). Because many test scores contain a substantial component of measurement error, the random influences of measurement error are prone to produce different solutions in each study sample. For example, there are numerous, different EFA models reported for the Beck Depression Inventory-II although the original model described for the first edition of the Beck Depression Inventory replicates as well or better than any competitor in the second edition, across diverse samples (Byrne, Stewart, Kennard, & Lee, 2007; Reilly, Bowden, Bardenhagen, & Cook, 2006). Because many EFA solutions are reported for particular tests or inventories, many researchers wrongly think that factor analysis as a class of techniques is unreliable and unrewarding. Instead, the primary problem lies in the incautious use of factor analysis. Carefully applied, factor analysis, and especially CFA, is a mainstay of construct validity evaluation in cognitive ability and personality research (Strauss & Smith, 2009).
Over recent years, developments in EFA methods have improved interpretation, for example, with the provision of goodness-of-fit statistics in EFA output. But the fundamental point remains that incautious application of factor analysis without careful theoretical justification or reliance on cumulative research is likely to lead to unnecessary model proliferation or reduced theoretical clarity. This preamble is intended to underscore the advantages for theoretical understanding afforded by the careful use of CFA, conducted in a robust theoretical framework. In two articles in this issue of JPA, Weiss and colleagues demonstrate the theoretical convergence of the model of cognitive abilities underlying the Wechsler Intelligence Scales for adults and children, converging on the C-H-C model (Weiss, Keith, Zhu, & Chen, 2013a, 2013b). For a student of psychological assessment, the theoretical convergence evident in scientific description of many of the major test batteries provides a conceptual clarity not apparent until relatively recently (McGrew, 2009).
C-H-C Guided CFA of Cognitive Ability Tests
Weiss and colleagues test alternative models of WAIS-IV and WISC-IV scores, contrasting two broadly similar models, the first derived from the Wechsler four-factor model reported from the WAIS-R and subsequently the WAIS-III, WAIS-IV, and WISC-III and WISC-IV (see Bowden, Carstairs, & Shores, 1999; Tulsky & Price, 2003; Wechsler, 2003, 2008). This first model describes four factors or latent variables underlying WAIS-IV and WISC-IV scores: Verbal Comprehension, Perceptual Reasoning, Working Memory, and Processing Speed. A second model, inspired by C-H-C theory, reorganizes subtests under Perceptual Reasoning, splitting Perceptual Reasoning into a Fluid Reasoning factor and a Visual Processing factor. Weiss and colleagues conclude that the five-factor model fits best in the WAIS-IV scores derived from the adult standardization sample, but the four- and five-factor models of WISC-IV scores fit equally well in the WISC-IV standardization sample (Weiss et al., 2013a, 2013b).
In fact, both the four- and five-factor models tested by Weiss and colleagues can be referenced to variants of the C-H-C model and could be interpreted as plausible alternatives, rather than the five-factor model being more compatible with the C-H-C model (Bowden, Saklofske, & Weiss, 2011a).
Measurement Invariance
Weiss and colleagues (2013a, 2013b) also tested measurement invariance across adult and child clinical samples of the WAIS-IV and WISC-IV scores, respectively, concluding that measurement invariance held across community standardization and clinical samples. Examination of measurement invariance remains a little-known extension of CFA although accessible methods to undertake comprehensive tests of measurement invariance have been available for well over a decade (Horn & McArdle, 1992; Widaman & Reise, 1997). Measurement invariance examines the hypothesis that the same model of test scores generalizes across diverse populations. Examination of measurement invariance provides one of the strongest tests of the generalizability of construct validity across populations (Bowden et al., 2008; Meredith, 1993). Demonstrating measurement invariance of WAIS-IV and WISC-IV scores across the community standardization and heterogeneous clinical samples shows that the scores on the respective tests are attributable to the same latent constructs in the populations examined and the subtest scores behave in the same way across populations. Measurement invariance analysis is relatively technical and involves several steps to extract the information relevant to the strong construct validity inferences. However, the significance of the finding of something close to so-called strict measurement invariance (Brown, 2006; Meredith, 1993) as reported by Weiss et al. should not be underestimated. With the minor caveats to be discussed further below, the finding of strict measurement invariance shows that the model of cognitive abilities, and the item-scores of the subtest indicators of the respective models, are the same in the community and clinical samples.
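The several steps of an invariance analysis can be summarized in conventional multiple-group notation. The symbols below are assumptions introduced here for illustration and do not appear in the studies under discussion: for group $g$, $x$ is the vector of subtest scores, $\tau$ the intercepts, $\Lambda$ the factor loadings, $\xi$ the latent factors with mean $\kappa$ and covariance $\Phi$, and $\Theta$ the residual covariance matrix.

```latex
\begin{align*}
  x^{(g)} &= \tau^{(g)} + \Lambda^{(g)} \xi^{(g)} + \delta^{(g)},
  \qquad \operatorname{Cov}\!\left(\delta^{(g)}\right) = \Theta^{(g)} \\[4pt]
  \mu^{(g)} &= \tau^{(g)} + \Lambda^{(g)} \kappa^{(g)},
  \qquad
  \Sigma^{(g)} = \Lambda^{(g)} \Phi^{(g)} \Lambda^{(g)\top} + \Theta^{(g)} \\[8pt]
  \text{Configural:} &\quad \text{same pattern of fixed and free loadings in every group} \\
  \text{Metric (weak):} &\quad \Lambda^{(1)} = \Lambda^{(2)} \\
  \text{Scalar (strong):} &\quad \Lambda^{(1)} = \Lambda^{(2)},\;
      \tau^{(1)} = \tau^{(2)} \\
  \text{Strict:} &\quad \Lambda^{(1)} = \Lambda^{(2)},\;
      \tau^{(1)} = \tau^{(2)},\;
      \Theta^{(1)} = \Theta^{(2)}
\end{align*}
```

Under the strict form, any group differences in observed means and covariances are carried entirely by the latent means $\kappa^{(g)}$ and latent covariances $\Phi^{(g)}$, which is the sense in which the subtest scores behave in the same way across populations.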
This inference stands in sharp contrast to the historical assumption, implicit in the development of many diagnostic batteries for special populations, that measurement invariance does not hold, namely, that the best tests of cognitive ability in healthy community samples are not the best tests of cognitive ability in clinical populations. Unfortunately, many of the special purpose assessment batteries have been examined with EFA or CFA not guided by theory. When examined in a theoretically guided framework, many of the special purpose batteries used in adult and child neuropsychology, for example, also support the wide generality of what we now call the C-H-C model (for some diverse examples see Gladsjo et al., 2004; Jewsbury et al., 2012; Vernon, 1950).
In the remainder of this commentary, several relatively technical issues highlighted by Weiss et al. (2013a, 2013b) will be discussed. The first concerns baseline model identification across diverse populations. The second issue concerns difficulties in deciding between the best fitting first- or second-order factor models with a finite number of test scores. The third issue concerns the practical impact of the finding of partial measurement invariance and ways to determine whether partial invariance has any practical impact on diagnostic classification.
Baseline Identification of a Factor Model for Invariance Analysis
In their studies of the WAIS-IV and the WISC-IV factor structure, Weiss et al. (2013a, 2013b) describe a detailed process of comparison of two primary alternatives in the adult and child samples, respectively. As noted above, in both samples, a four-factor model is contrasted with a five-factor model, the primary difference being that in the latter, five-factor model, the Wechsler Perceptual Reasoning Index is divided into two factors synonymous with C-H-C Fluid (Gf) and Visual Processing (Gv). There are also some variations in the arrangement of subtests under these Gf and Gv factors (see the respective Figures 2 in Weiss et al., 2013a, 2013b). Both the four- and five-factor models are presented as second-order models, that is, having a superordinate general ability factor, labeled as FSIQ.
For evaluation of measurement invariance it is generally recommended that a baseline model is established as best fitting in each sample to be included in the invariance analysis (Brown, 2006; Byrne, 1998; Widaman & Reise, 1997). This separate sample analysis is conducted prior to invariance analysis. The purpose of the separate-sample analysis is to avoid the potentially unwarranted assumption that the best fitting factor model described in one sample applies to another sample or population without thorough CFA verification that the model is the best fitting in every sample. Conceivably, a four-factor model fits best in a community sample but a five-factor model fits best in a clinical sample. Under such circumstances, measurement invariance is qualified and may only be described for a subset of the factors, if synonymous factors can be compared across models with different numbers of factors. In fact, baseline differences in the best fitting factor structure across diverse populations may be uncommon for cognitive ability batteries, further evidence of the robust generalizability of a C-H-C model of cognition (Bowden et al., 2011a; Gladsjo et al., 2004; Goldstein & Saklofske, 2010; Salthouse & Saklofske, 2010).
But there may be another good reason for an exhaustive search for the best fitting baseline model across separate samples. The reason lies in the observation that the best fitting factor-structure is sometimes more obvious in heterogeneous clinical samples because of greater variance in the observed scores and hence greater variation in the underlying factors. Population differences in the observed variance-covariance structure between observed test scores can assist in clarification of the best fitting factor structure, and it is possible, for example, that alternative four- and five-factor models do not differ in terms of goodness of fit in a community sample, but may show more obvious differences in goodness of fit in clinical or developmental samples (Brown, 2006; Kline, 2010).
A final point on baseline model estimation relates to the data used for analysis. Weiss and colleagues apparently used subtest scaled-scores for their analyses judging by the reported subtest means. There are some advantages to using raw scores for invariance analysis. First, the raw score covariance matrix differs from the scaled-score covariance matrix, the latter assuming that all subtests have equal variance. Therefore, there may be some loss of information in use of the scaled-scores, as opposed to raw scores. Information contained in the raw scores may have aided differentiation of alternative models during baseline testing and increased the sensitivity of the invariance analysis. For related reasons, it is preferable to undertake invariance analysis on raw scores because the finding of invariance will then carry through to transformed scores, for example, scaled-scores, in the populations for which invariance was demonstrated. Using transformed scores as the data for analysis makes these inferences less clear (Widaman & Reise, 1997). Although use of raw scores and examination of the best fitting models in the clinical samples may not have changed the results, these considerations deserve careful attention in future C-H-C and Wechsler Scale model refinement studies.
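The point about equal variances can be illustrated with a small sketch. It assumes, purely for illustration, that a scaled score is a linear rescaling of the raw score to a mean of 10 and a standard deviation of 3; actual Wechsler scaled scores are derived from age-based norms, so this is a simplification. The data and function names are hypothetical.

```python
import math


def covariance(x, y):
    """Sample covariance with an (n - 1) denominator."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def rescale(scores, target_mean=10.0, target_sd=3.0):
    """Linear rescaling to a fixed mean and SD (a simplification of norming)."""
    mean = sum(scores) / len(scores)
    sd = math.sqrt(covariance(scores, scores))
    return [target_mean + target_sd * (s - mean) / sd for s in scores]


# Hypothetical raw scores on two subtests with different score ranges.
raw_a = [12, 15, 9, 20, 14, 17]
raw_b = [30, 41, 22, 55, 35, 47]
scaled_a, scaled_b = rescale(raw_a), rescale(raw_b)

raw_variances = (covariance(raw_a, raw_a), covariance(raw_b, raw_b))
scaled_variances = (covariance(scaled_a, scaled_a), covariance(scaled_b, scaled_b))

print("raw variances:   ", raw_variances)     # differ between subtests
print("scaled variances:", scaled_variances)  # both forced to 3 squared
```

After rescaling, both subtests have the same variance by construction, so the scaled-score covariance matrix carries only the correlational information and none of the differences in raw-score spread.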
First- or Second-Order Models?
Another question with a bearing on the practical impact of the results reported by Weiss et al. relates to the analysis of second-order factor models (see the respective Figures 1 and 2 in Weiss et al., 2013a, 2013b). As noted, for all their models Weiss and colleagues assumed that a second-order factor, general cognitive ability (labeled FSIQ), was superordinate to the first-order factors in the alternative four- or five-factor models. In addition, in both of the five-factor models tested in WAIS-IV and WISC-IV, respectively, Weiss and colleagues interposed another level of factor under the Gf or Gv factor.
Although it is common to encounter descriptions of “higher order” cognitive abilities in diverse assessment settings, together with informal attempts to distinguish higher order behavior from lower order behavior in assessment approaches (e.g., Lezak, Howieson, & Loring, 2004), the statistical requirements to unambiguously demonstrate the presence and advantages of higher order latent factors are awkward and preclude easy analysis in many applied contexts. The reason for this difficulty lies in the requirements for “statistical identification” of factors. In factor analysis, any factor that is measured by only three unique subtest scores is known as “just identified.” Here, unique implies that the subtest scores do not share cross loadings with other factors. If any factor is measured by fewer than three unique subtests, the factor is termed “underidentified” and commonly model estimation is difficult or unsuccessful (Brown, 2006; Rindskopf & Rose, 1988). Most contemporary CFA software would flag model estimation problems, or the results obtained would indicate model misspecification, requiring respecification of the model to be tested. If a first-order model with at least one just-identified factor is compared to a second-order model, which differs only in terms of the addition of one or more superordinate factors, then the alternative first- and second-order models will not be statistically distinguishable. If the model is underidentified, factor loadings or other model parameters may show clear misspecification or other estimation problems.
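The counting rule behind these identification labels can be sketched for a single factor considered in isolation. The sketch assumes no cross loadings or correlated residuals and fixes the factor variance to one to set the scale; the function names are hypothetical. In a larger model, covariances with other factors can supply additional information, so this is a simplified illustration rather than a general identification test.

```python
def one_factor_df(n_indicators: int) -> int:
    """Degrees of freedom for a single factor with n unique indicators.

    Observed information: n(n + 1) / 2 variances and covariances.
    Free parameters: n loadings + n residual variances
    (factor variance fixed to 1 to set the scale).
    """
    observed = n_indicators * (n_indicators + 1) // 2
    free_parameters = 2 * n_indicators
    return observed - free_parameters


def identification_status(n_indicators: int) -> str:
    """Label the factor by the sign of its degrees of freedom."""
    df = one_factor_df(n_indicators)
    if df < 0:
        return "underidentified"
    if df == 0:
        return "just identified"
    return "overidentified"


for k in (2, 3, 4):
    print(k, one_factor_df(k), identification_status(k))
```

With zero degrees of freedom the factor reproduces its own variances and covariances exactly, so it contributes nothing to a test of fit, which is why adding a superordinate factor above such a structure cannot be evaluated on goodness-of-fit grounds.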
Inspection of the alternative four- and five-factor models reported for the WAIS-IV and WISC-IV by Weiss and colleagues (see Figures 1 and 2 in the respective studies) shows that statistical identification or specification problems may have been encountered. For the four-factor model reported for the WAIS-IV, the Working Memory factor is just identified. For the WISC-IV, the Working Memory and Processing Speed factors are just identified. As a consequence, there is no particular advantage in fitting a second-order model because the results do not provide useful evaluation of the second- as opposed to first-order model. This is not the same as saying that the second-order model is wrong. Rather, the second-order model is a theoretical prediction based on the hierarchical model of cognition described in line with C-H-C theory (McGrew, 2009). But the hypothesized factor structure precludes a useful test of the relative goodness of fit of the second-order model compared to the first-order model with the finite set of subtest scores available for analysis. The simplest solution under these circumstances is to set aside the second-order model, noting that the data do not permit a useful test, and focus on detailed evaluation of the first-order model.
More obvious estimation problems were evident with the second-order, five-factor models (the respective Figures 2 in Weiss et al., 2013a, 2013b) and suggest that underidentification or model misspecification led to unrealistic parameter estimates that should have led to model respecification. In the case of the five-factor model for the WAIS-IV, the second-order factor loading from FSIQ to Fluid Reasoning (FRI) is shown as .99. Although standard errors are not reported, this factor loading is not likely to differ significantly from one. For the corresponding model of the WISC-IV (Figure 2), the factor loading from FSIQ to FRI was initially reported as greater than one, and was then fixed to one for subsequent analysis. For both five-factor models, the second-order factor loadings indicate, in effect, that FSIQ and FRI are the same latent construct. Under these circumstances, one strategy would be to combine these constructs but the authors are not likely to have wanted to convey the impression that FRI is the same as FSIQ or general intelligence. Instead, a better strategy would have been to eliminate the second-order factor (FSIQ) for the same reasons as suggested for the second-order, four-factor model above, and carefully evaluate the first-order five-factor model without the unnecessary estimation problems.
The argument about admissible models and realistic parameter estimates is more than esoteric factor-analytic detail because the practical implications of interpreting alternative four- and five-factor solutions are complex, involving reformulation of an overlapping set of subtest scores corresponding to Perceptual Reasoning, if one accepts the four-factor solution. Instead, if one prefers the five-factor solution, then interpretation of the same subtests involves Fluid Reasoning, Perceptual Organization, and Quantitative Reasoning (for the WAIS-IV) or Inductive Reasoning (for the WISC-IV). Previous analyses of the WAIS-IV data have suggested that a five-factor solution distinguishing a Visual Processing factor from Fluid Reasoning (Gf) is not admissible and should not be considered a viable alternative (Bowden et al., 2011a; Bowden, Saklofske, & Weiss, 2011b). The utility and viability of any five-factor solution for the WAIS-IV and WISC-IV deserve additional study with a careful delineation of theoretical predictions versus the inevitable practical limitations of modeling a finite set of subtest scores.
Practical Impact of Alternative Factor Solutions and Partial Measurement Invariance
Weiss and colleagues (2013a, 2013b) report partial measurement invariance from their study of the WAIS-IV across the standardization and clinical samples. Specifically, they reported a failure of invariance for WAIS-IV subtest intercepts of Symbol Search and Block Design. For WISC-IV Word Reasoning, failure of residual invariance was observed but this finding has no practical implications for assessment. Weiss and colleagues (2013a, 2013b) are correct to note that there is some uncertainty regarding the best criteria by which to judge failure of measurement invariance although the criteria are undergoing regular revision (Meade, Johnson, & Braddy, 2008). The most important aspect of the finding of partial failure of measurement invariance arising from intercept differences is to determine the extent to which the failure has an impact on classification accuracy. It is possible that the partial invariance is of little or no practical impact but only direct evaluation of the question provides clarification of this problem.
Millsap and Kwok (2004) have described a method for evaluating the practical impact of partial measurement invariance in terms of diagnostic classification rates obtained under conditions of full versus partial measurement invariance. This method may be of value in many applied settings because some statistically significant failure of invariance might be expected when a small-sample statistic such as χ² is applied to large sample studies (Meade et al., 2008). To date Millsap and Kwok’s method has been little used but provides an important avenue by which to inform clinicians regarding the classification impact of partial measurement invariance. In a recent application, Alkemade, Bowden, and Salzman (2012) found partial measurement invariance for a four-factor model of the MMPI-2 Hs scale across the MMPI-2 standardization versus traumatic brain injury samples. However, analysis of classification accuracy under conditions of full versus partial invariance using Millsap and Kwok’s (2004) method demonstrated no significant change in classification accuracy (sensitivity and specificity) between invariance conditions. This example shows that some degree of partial measurement invariance may have no impact on the classification accuracy of a scale, but this question needs to be evaluated directly for each scale if partial invariance is observed.
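The logic of comparing classification accuracy under full versus partial invariance can be illustrated with a small Monte Carlo sketch. This is not Millsap and Kwok’s (2004) analytic procedure; it is a simplified simulation in which one indicator’s intercept is shifted in the group of interest. The loadings, the cut-off, the size of the shift, and the function name are all hypothetical values chosen for illustration.

```python
import random


def classification_rates(intercept_shift, n=5000, seed=1, latent_cut=-1.0):
    """Sensitivity and specificity of a composite-score cut-off.

    A person is 'truly low' if the latent ability is below latent_cut, and
    'flagged' if the sum of three indicators is below the composite score
    expected at that latent cut under full invariance.
    intercept_shift is added to one indicator to mimic partial invariance.
    """
    rng = random.Random(seed)  # fixed seed: same draws for every condition
    loadings = (0.8, 0.7, 0.6)  # hypothetical standardized loadings
    composite_cut = latent_cut * sum(loadings)
    tp = fp = tn = fn = 0
    for _ in range(n):
        ability = rng.gauss(0.0, 1.0)
        composite = intercept_shift
        for lam in loadings:
            residual_sd = (1.0 - lam ** 2) ** 0.5
            composite += lam * ability + rng.gauss(0.0, residual_sd)
        truly_low = ability < latent_cut
        flagged = composite < composite_cut
        if truly_low and flagged:
            tp += 1
        elif truly_low:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


full = classification_rates(intercept_shift=0.0)
partial = classification_rates(intercept_shift=-0.2)
print("full invariance:    sensitivity %.3f, specificity %.3f" % full)
print("partial invariance: sensitivity %.3f, specificity %.3f" % partial)
```

Because both conditions reuse the same random draws, any difference in sensitivity or specificity reflects the intercept shift alone; whether a difference of a given size matters clinically is a separate judgment.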
On a different note, the ambiguous conclusions of Weiss and colleagues regarding the relative merits of the four- and five-factor solutions for the WAIS-IV and WISC-IV need to be interpreted with respect to contemporary approaches to elucidation of clinically significant differences between Index scores. Contemporary approaches facilitate individual interpretations anchored in a theoretical framework (Gottfredson & Saklofske, 2009). For example, interpretation of WAIS-IV subtest scatter now reflects a more cautious, base-rate oriented method of interpretation, seeking to minimize the false-positive rate of “abnormality.” Lichtenberger and Kaufman (2009) provide an excellent example of a cautious, empirically oriented approach to detection of clinically meaningful differences between WAIS-IV Index scores and within-Index subtest scatter. If used in conjunction with the scoring software provided by the test publishers (Wechsler, 2003, 2008), then the best standardization sample data can be used to guide alternative reformulations of WAIS-IV scores on a case-by-case basis.
Conclusions
The studies of Weiss and colleagues should be welcomed by psychologists concerned to refine cognitive ability assessments, and their work provides a good model for improved theoretical understanding of some of the most widely used tests of cognitive ability in adults and children. As Weiss et al. show, we may have moved beyond the era when discussion of the theoretical model underlying alternative cognitive ability batteries required the adoption of alternative, test-specific, latent-construct terminology. Instead, we can adopt shared terminology, centered on the C-H-C model, and focus on refinements including divergence and overlap of alternative test battery approaches with respect to a unified model of cognitive abilities.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
