Abstract
This discussion article addresses issues related to expansion of the Wechsler model from four to five factors; multiple broad CHC abilities measured by the Arithmetic subtest; advantages and disadvantages of including complex tasks requiring integration of multiple broad abilities when measuring intelligence; limitations of factor analysis, which constrain test developers to creating specific broad and narrow abilities as opposed to integrative tasks; implications from brain imaging research showing the critical role of neurological pathways that integrate brain regions; close relationship of the fluid reasoning factor to g, and the inadequacies of factor analytically driven statistical definitions of g in the development of improved models of intelligence. In this rejoinder to the commentaries in this special issue on structural models of the WAIS-IV and WISC-IV, the advantages and disadvantages of Schmid–Leiman’s transformation, which removes the effects of g on the broad abilities, and the use of nested or bifactor models in evaluating models of intelligence are also discussed.
This special issue is important because it contributes substantially to an ever-increasing body of knowledge that will allow us to build better models of intelligence in future editions of our intellectual assessment tools. We greatly appreciate the time and thought that each commentator has put into their reviews, and we are humbled by the largely positive attention that our work has received. We are especially grateful to Alan Kaufman for his insightful and balanced integration of the eight commentary papers (Kaufman, 2013). We divide our rejoinder into two broad sections on conceptual and practical issues, and technical and methodological issues.
Conceptual and Practical Issues
The purpose of our two target articles in this special issue (Weiss, Keith, Zhu, & Chen, 2013a, 2013b) was to understand the constructs measured by the WISC-IV and WAIS-IV, and to determine whether the same constructs are measured in mixed clinical groups. Many of the criticisms of our articles relate to issues that we readily acknowledge are also important but go beyond the purpose of the studies. From the WAIS-IV article: “The purposes of this research were to determine the constructs measured by the test and the consistency of measurement across large normative and clinical samples.”
Competent assessment includes more than test scores, no matter how psychometrically sound they may be. As Schwartz (2013) points out, clinicians must also consider the client’s environment, developmental trends, effort, individual differences in approaching the same task, and various conative factors such as drive, desire, and volition. We agree! Still, it is too much to ask, as Schwartz does, that all these issues be addressed in an article about factor structure. Weiss and colleagues cover these important applied topics in their five textbooks on clinical use and interpretation of WISC-IV and WAIS-IV, and interested readers are referred to those volumes (Holdnack, Drozdick, Weiss, & Iverson, 2013; Prifitera, Saklofske, & Weiss, 2005, 2008; Weiss, Saklofske, Coalson, & Raiford, 2010; Weiss, Saklofske, Prifitera, & Holdnack, 2006); see also Kaufman and colleagues’ excellent texts (Flanagan & Kaufman, 2009; Lichtenberger & Kaufman, 2012). We assume that the information contained in our target articles will be viewed in the broader context of competent clinical assessment rooted in the study of individual differences.
Schwartz pointedly questions why the clinical sample sizes we used are smaller than those reported in the technical manuals. This is easy to answer. We included only participants who took all subtests—because our analyses, especially for the five-factor model, benefit from including all core and supplemental subtests. He further argues that our clinical samples are unrepresentative of the clinical disorders from which they were drawn, making them useless for understanding patterns that may be displayed by any particular clinical group. We agree, in part. And this is one reason we did not report CFA models separately for each clinical disorder. We sought only to evaluate if the model held in a mixed clinical sample as may be referred for assessment in most schools, clinics, and hospitals. This is an important step in a research program concerning the clinical utility of the five-factor model, and research should continue into the performance of well-defined clinical groups on the fifth factor.
We use both core and supplemental subtests to better analyze the structure of the tests by including as many marker variables for each factor as possible. While this is best practice for research, we recognize that it is unlikely that practitioners will administer all of the supplemental Wechsler subtests on a routine basis due to time and reimbursement constraints as well as fatigue effects with real patients as pointed out by Schwartz. Subsequent to our target articles, Niileksela, Reynolds, and Kaufman (2012) published a useful study showing that the five-factor solution for WAIS-IV can be created from only the 10 core subtests if Digit Span Forward, Backward, and Sequencing are each loaded separately on Gsm. In this model Gf is formed from Matrix Reasoning and Arithmetic, and Gv is formed from Block Design and Visual Puzzles. The model was found to fit the data well in older adults and also across the life span of 16 to 90 years of age. This study is important because it is the first to extend the WAIS-IV five-factor model to older adults. It is also important because it allows practitioners to apply a five-factor model to adults of any age without having to administer more subtests than are necessary to obtain the FSIQ. These authors point to norms that have been created for these factors, thus providing a practical clinical application of the CFA results.
Several commentators in this issue, especially Flanagan, Alfonso, and Reynolds (2013), have suggested that the Wechsler tests should be expanded to include as many CHC broad and narrow abilities as possible in future editions. However, Goldstein recommends caution in that regard as the Wechsler model has demonstrated proven clinical validity for multiple decades. In reviewing the history of models of intelligence, Grégoire (2013) helps us remember that these models are constantly evolving. As popular and useful as CHC theory is today, the pace of science continues unabated and it is unlikely that future generations of intelligence researchers will look back across the decades and view CHC as the final word in intelligence theory. Although CHC-derived construct expansion of the Wechsler tests has merit and should be considered, it should simultaneously be kept in mind that CHC is an empirical model derived from factor analyses—which has limitations. We agree with Flanagan et al. that the inclusion of more variables in factor analyses will help us more fully understand the tests’ structure. Such cross-battery analyses will be especially useful in understanding complex tests such as Arithmetic, and narrow abilities, such as our hypothesized Induction factor. The reference variable CB-CFA approach illustrated by Reynolds and Keith (2013) should be especially useful in this regard.
Nevertheless, factor analysis (much like coefficient alpha) is only a measure of the internal coherence of a test’s structure and has little to say about external validation or clinical usefulness. As several commentators on our articles imply, coherent factors are a necessary but insufficient condition for external validation. More importantly, factor analytic techniques place high value on unidimensionality and sometimes on orthogonality as well. In other words, tasks must measure only one thing and be unrelated to other things (except via g). As such, a blind adherence to factor analytic methods in test development promotes the proliferation of an ever-increasing number of narrow band tasks that are relatively independent of each other and lead to longer test batteries. Moreover, factor analytic methods punish researchers (with cross loadings and lower model-data fit statistics) who develop complex, multidimensional tasks that require the simultaneous processing and integration of more than one cognitive ability.
These tried and true statistical methods stand in stark contrast to an exploding new area of brain imaging research into the strength of white matter transmission between regions of the brain and its causal relationship to individual differences in intelligence. Evidence is emerging that one of the largest networks in the brain—the fronto-parietal control network—is central to cognitive control and fluid reasoning (Cole & Schneider, 2007; Jung & Haier, 2007). P-FIT, as this theory is known, suggests that fluid intelligence is related to well-functioning white matter connections, which allow fast and orchestrated information transfer between brain regions (Haier, 2012). Given that neuropsychological studies also point to the importance of central executive control guiding integration of cognitive resources, such integrative tasks should not be ignored—especially as they seem to have higher g loadings.
This is the so-called “problem” of the Arithmetic subtest as a measure of Gf, Gsm, and possibly Gc—which was referred to by several commentators. From our perspective, the “messy” multidimensionality of Arithmetic is not a problem at all. In fact, the very high loading of Arithmetic on g is an opportunity. Kaufman (2013) makes an impassioned case for retaining Arithmetic in the Wechsler tests, reminding us that Binet believed multidimensional tasks were the key to intelligence and that Wechsler knew full well that Arithmetic tapped more than one ability. We agree completely, and recommend the routine administration of the Arithmetic test. Intriguingly, Schneider (2013) suggests that Arithmetic should contribute to FSIQ directly rather than through any one of the first order factors, and this is worth further consideration. Rather than continue to argue about what factor Arithmetic belongs to, we should seek to understand why the integrative demands of complex multifactorial tasks like Arithmetic are so highly g saturated and which brain pathways are activated when performing such tasks. For those who do want to understand what factor Arithmetic belongs to, the cross-battery approach advocated by Flanagan et al. provides the best bet in gaining that understanding.
In our target articles, Arithmetic loads predominantly on a factor we call fluid reasoning (Gf), which our data show is virtually redundant with psychometric g. Far from “troubling,” this finding of a near-perfect relation between g and Gf is fairly common (e.g., Gustafsson & Balke, 1993; Keith & Reynolds, 2012; Niileksela et al., 2012). As pointed out by Bowden (2013), one interpretation of this finding is that a higher order g factor is redundant and unneeded. Another alternative, apparently the preference of Canivez and Kush (2013), is that Gf is redundant. We tend toward the explanation that g likely represents historical, accumulated Gf (for more information, see Reynolds & Keith, 2013). That is, investment of g resources in particular directions may, over time, result in greater development of particular abilities in individuals.
Psychometric g is a statistical definition of general intelligence that is extracted from whatever subtests are included in the battery. Perhaps a more conceptually appealing definition of general intelligence, or g, involves fluid reasoning as the ability to integrate multiple cognitive abilities from different brain regions in the service of solving novel problems. As Grégoire reminds us, Gf should not be considered just one among a set of five or seven equally important broad abilities, but Gf (and Gc) have a special status in the original Cattell-Horn model. And, as neuropsychologists have been implying for decades, simply summing scores from a multitude of narrow band abilities is certainly not the same thing as performance on a task that requires real time integration of those abilities. Perhaps Gf, when conceptualized as an integrative ability, is the ecological g that has eluded researchers for more than a century.
Our point is simply this: Factor analysis should be considered as one important tool when developing tests but not the ultimate criterion for test development in all situations. To be clear, however, we need both narrow band and integrative task types to assess patients for different purposes: diagnostic evaluations of specific impairments and prediction of real-world cognitive functioning, respectively. Both approaches are useful, but require different validation strategies.
Schwartz states provocatively, “There is no empirical support for a unitary construct or ability that contributes to successful performance of a given subtest on WAIS-IV or WISC-IV.” We think Canivez and Kush would strongly disagree. Their article in this issue clearly shows that g is the unitary construct, which strongly predicts performance on all factors and subtests. Canivez and Watkins (2010a, 2010b), which we overlooked in our original literature review, found that all WISC-IV and WAIS-IV subtests are “… properly associated with their four theoretically proposed first order factors.” However, they also stated that exploratory factor extraction criteria support only one or two factors and that the second-order g factor accounted for the greatest portions of the total and common variance. We would question these authors’ decision to use exploratory rather than confirmatory factor analyses because the conceptual model was clear. They concluded, “The modest portions of variance attributed to the first-order factors may be too small to be of clinical importance despite their CFA support.” We point this out because it relates to a dichotomy in the field in terms of the perceived value of g relative to the broad factors. While one camp of psychometric researchers aligned with Canivez and Watkins argues that g is the only score worth interpreting, other eminent researchers have argued that structural and developmental evidence do not support g (Horn & Blankson, 2005). Many clinical neuropsychologists such as Schwartz have long argued that g is a meaningless composite of various disparate abilities and should not be calculated, much less interpreted (Hale, Fiorello, Kavanagh, Holdnack, & Aloe, 2007; Kaplan, 1988; Lezak, 1988; Luria, 1979). We do not completely agree with either camp. Both views have merit, but both views are too one-sided. Both g and the broad abilities are important (Keith, 1994).
What is a practitioner to do? We humbly suggest that it depends on the referral question. When the purpose of the evaluation is to efficiently predict a broad range of cognitively driven behaviors then g, as currently defined, is always the best bet. Furthermore, we think that heterogeneous tasks that require integration of multiple abilities for successful performance will enhance the ecological validity of predicting a broader range of cognitively driven, real-world behaviors. On the other hand, examining particular broad and narrow abilities is necessary when evaluating clients for specific cognitive impairments, neurological insults, or injuries. Thus it is not one approach or the other, especially because strength and weakness interpretations vary in the context of the patients’ overall level of g as evidenced by differing frequencies of discrepant indexes by ability level (Wechsler, 2003, 2008). There also is evidence that one or more low scores in a profile is commonly observed among healthy individuals and therefore practitioners should be cautious when interpreting low scores as conclusive evidence of brain injury or disease in forensic evaluations (Brooks, Holdnack, & Iverson, 2011). The common finding of one or more low index scores in normal participants, however, suggests that at the level of the individual, g does not manifest itself equally across the broad cognitive abilities. Whether for reasons of environmental opportunity or personal and vocational interest, individuals appear to invest g resources selectively, thereby developing some broad abilities at the expense of others over time (cf. Cattell, 1987; Kvist & Gustafsson, 2008; Reynolds & Keith, 2013). This is one reason that clinicians find it difficult to conceptualize the broad abilities independent of g, although it is possible to accomplish statistically as discussed below.
As Claeys (2013) observes, “No one is more aware that test factors don’t always ‘hang together’ than those assessing children and adults on a daily basis.” How true!
Technical and Methodological Issues
As Kaufman (2013) observed, “The authors of the eight response articles, almost universally, praised Weiss and colleagues for the high quality of their analyses and the thoughtfulness of their interpretation.” The Canivez and Kush (2013) article stands as an outlier among this set of largely positive commentary articles for the harshness of its criticisms. However, many of those criticisms relate to our purported “failure” to employ methods for evaluating the effects of the first-order factors with the influence of g removed (i.e., Schmid and Leiman’s transformation, and bifactor or nested models). We did not employ these methods because the aim of our research study was not to evaluate the predominance of g over the broad abilities. As previously stated, our purpose was to determine the constructs measured by these two instruments, and if those constructs generalize to mixed clinical samples. In many ways, the Canivez and Kush article is an attempt to shift the discussion from a debate about four versus five factors to a debate about whether any factor is interpretable given the overriding predominance of g. Since that was not the purpose of our research, we will not engage extensively in that debate. We will simply note that we did not downplay the importance of general intelligence relative to the broad abilities as they suggest; after all, g sits at the apex of every model we tested! We strongly believe that psychologists should interpret the broad abilities within the context of the individual’s general level of intelligence, consistent with a broader view of competent clinical assessment as discussed in the textbooks cited above.
As shown, among other places, in another analysis of WAIS-IV data (Niileksela et al., 2012), the S-L transformation of a higher order model and the results of a bifactor (also referred to as a nested-factor or a direct hierarchical) model are likely to produce similar results. We believe that the higher order S-L approach has advantages over the bifactor approach. Most importantly, the use of the S-L transformation with a higher order model makes it clear that the transformed loadings show the effects of the broad abilities with the effects of g removed. There is nothing wrong with assuming, as this approach does, the interpretive predominance of g, but it should be clear that such an assumption is being made. Furthermore, those who build this assumption into the analysis and who then claim the results support the interpretive predominance of g are committing the logical fallacy of begging the question. The higher order S-L approach allows one to examine both the effects of the broad abilities and the effects of the broad abilities with g removed. The bifactor approach allows only the second of these (see Keith, Low, Reynolds, Patel, & Ridley, 2010, and Reynolds & Keith, 2013, for more detail). Both views of the data are appropriate for different purposes.
Using an S-L transformation as recommended by Canivez and Kush, the Niileksela et al. team concluded the opposite from Canivez and Kush: “Both g and first order factors had important effects on test scores” (p. 6). Furthermore, the Flanagan et al., Schneider, and Niileksela et al. articles stand in stark contrast to the Canivez and Kush article by strongly encouraging clinical interpretation of the five first-order factors. As Schneider noted with respect to removing g influences on the broad abilities when evaluating those abilities, “… the independent portion is not the ‘real Gc.’ We care about a sprinter’s ability to run quickly, not residual sprinting speed after accounting for general athleticism. So it is with Gc: g is a part of the mix” (p. 188).
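To make the two views of the data concrete, the arithmetic of the S-L transformation can be sketched for the simplest case. The sketch below assumes a higher order model with simple structure (each subtest loads on exactly one broad ability) and standardized loadings; the function name and all loading values are hypothetical and chosen only for illustration, not taken from our analyses. For each subtest, the loading on g is the product of its first-order loading and the broad ability's loading on g, and the loading on the broad ability with g removed is the first-order loading multiplied by the square root of the broad ability's unique variance.

```python
import math


def schmid_leiman(first_order, second_order):
    """Schmid-Leiman transformation for a simple-structure higher order model.

    first_order:  dict mapping subtest -> (broad ability, standardized loading)
    second_order: dict mapping broad ability -> standardized loading on g
    Returns a dict mapping subtest -> (g loading, residualized broad loading).
    """
    transformed = {}
    for subtest, (factor, loading) in first_order.items():
        g_path = second_order[factor]
        # Effect of g on the subtest, transmitted through the broad ability.
        g_loading = loading * g_path
        # Effect of the broad ability after its g variance is removed.
        residual_loading = loading * math.sqrt(1.0 - g_path ** 2)
        transformed[subtest] = (g_loading, residual_loading)
    return transformed


# Hypothetical loadings for illustration only.
first_order = {
    "Matrix Reasoning": ("Gf", 0.80),
    "Vocabulary": ("Gc", 0.85),
}
second_order = {"Gf": 0.95, "Gc": 0.80}

for subtest, (g_loading, residual) in schmid_leiman(first_order, second_order).items():
    print(f"{subtest}: g = {g_loading:.2f}, broad ability with g removed = {residual:.2f}")
```

In this sketch the squared g loading and the squared residualized loading sum to the squared first-order loading, so the transformation repartitions the variance explained by the broad ability rather than discarding any of it. A broad ability that loads very highly on g, as the hypothetical Gf does here, necessarily retains little variance once g is removed.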
Bowden, noting only “minor caveats” with our methodology, provides much good advice on CFA and how to handle some very technical issues, and there is much to agree with. We only disagree on two small points. First, we don’t think “under-identification or model misspecification led to unrealistic parameter estimates that should have led to model re-specification” (p. 153, referring to high loadings of Gf on g). This is a common finding and one that appears even when there are multiple measures of Gf (Reynolds & Keith, 2013). Second, we disagree that
If a first-order model with at least one just-identified factor is compared to a second-order model which differs only in terms of the addition one or more super-ordinate factors, then the alternative first- and second-order models will not be statistically distinguishable. (p. 152)
We assume that by statistically distinguishable Bowden is referring to being able to differentiate the higher order versus the first-order model based on fit, what are commonly called nonequivalent models (Brown, 2006; Keith, 2006, chap. 12). Bowden’s discussion of identification is useful, but on this point we disagree. To demonstrate, we simulated a higher order model of 16 tests measuring 8 first-order factors, which were, in turn, reflections of a single second-order factor. Data generated from this model were subjected to CFA of a (correct) higher order model and of a first-order model with correlations among factors. The fit statistics for the two models indeed differed, demonstrating that they are statistically distinguishable and nonequivalent (in our experience, first-order models almost always fit better than higher order models). The reason that the second-order portion of the model is capable of unique estimation is that the second-order structure is essentially estimated via the covariances of the first-order factors. If there are three first-order factors, then the second-order structure is just-identified (assuming that the factors are correlated), but if there are four or more factors, the second-order portion of the model is overidentified and contributes to the χ² and other fit indexes. Although it is always desirable to have more than two indicators per first-order factor, it is generally not required given multiple correlated factors. Results of our simulation are available from the second author on request.
Grégoire expresses concern that the five-factor models do not clarify interpretation because of cross-loaded subtests on the Gf factor. Flanagan et al. also express concerns about the heterogeneity of the construct measured by the Arithmetic subtest, yet seem largely satisfied that the Wechsler tests sufficiently measure the five main domains of CHC. Flanagan et al. and Grégoire each note that full construct coverage of CHC is not achieved by Wechsler subtests alone and call for the creation of new Wechsler subtests to improve coverage of all CHC abilities. Yet Grégoire acknowledges the impracticality of administering a larger battery in practice.
Schneider (2013) suggests some excellent ideas to help practitioners estimate and evaluate latent factor scores. His method assumes that one should look at the latent factor scores with the effect of g removed (as in an S-L approach), construct validity-based confidence intervals (CIs) around those scores, and compare them to the CI around the estimated latent g score. This approach results in very wide CIs that make it difficult, as Schneider points out, to state with much certainty that even very large differences (e.g., 1½ standard deviations) are meaningful between a latent g score and a latent Gs score with g removed. If that is the case, then why bother to give a test at all? Certainly FSIQ could not fairly summarize the general intelligence of a child with such a diverse set of cognitive abilities. As described above, however, we believe it is equally valid to examine the latent broad abilities without g removed. This approach would result in considerably smaller standard errors around those scores. A further refinement to the Schneider proposal might involve using maximum likelihood methods to impute the latent Gf, Gc, and so forth scores in the standardization data. The CIs could then be calculated from the distribution of latent scores for a given composite score.
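The identification argument can be checked with a simple counting rule. The sketch below is not our simulation; it only tallies degrees of freedom for the second-order portion of a model under stated assumptions: a single second-order factor whose variance is fixed to 1, one freely estimated loading and one freely estimated disturbance per first-order factor, and no additional constraints. The function name is hypothetical. With k first-order factors there are k(k + 1)/2 variances and covariances among the factors and 2k second-order parameters, leaving k(k − 3)/2 degrees of freedom.

```python
def second_order_df(n_factors):
    """Degrees of freedom contributed by a single second-order factor.

    Assumes the second-order factor variance is fixed to 1, and that each
    first-order factor has one free second-order loading and one free
    disturbance variance, with no further constraints.
    """
    # Information: variances and covariances among the first-order factors.
    moments = n_factors * (n_factors + 1) // 2
    # Free parameters: one loading and one disturbance per first-order factor.
    parameters = 2 * n_factors
    return moments - parameters


for k in (3, 4, 5, 8):
    df = second_order_df(k)
    if df == 0:
        status = "just-identified"
    elif df > 0:
        status = "overidentified"
    else:
        status = "under-identified"
    print(f"{k} first-order factors: df = {df} ({status})")
```

Under these assumptions, three first-order factors yield zero degrees of freedom, whereas four or more yield a positive number, so a higher order model and a correlated first-order model can differ in fit. For a model with eight first-order factors, the same counting gives 20 degrees of freedom for the second-order structure.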
Schneider also provides a good discussion of confidence intervals and the likelihood that a score is above or below a particular level of interest. This is useful because our experience has shown that some practitioners have an overly rigid view of confidence intervals (CIs). They may think that a 95% CI which ranges from, say, 68 to 78, means that there is a 95% chance that the score is 68 and also a 95% chance that it is 78 as well as anything in between. This is absolutely not true. What is true is that there is a 95% chance that this range contains the true score, but the most likely true scores are closest to the middle of the CI, and importantly, there is far less of a chance that the true score is 68 or 78 and a much higher chance that it is very close to 73. Practitioners should keep this in mind when evaluating profiles with overlapping confidence intervals and consider what level of confidence is necessary given the clinical question being asked (e.g., litigation in forensic cases vs. hypothesis generation for intervention planning).
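The point about the middle of the interval can be illustrated numerically. The sketch below assumes that measurement error is normally distributed and that the 95% CI is symmetric about its midpoint; both are simplifying assumptions, and the function name is hypothetical. It recovers the standard error implied by the interval and compares the relative likelihood (normal density, scaled so that the midpoint equals 1) of candidate true scores.

```python
import math


def relative_likelihood(score, lower, upper, z=1.96):
    """Relative likelihood of a candidate true score within a symmetric CI.

    Assumes normally distributed error and an interval of +/- z standard
    errors around its midpoint. The value is scaled so the midpoint equals 1.
    """
    center = (lower + upper) / 2.0
    standard_error = (upper - lower) / (2.0 * z)
    deviation = (score - center) / standard_error
    return math.exp(-0.5 * deviation ** 2)


# The 95% CI of 68 to 78 used in the text above.
for candidate in (68, 70, 73, 76, 78):
    weight = relative_likelihood(candidate, 68, 78)
    print(f"true score {candidate}: relative likelihood = {weight:.2f}")
```

Under these assumptions, a true score at either limit of the interval is roughly one seventh as likely as a true score at the midpoint of 73, which is why the limits should not be read as equally plausible values.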
Final Thoughts
Lost in these technical discussions of the number of factors, the predominance of g, and the calculation of confidence intervals are our findings that both the four- and five-factor models were supported in samples of mixed clinical groups. As Bowden points out, “… the significance of the finding of something close to so-called measurement invariance (between clinical and nonclinical groups) … should not be underestimated.”
Applying the spectacles of history, we note that several independent research teams following different lines of inquiry are converging on a model of intelligence that includes at least five of the same main domains. We believe this convergence is ultimately reassuring for the progress our field is making as a science.
Footnotes
Declaration of Conflicting Interests
Drs. Weiss, Zhu, and Chen are employed at Pearson, which is the publisher of the WISC-IV and WAIS-IV.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
