Abstract
The practice of individual assessment has been moving toward the empirically derived Cattell–Horn–Carroll (CHC) theory of intellectual ability, which offers a hierarchical taxonomy of cognitive abilities. Current assessment tools vary in how closely they operationalize CHC theory, making clinical inference difficult. Expert consensus explicating the factors of contemporary tests has proven valuable by showing the promise of a multifactorial assessment of intellectual abilities. The Weiss et al. (2013) articles in this special issue represent the next evolution of this approach by providing empirical research detailing the test factors of the popular Wechsler tests, so that practitioners using those tests can base their clinical inferences on an evidence-based understanding of their measurement tools.
Background
In the past decade, we have witnessed an increasing adoption of the psychometrically derived Cattell–Horn–Carroll (CHC) theory of cognitive abilities in the field of school psychology (McGrew, 1997, 2005, 2009; Schneider & McGrew, 2012). It is tempting to view this as a paradigm shift, but that is a rather generous interpretation since it presumes that the field (at least in practice) had, even informally, adopted a scientific theory to explain the myriad of factors produced by test publishers. At best, our field was mired in what Kuhn (1962) described as the preparadigm phase. In short, our conceptual understanding of intelligence was moored to the tests themselves, not to an evidence-based theory. That is not to say that models detailing the structure of intelligence weren’t proposed, researched, and debated. From its very inception, intelligence, as we conceive of it today, was vested with a multifactorial hierarchical model that Alfred Binet called a “scheme of thought” (Binet & Simon, 1916). Binet’s proposed edifice included a general factor, four “broad” factors, and as many as 10 “narrow” factors. Unfortunately, Binet lacked the tools to unravel specific factors psychometrically, and he eventually resigned himself to measuring what he referred to as the sum total of higher mental processes. Other attempts to model intelligence followed under such names as Burt, Cattell, Guilford, Horn, and Vernon, some finding evidence of a general factor and others not. Those models never seemed to gain traction with practicing psychologists, not because of the scholarly nature of such models, but because tools that operationalized the theories didn’t exist. In other words, we had become a profession reliant more on tools than on ideas.
Carroll’s (1993) ambitious attempt to herd the unruly factors into a meaningful taxonomy avoided the fate of being a mere academic theory locked away in books and journals, spoken of only by scholars and idle dilettantes in coffee shops. This happened for a number of reasons. For one, his model appeared to replicate the considerable works of Cattell and Horn. Second, his model, while operationalized by at least one major battery of intellectual ability (i.e., Woodcock–Johnson III), could be applied using popular tests that lacked any theoretical orientation at all (McGrew, 1997). Most importantly, it seems that the timing of his work coincided with a major shift in how the field began to view learning disabilities. While the definition of a learning disability remained unchanged, the practice of operationalizing that definition had begun to recognize the polythetic nature of specific learning disabilities, in that there is no single universal criterion that can encapsulate identification of the disorder. Instead, there are multiple factors that may occur in varying degrees, or not at all. The outmoded use of simple mathematical discrepancies between a single generalized ability score and specific academic skills failed to measure the multifactorial dependencies that support the acquisition and application of specific academic skills such as reading, math, or written language (see Prifitera, Saklofske, & Weiss, 2008). Instead, practitioners are offered a number of reasonable methods for determining the presence of a learning disability, such as the Discrepancy/Consistency Model proposed by Flanagan, Ortiz, and Alfonso (2007), or the Concordance–Discordance model proposed by Hale and Fiorello (2004, p. 180; see also Berninger, O’Donnell, & Holdnack, 2008). Each of these models, which together might be referred to as Patterns of Strengths and Weaknesses frameworks, requires tools that are capable of capturing the multifactorial nature of learning.
The empirically derived CHC theory (a synthesis of the Carroll three-stratum theory and the Cattell–Horn Gf-Gc theory) appears to meet that need because it was based on an immense dataset and, instead of relying on a particular test to operationalize the theory, it was test agnostic; in fact, it appeared to describe nearly all tests regardless of their original intent (Keith, Kranzler, & Flanagan, 2001). The benefit of this approach is that it allows practitioners to continue using tests they are familiar with while slowly integrating the elements of CHC theory using expert consensus detailing the narrow and broad factors measured by contemporary tests. Furthermore, having a taxonomy such as CHC theory provides guidance for engaging in the fairly universal practice of interpreting scores across more than one test battery (Flanagan et al., 2007).
Wechsler Factor Structure
So it is within this context of theory and practice that the articles by Weiss, Keith, Zhu, and Chen (2013a) and Weiss, Keith, Zhu, and Chen (2013b) take on particular importance by providing users of legacy tests with empirical evidence of the factor structure vis-à-vis CHC theory. As noted in Weiss et al. (2013a), the Wechsler tests are by far the most popular tests of intellectual ability. The most recent version has some factor names that appear to line up directly with CHC factors; in particular, the verbal comprehension, processing speed, and working memory factors outwardly appear to be measures of Gc, Gs, and Gsm, respectively. Indeed, expert consensus has arrived at that very conclusion (Flanagan et al., 2007). However, the Perceptual Reasoning factor might leave some practitioners puzzled as to whether to interpret it as Gf or Gv. The two articles in this series by Weiss et al. (2013a, 2013b) conclude that the four-factor structure described in the technical manual (Wechsler, 2003) is sustainable for both the WAIS-IV and the WISC-IV. But both papers by Weiss et al. were able to extract two distinct factors from within the Perceptual Organizational factor that appear to be measures of Gf and Gv on both Wechsler tests and were thus labeled POI(Gv) and FRI(Gf). This was achieved on the WISC-IV by arranging the block design and picture completion subtests under a POI(Gv) factor, while the matrix reasoning, picture concepts, and arithmetic subtests combined to create an FRI(Gf) factor. On the WAIS-IV, the block design, visual puzzles, and picture completion subtests make up the POI(Gv) factor, and the matrix reasoning, arithmetic, and figure weights subtests make up the FRI(Gf) factor.
This new configuration of five CHC factors offers an enticing interpretation option. But the Weiss et al. articles also note a number of other important findings regarding the anatomy of the subtests under each factor composition. For example, the arithmetic subtest appears to have an uncomfortable fit within any single factor using the standard four-factor model of the Wechsler tests, loading on both the POI and WMI factors. However, the behavior of arithmetic under the five-factor model is tamer, largely being a measure of Gf on the WISC-IV and demonstrating an affinity for being a quantitative reasoning (RQ) factor (a narrow factor of Gf) on the WAIS-IV due to the addition of the figure weights subtest. Taken together, it appears that interpreting arithmetic as RQ is tenable on both the WAIS-IV and WISC-IV: the former due to explicit factor evidence described in Weiss et al. (2013a), and the latter by simply extrapolating the results of the WAIS-IV factor structure, which has better representation of this prospective narrow factor. Also, it’s hard to ignore that a cursory examination of the arithmetic items is suggestive of content that would measure the narrow factor RQ. And the fact remains that once factors are realized, the interpretative element is subjective.
In addition to the heterogeneous character of the arithmetic subtest, Weiss et al. (2013b) found additional subtest cross loadings on the WISC-IV, with matrix reasoning loading primarily on inductive reasoning and secondarily on visual processing; picture completion loading primarily on Gv but also on Gc; and symbol search having a large loading on Gs but a small loading on Gv. The WAIS-IV subtests also had a number of cross loadings (Weiss et al., 2013a), with figure weights being primarily a measure of RQ and a secondary measure of Gv, and matrix reasoning being largely a measure of Gf and a minor measure of Gv.
Taken together, it appears we are left with an irresolute five-factor model and a handful of dithering subtests. But this isn’t as problematic as one might suspect. No one is more aware that test factors don’t always “hang together” than those assessing children and adults on a daily basis. It is actually rather gratifying to see the results of these studies, which clearly illustrate that the borders of the factors are porous and that subtests often try to secede from their factors, both individually and en masse. The Weiss et al. articles shed light on one reason why this occurs, which is simply that subtests can measure more than one factor. But even subtests that appear to cleanly measure a particular factor can go rogue for any number of reasons that make a direct interpretation impractical. To be clear, the practice of assessment requires the evaluator to have knowledge about the nomothetic data their tools were designed upon, but individual assessment is an idiographic endeavor, in that we are aiming to describe the particular individual being assessed. Idiographic assessment is not antagonistic to nomothetic data. On the contrary, we attempt to explain the individual case within the broader context of general principles. More specifically, in the case of test data, we are charged with finding a convergence between the qualitative aspects of a particular case and the quantitative data derived from objective measures such as the Wechsler tests. The Weiss et al. articles provide examiners with some guidance in this regard. Since both the four- and five-factor models are tenable, the examiner could use either, depending on the outcome of the individual assessment data.
If, for example, in the course of examining the data from a WISC-IV evaluation, the matrix reasoning, picture concepts, and arithmetic subtests were significantly lower (or higher) than the block design and picture completion subtests, then it is reasonable to conclude that the five-factor structure better summarizes the individual assessment data. In cases when those subtests “hang together,” the five-factor model isn’t necessarily contraindicated, but the four-factor model might better summarize the data; though knowing your test is measuring five factors means that adding two more factors would give you a full complement of the seven canonical CHC broad factors. For those using the Wechsler tests, that means simply adding (sub)tests that include measures of Glr (long-term retrieval) and Ga (auditory processing) to yield the standard seven broad factors.
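The comparison just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a clinical decision rule: the function name suggest_model, the score dictionary, and the 3-point gap (one standard deviation on the subtest scaled-score metric) are hypothetical choices made for the example and do not come from the Weiss et al. articles. Only the subtest groupings are taken from the text above.

```python
from statistics import mean

# WISC-IV subtest groupings for the two factors split out of the
# perceptual factor, as described in the text.
FRI_GF = ("matrix_reasoning", "picture_concepts", "arithmetic")
POI_GV = ("block_design", "picture_completion")

# Hypothetical cutoff in scaled-score points. This is an assumption made
# for illustration only, not a published significance criterion.
ILLUSTRATIVE_GAP = 3.0


def suggest_model(scores, gap=ILLUSTRATIVE_GAP):
    """Return (model_label, gf_minus_gv) for one examinee's scaled scores."""
    gf_mean = mean(scores[name] for name in FRI_GF)
    gv_mean = mean(scores[name] for name in POI_GV)
    difference = gf_mean - gv_mean
    # A large gap in either direction suggests the Gf and Gv subtests are
    # not "hanging together" for this individual.
    model = "five-factor" if abs(difference) >= gap else "four-factor"
    return model, difference


# Hypothetical examinee whose Gf subtests fall well below the Gv subtests.
split_profile = {
    "matrix_reasoning": 6,
    "picture_concepts": 7,
    "arithmetic": 5,
    "block_design": 11,
    "picture_completion": 12,
}

# Hypothetical examinee whose perceptual subtests hang together.
flat_profile = dict.fromkeys(FRI_GF + POI_GV, 10)

print(suggest_model(split_profile))
print(suggest_model(flat_profile))
```

In practice the examiner would replace the fixed gap with the critical values appropriate to the test and would weigh the result against qualitative observations rather than treat the label as a conclusion.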
Dealing with subtests that load on more than one factor is a more challenging proposition to interpret, but studies such as those performed by Weiss et al. give the practitioner some empirically based hypotheses to consider. For example, the first iteration of evaluating the scores would be to examine the score profile using the primary loadings. In cases where no significant deviations occurred, a straightforward interpretation would be likely. For those cases where deviations occurred, the examiner would then look to the secondary loadings and determine whether the subtest was being pulled toward factors related to the secondary loading. In most cases, this iteration would need to be followed up with additional tests that measure the primary and secondary loadings of the rogue subtest. Once sufficient data have been gathered, it is the practitioner who ultimately interprets the scores based on a convergence of the data. To be sure, there are other considerations as well, such as a demands analysis and observations of various conative factors (e.g., motivation, drive) related to test performance, but there is no substitute for understanding what a particular test may be measuring for a given individual. That includes knowing that, in the case of the Wechsler tests, the structure behind the subtests is largely invariant across clinical and normative samples, as found in the Weiss et al. articles. This is an important consideration given that we’re generally trying to interpret test data from evaluations of clinical populations.
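The two-pass reading of a profile described above (primary loadings first, secondary loadings only for subtests that deviate) might be sketched as follows. The primary and secondary loadings are the WISC-IV ones reported in the text; everything else is an assumption made for illustration, including the function name follow_up_hypotheses, the rule of comparing each subtest with the mean of the other subtests sharing its primary factor, and the 3-point gap.

```python
from statistics import mean

# (primary, secondary) loadings for WISC-IV subtests as reported in the
# text. None means the text mentions no secondary loading.
LOADINGS = {
    "matrix_reasoning": ("Gf", "Gv"),
    "picture_concepts": ("Gf", None),
    "arithmetic": ("Gf", None),
    "block_design": ("Gv", None),
    "picture_completion": ("Gv", "Gc"),
}

# Hypothetical cutoff in scaled-score points, for illustration only.
ILLUSTRATIVE_GAP = 3.0


def follow_up_hypotheses(scores, gap=ILLUSTRATIVE_GAP):
    """List (subtest, primary, secondary) for subtests that deviate from
    the other subtests sharing the same primary factor."""
    flagged = []
    for subtest, (primary, secondary) in LOADINGS.items():
        # First pass: compare the subtest with its primary-factor peers.
        peers = [
            scores[other]
            for other, (other_primary, _) in LOADINGS.items()
            if other != subtest and other_primary == primary
        ]
        if not peers:
            continue
        if abs(scores[subtest] - mean(peers)) >= gap:
            # Second pass: the secondary loading (if any) becomes a
            # hypothesis to test with additional measures.
            flagged.append((subtest, primary, secondary))
    return flagged


# Hypothetical examinee: matrix reasoning sits with the Gv subtests
# rather than with the other Gf subtests.
profile = {
    "matrix_reasoning": 12,
    "picture_concepts": 7,
    "arithmetic": 7,
    "block_design": 12,
    "picture_completion": 11,
}

for subtest, primary, secondary in follow_up_hypotheses(profile):
    print(subtest, "deviates from", primary, "peers; consider", secondary)
```

A flagged subtest is only a prompt for follow-up testing of its primary and secondary factors; the interpretation itself still rests with the practitioner and on the convergence of all available data.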
Conclusion
Interpreting psychological data is largely an inductive task in that we are inferring general psychological principles from a specific case. Unlike deductive tasks, our inferential judgments are probabilistic and therefore evaluated as either strong or weak based on the evidence. In practice, having stronger evidence means having both more evidence and having that evidence converge toward a smaller solution space of potential conclusions. In the world of conditional reasoning, “it is widely agreed that inductive activity depends more on knowledge of the domain than on formal properties of the premises” (Politzer, 2003). So the strength of our interpretations relies on having intimate knowledge of both theory and the tools that operationalize those theories. Research such as that conducted by Weiss et al. provides the knowledge necessary to make reasoned inferences about test scores. Paul Meehl, in one of his many lamentations about the state of psychology as a science, once wrote “… in other sciences, powerful predictions can be mediated by theoretical constructs only when two conditions are met. First the theory is well worked out and well corroborated, having high verisimilitude, as Popper calls it; secondly, there exists a power technology of measurement” (Meehl, 1979, p. 564). The extant literature on CHC theory arguably meets the first criterion, and our tools have made a lot of gains toward meeting the latter. Studies like the ones in this series improve the power of our measurement technology by revealing important interpretative nuances that exist within subtests. That knowledge can then be leveraged into a better understanding of the strengths and needs of individual children. Through that understanding we can pursue interventions that are more individually tailored than we would otherwise using a single measure of intellectual ability that often means nothing obvious, while at other times, means obviously nothing.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
