Abstract
Surveys reveal that many school psychologists continue to employ cognitive profile analysis despite the long-standing history of negative research results from this class of practice. This raises the question: why do questionable assessment practices persist in school psychology? To provide insight on this dilemma, this article presents the results of a content analysis of available interpretive resources in the clinical assessment literature. Although previous reviews have evaluated the content of individual assessment courses, this is the first systematic review of pedagogical resources frequently adopted in reading lists by course instructors. The interpretive guidance offered across tests within these texts was largely homogeneous, emphasizing the primary interpretation of subscale scores, de-emphasizing interpretation of global composites (i.e., FSIQ), and advocating for the use of some variant of profile analysis to interpret scores and score profiles. Implications for advancing evidence-based assessment in school psychology training and guarding against unwarranted, unsupported claims in clinical assessment are discussed.
Despite their ubiquity (see Kranzler et al., 2016), intelligence (IQ) test use and interpretation are controversial (e.g., Beaujean & Benson, 2019; Fiorello et al., 2007; McGill et al., 2018), with some questioning whether IQ tests should be used at all (Gresham & Witt, 1997; see Fletcher & Miciak, 2017 for a more nuanced perspective). Even a casual inspection of the literature reveals numerous interpretive approaches available to aid clinicians as they navigate the complex array of primary and ancillary scores produced by modern IQ tests. For instance, there are numerous profile analysis-based interpretive systems (e.g., cross-battery assessment, levels-of-analysis approach) that, despite their popularity (Kranzler et al., 2020), are the subject of extensive discourse due to concerns over the “evidence-base” that has been extended by proponents to support their use.
A Brief Review of the Evidentiary Status of IQ Interpretation Strategies
It is beyond the scope of this article to discuss, at length, the evidentiary status of the various IQ interpretation strategies. However, this section serves as an overview of comprehensive reviews for interested readers and summarizes supporting evidence that certain interpretive strategies should be regarded as low-value practices. In de-implementation research, low-value practices are those that (a) lack evidence of effectiveness or are not efficacious, (b) are less effective or efficacious than another practice with the same function, (c) cause harm, or (d) are no longer necessary (e.g., McKay et al., 2018). With regard to (a) and (b), effective and efficacious are defined in the context of intelligence testing as strategies with diagnostic or treatment utility or incremental validity for predicting outcomes.
As such, we would conclude based on the available evidence that Stratum I (i.e., subtest-level; see Watkins, 2000, 2003) and Stratum II (composite-level; see McGill et al., 2018 and Watkins, 2000) profile analysis strategies are low-value practices because (a) they are not adequately supported by compelling empirical evidence and (b) alternative approaches such as low-inference assessment (e.g., CBM, functional assessment) better serve the intended function of these approaches to test interpretation (e.g., Fletcher & Miciak, 2017). Additionally, the practices of ignoring or giving little interpretive weight to Stratum III scores (i.e., global IQ scores) in favor of primary interpretation of Stratum II dimensions (e.g., see Kaufman & Lichtenberger, 2005), or of disregarding the Stratum III scores when constituent parts are significantly different (see McGill, 2017; Schneider & Roman, 2017), lack evidence of incremental validity and would also be classified as low-value practices. In a recent review evaluating psychometric and conceptual concerns regarding IQ test interpretive practices, McGill et al. (2018) concluded that primary interpretation of subscale scores may be misguided as independent structural validity studies indicate that many of these scores are not adequately located by popular IQ tests and, even when located, often lack sufficient unique reliable variance for confident clinical interpretation. Furthermore, although modern IQ tests are multidimensional, results from independent factor-analytic studies indicate that most of the reliable variance at all levels of IQ tests is explained primarily by general intelligence and not by Stratum II constructs. These results replicate similar shortcomings noted in previous reviews (e.g., Watkins, 2000). Put simply, the numerous shortcomings identified in the body of literature weaken most, if not all, of the foundational assumptions undergirding the use of profile analysis techniques.
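The variance-partitioning argument above can be made concrete with a small worked example. The sketch below assumes an orthogonal bifactor solution with standardized loadings and computes three commonly reported model-based indices: omega-hierarchical for the general factor, omega-hierarchical subscale for each group factor, and explained common variance (ECV). The loadings, the factor labels, and the function name `variance_summary` are invented for illustration and are not drawn from any test or study cited here.

```python
# Hypothetical standardized loadings from an orthogonal bifactor solution.
# Each pair is (general-factor loading, group-factor loading) for one subtest.
# These values are invented for illustration only.
LOADINGS = {
    "Group A": [(0.70, 0.30), (0.65, 0.35), (0.60, 0.25)],
    "Group B": [(0.75, 0.20), (0.70, 0.30), (0.60, 0.40)],
}


def variance_summary(loadings):
    """Return omega-hierarchical, ECV, and omega-hierarchical subscale values."""
    all_items = [pair for items in loadings.values() for pair in items]

    # Variance of the unit-weighted total score attributable to each source.
    general_part = sum(g for g, _ in all_items) ** 2
    group_part = sum(sum(s for _, s in items) ** 2 for items in loadings.values())
    unique_part = sum(1 - g**2 - s**2 for g, s in all_items)
    total = general_part + group_part + unique_part

    # Share of total-score variance due to the general factor alone.
    omega_h = general_part / total

    # Share of common (general + group) variance due to the general factor.
    ecv = sum(g**2 for g, _ in all_items) / sum(g**2 + s**2 for g, s in all_items)

    # Share of each composite's variance that is unique to its group factor.
    omega_hs = {}
    for name, items in loadings.items():
        g_sum = sum(g for g, _ in items)
        s_sum = sum(s for _, s in items)
        u_sum = sum(1 - g**2 - s**2 for g, s in items)
        omega_hs[name] = s_sum**2 / (g_sum**2 + s_sum**2 + u_sum)

    return {"omega_h": omega_h, "ecv": ecv, "omega_hs": omega_hs}


if __name__ == "__main__":
    print(variance_summary(LOADINGS))
```

With these invented loadings, omega-hierarchical is roughly .79 while both omega-hierarchical subscale values fall below .15. This is the kind of pattern the reviews cited above describe, in which composite scores carry little reliable variance beyond general intelligence.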
Consistent with Floyd and Kranzler (2019), McGill et al. (2018), and Watkins (2000), this conclusion does not entail that research programs investigating underlying theory (e.g., Cattell-Horn-Carroll [CHC]; see Schneider & McGrew, 2018), approaches (e.g., patterns of strengths and weaknesses), or aptitude by treatment interactions (ATIs) are unimportant research lines or pseudoscientific as a matter of course, but rather that their ability to advance clinical practice is currently limited. In addition to the interpretation of composites and heuristics regarding when to interpret global IQs, many tests produce numerous ancillary scores that are not supported by theory (Beaujean & Benson, 2019). The lack of a theoretical basis and psychometric adequacy for interpretation makes interpretation of ancillary scores contraindicated. In sum, appeals to theory do not obviate the need to ensure that scores produced by IQ tests have a baseline level of appropriate psychometric support. Prevailing ethical guidelines and codes (e.g., American Educational Research Association et al., 2014) make this point clear. Unfortunately, the psychometric information furnished in some test technical manuals does not even meet de minimis standards for adequate reporting, making the task of determining whether a test or individual scores should be used for high-stakes decision-making futile (McGill et al., 2020).
State of Practice and Training
Despite the long-standing psychometric and conceptual issues associated with interpretation of subscale scores in general, and the use of profile analysis methods in particular, surveys reveal that these interpretive practices remain in use. For example, Sotelo-Dynega and Dixon (2014) surveyed 323 practicing school psychologists and found that about half followed a levels-of-analysis approach and a quarter applied the cross-battery assessment framework. Additionally, about half (45%) of their participants disregarded the global IQ due to significant scatter and 1% reported never interpreting the global IQ score at all. However, more than half (56%) reported that they interpreted composite scores all of the time. More recently, Kranzler et al. (2020) surveyed 1,317 practicing school psychologists regarding their use of IQ test interpretation strategies as part of specific learning disability assessments. They found that most clinicians interpreted profiles of subtest scores (~69%) and/or index scores (~64%), or generally applied a levels-of-analysis approach (~29%). While the majority (~80%) of participants reported interpreting the global IQ, more than half (~62%) reported not interpreting the global IQ in the presence of scatter. These data may suggest a reversal of interpretation patterns from those Sotelo-Dynega and Dixon (2014) observed; alternatively, these data may reflect the differences in sampling methods and methodology employed by the two studies. Specifically, the differences observed may be due to Sotelo-Dynega and Dixon’s (2014) focus on general interpretation practices whereas Kranzler et al. (2020) limited their focus to cases where specific learning disability was the primary classification of concern. Perhaps there is a difference in how school psychologists interpret IQ tests when they use them to identify intellectual or developmental disabilities versus when they use them to identify specific learning disabilities.
Regardless, both studies suggest that the interpretation of profiles, the interpretation of composite scores, and the disregard of global IQ in the presence of scatter are common practice.
While assessment practices come from a variety of sources, we focus on training experiences given that trainers maintain influence in that domain. Cook et al. (2009) surveyed 2,607 psychotherapists in the United States and Canada to identify variables that may influence clinical practices and the adoption of evidence-based practices. The most influential variables the authors identified were clinicians’ mentors, books, graduate training, and discussions with peers. These findings may generalize to school psychology, with two-thirds of clinicians reporting they used strategies learned during their graduate coursework and from test technical manuals (Sotelo-Dynega & Dixon, 2014). As graduate training and textbook exposure appear to have a significant impact on IQ test interpretation strategies, it is important to consider how students are taught to interpret such tests and the books assigned to facilitate and guide that coursework and future self-guided professional development once they enter the field.
Fortunately, two studies have described how IQ testing is taught in school psychology programs (Lockwood & Farmer, 2019; Miller et al., 2020). These studies investigated not only the textbooks that are commonly used, but also the type of the interpretive strategies that are typically taught within training programs. The results of these studies suggested that significant emphasis is placed upon IQ test cognitive profile analysis, and that the majority of sources cited to support this practice were not peer reviewed (e.g., McGill et al., 2018). The Lockwood and Farmer (2019) study surveyed 127 graduate trainers responsible for teaching coursework on IQ testing in school psychology programs. Results indicated more than two-thirds of trainers teach students to interpret Stratum II composites in isolation and about two-thirds teach students to compare those composites. This is consistent with additional data suggesting that approximately 69% of trainers teach some form of patterns of strengths and weaknesses analysis and 39% of trainers teach the “Intelligent Testing” framework first introduced by Kaufman (1979) over 40 years ago. In addition, Lockwood and Farmer (2019) found that approximately one-third of trainers teach students to interpret subtest scores and to compare subtest scores. These data seem to support the premise that low-value interpretive strategies continue to be taught in graduate coursework for IQ testing.
In addition, understanding which textbooks clinicians used in their graduate courses may further illuminate why IQ test interpretation practices with little scientific support continue to be popular in practice. Miller et al. (2020) collected syllabi from 90 graduate trainers regarding their programs’ IQ testing course. Various versions of Sattler’s Assessment of Children: Cognitive Foundations; Flanagan et al., Contemporary Intellectual Assessment: Theories, tests, and issues; Kranzler and Floyd’s Assessing Intelligence in Children and Adolescents: A Practical Guide; Schrank et al., Essentials of WJ-IV Cognitive Abilities Assessment; and Flanagan and Alfonso’s Essentials of WISC-V Assessment were the most frequently required textbooks. Most of the frequently used textbooks on IQ testing provided a detailed explication of Stratum II and Stratum III analyses. The practices commonly described within these textbooks involve the following stepwise analyses: (a) interpreting Stratum II scores and profiles, (b) interpreting Stratum I scores and profiles, (c) disregarding Stratum III scores in the presence of scatter, and (d) interpreting ancillary scores (collectively referred to as low-value practices) despite a substantial amount of counterfactual evidence in some cases (e.g., McGill et al., 2018).
Purpose of the Present Study
Because there is a dearth of supportive, peer-reviewed research for (a) interpreting Stratum II scores and profiles, (b) interpreting Stratum I scores and profiles, (c) disregarding Stratum III scores in the presence of scatter, and (d) interpreting ancillary scores (collectively referred to as low-value practices), a substantial amount of counterfactual evidence available in some cases (e.g., McGill et al., 2018), and data suggesting that these strategies continue to be explicitly taught in graduate programs (Lockwood & Farmer, 2019), we hypothesized that non-peer-reviewed sources (i.e., textbooks and test manuals; henceforth, instructional materials) overwhelmingly recommend that these interpretive methods be used in clinical practice. Historically, recommendations of low-value practices have been included in some test manuals (e.g., Wechsler, 2014), and some frequently used textbooks (e.g., Essentials of Cross-Battery Assessment; Miller et al., 2020) center on these practices. However, it is unclear to what extent the guidance available in instructional materials for the most commonly administered IQ tests (see Benson et al., 2019; Sotelo-Dynega & Dixon, 2014) aligns with available peer-reviewed research evidence pursuant to these matters (Lilienfeld et al., 2006).
The purpose of this study was to evaluate available instructional materials to identify the interpretive procedures recommended for clinicians within and between contemporary IQ tests. Although the major goal was to classify themes related to the interpretive guidance featured most prominently across available resources, isolating and amplifying where particular aspects of interpretive practices may have evolved was also included. The present investigation yields information for trainers and assessment scholars when considering which resources to adopt for future intelligence assessment courses.
Methods
The present study employed a content analysis approach (Hsieh & Shannon, 2005) to code instructional materials that were selected for inclusion. Target resources included prominent books, chapters, test technical manuals, and third-party guidebooks (i.e., the Essentials series) for the five most commonly used commercial ability measures for children (see Benson et al., 2019; Sotelo-Dynega & Dixon, 2014). These measures included the Wechsler Intelligence Scale for Children-Fifth Edition (WISC-V), Woodcock-Johnson IV Tests of Cognitive Abilities (WJ IV COG), Kaufman Assessment Battery for Children-Second Edition (KABC-II), Differential Ability Scales-Second Edition (DAS-II), and Stanford-Binet Intelligence Scales-Fifth Edition (SB5). To ensure adequate saturation, internet and library searches were conducted in the Fall of 2019 using each test’s acronym and for general intellectual assessment textbooks; one textbook was updated in 2020. Additionally, the reference lists from chapters for individual tests were screened to locate other potential sources for inclusion in the present review. In order to be included in the present study, the resource had to systematically describe clinical interpretation procedures (i.e., step-by-step) for use with a particular instrument or across instruments. In total, 34 instructional materials were selected for inclusion. Chapters and general frameworks were identified for inclusion, even when those materials were present in the same resource (e.g., Sattler, 2018). A systematic framework was then developed to code the instructional materials (see Table 1).
Table 1. Description of the Classification and Coding Framework Employed in the Present Study.
Note. The levels of analysis approach (e.g., Kaufman et al., 2016) generally encourages clinicians to interpret scores in a step-wise fashion beginning with Stratum I and culminating at Stratum III.
If available.
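The tallying step implied by this kind of content analysis can be sketched in a few lines. The example below assumes each resource has already been hand-coded on one dimension and simply counts how many resources fall under each code, with the percentage of the total. The resource labels, the dimension name `stratum_I`, the code values, and the function name `tally` are all hypothetical; they do not reproduce the study's data or the actual coding framework in Table 1.

```python
from collections import Counter

# Hypothetical hand-coded entries. Labels and codes are invented for
# illustration and do not reflect the resources or codes used in the study.
CODED = [
    {"resource": "Resource A", "stratum_I": "interpret"},
    {"resource": "Resource B", "stratum_I": "interpret"},
    {"resource": "Resource C", "stratum_I": "interpret with caution"},
    {"resource": "Resource D", "stratum_I": "interpret"},
    {"resource": "Resource E", "stratum_I": "not addressed"},
]


def tally(coded, dimension):
    """Count resources per code on one dimension, with rounded percentages."""
    counts = Counter(entry[dimension] for entry in coded)
    n = len(coded)
    return {code: (count, round(100 * count / n)) for code, count in counts.items()}


if __name__ == "__main__":
    for code, (count, pct) in tally(CODED, "stratum_I").items():
        print(f"{code}: {count} of {len(CODED)} ({pct}%)")
```

Keeping the coded entries as plain records makes it straightforward to rerun the tally for another dimension or to split the counts by test.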
Results
Of the 34 interpretive resources identified providing step-by-step interpretive guidelines, seven focused on the WISC-V; five focused on the WJ IV COG; five focused on the KABC-II; four focused on the DAS-II; six focused on the SB5; and seven provided general guidance on cognitive test interpretation. Sattler (2018) was included for the WISC-V, WJ IV COG, DAS-II, SB5, and general guidance reviews as separate chapters for these instruments were provided that included unique guidance by each test consistent with the framework for interpretation discussed throughout the textbook. Details of coding for each included resource are organized by test or general guidance and are presented in Tables 2–7. For brevity, we will focus on instructional materials that provide general guidance.
Table 2. Interpretive Guidance for the Wechsler Intelligence Scale for Children-Fifth Edition (WISC-V; Wechsler, 2014).
Note. Asterisk (*) indicates test technical manual.
Although it is suggested that scatter does not impact the validity of a score as a matter of course, users should evaluate the “cohesiveness” of an indicator to determine if the score should be regarded as clinically meaningful.
Table 3. Interpretive Guidance for the Woodcock-Johnson IV Tests of Cognitive Abilities (WJ IV COG; Schrank et al., 2014).
Note. Asterisk (*) indicates test technical manual.
No specific interpretive guidance is given although it is suggested that interpretive focus will vary depending on the purposes of an evaluation and that all clusters, scores, and tests are interpretable.
Table 4. Interpretive Guidance for the Kaufman Assessment Battery for Children-Second Edition (KABC-II; Kaufman & Lichtenberger, 2005).
Note. Asterisk (*) indicates test technical manual.
Table 5. Interpretive Guidance for the Differential Ability Scales-Second Edition (DAS-II; Elliott, 2007).
Note. Asterisk (*) indicates test technical manual.
Table 6. Interpretive Guidance for the Stanford-Binet Intelligence Scales-Fifth Edition (SB5; Roid, 2003).
Note. Asterisk (*) indicates test technical manual.
Table 7. Guidance Offered in General Cognitive Assessment Interpretive Texts and Resources.
Note. Asterisk (*) indicates test technical manual.
Seven different resources were identified that provided general guidance rather than test-specific guidance. Of those, three (Canivez, 2013; Glutting et al., 2003; Kranzler & Floyd, 2020) encouraged clinicians to focus their interpretation mostly on Stratum III scores. Three (Flanagan et al., 2008, 2013; Hale & Fiorello, 2004) took the opposite position, discouraging interpretation of Stratum III scores, and instead encouraged clinicians to focus their interpretation on Stratum II scores. One (Miller & Maricle, 2019) did not address Stratum III scores at all and encouraged a levels of analysis approach with primary emphasis at Stratum II. Three of these resources (Flanagan et al., 2013; Hale & Fiorello, 2004; Miller & Maricle, 2019) invoked the variability hypothesis at Stratum II; Miller and Maricle (2019) also expressed concerns for ancillary composites. With regard to ancillary composites, most authors either did not address them at all or suggested they should be interpreted with caution. Flanagan et al. (2008) encouraged their interpretation as part of a step-by-step, levels of analysis approach, though retreated to a more cautious position in a later resource (Flanagan et al., 2013). Miller and Maricle (2019), however, suggested that ancillary scores should be interpreted as part of a levels of analysis approach, and that the interpretive value of ancillary scores was superior to that of other scores. Most (excluding Hale & Fiorello, 2004; Miller & Maricle, 2019) discouraged the development of inferences from individual items, whereas all who mentioned test session behavior encouraged the generation of inferences.
Discussion
The present examination identified several themes in instructional materials that seem to support many of the low-value IQ test interpretation practices employed by clinicians (Kranzler et al., 2020; Sotelo-Dynega & Dixon, 2014), which remain a staple in many training programs (Lockwood & Farmer, 2019). First, the majority of the instructional materials surveyed recommended that Stratum II scores should be the primary focal point for clinical interpretation, and interpretation of omnibus, full scale scores was often de-emphasized as a result. This guidance persists even though IQ test variance partitioning research clearly shows that general intelligence explains the vast majority of variance in most of these indicators (e.g., Dombrowski et al., 2021) and that Stratum II dimensions almost never contain sufficient portions of unique variance for confident interpretation (Canivez & Youngstrom, 2019). Only three resources specifically recommended against Stratum II score interpretation and profile analysis methods more generally. The homogeneity of interpretive strategies presented across the instructional materials is disconcerting but may well predict trends in practice and instruction.
Despite evidence contradicting the use of Stratum I interpretation being available since the 1990s (see Watkins, 2000) and former advocates recommending against interpretation at this level entirely (e.g., Kaufman et al., 2016), interpretive guidance regarding Stratum I varied across instructional materials, with a narrow majority (56%) suggesting such subordinate scores are interpretable in a levels of analysis approach. Others recommended Stratum I be interpreted with caution while one (Schrank et al., 2016) suggested that its interpretive value was superior to that of other scores. Similar results were obtained regarding the variability hypothesis. Whereas the vast majority of resources encouraged examiners to forgo interpretation of composite scores in the presence of significant test scatter, more recent resources (e.g., Drozdick et al., 2018; Flanagan & Alfonso, 2017) noted shortcomings associated with this practice. Of note, Drozdick et al. (2018) deserve special mention as they reversed course and wrote that this specific practice was likely un-categorically unsupported based on recent research (e.g., McGill, 2017). In total, these findings suggest that some self-correction has occurred in the last 20 years with respect to subtest analysis and the variability hypothesis. However, the results of the present study illustrate that despite this positive momentum, the vast majority of instructional materials continue to recommend low-value practices.
In sum, results from the present study suggest that the information contained within popular textbooks and manuals does not always align with the assessment practices recommended in the peer-reviewed literature, with some instructional materials promoting more non-empirically supported practices than others. Unfortunately, lack of scientific self-correction in academic textbooks is not uncommon, particularly in texts that focus on the status and functioning of human intellectual abilities and their measurement (Warne et al., 2018). This is consistent with the insight of Meehl (1978), who noted that popular clinical practices are passed down from generation to generation of practitioners through clinical lore and become almost immune to self-correction. Indeed, several of the instructional materials reviewed in this study have been mainstays in cognitive assessment coursework for decades (e.g., Sattler’s and Kaufman’s textbooks are prominent now [Lockwood & Farmer, 2019; Miller et al., 2020] and were prominent in both the 1980s [Oakland & Zimmerman, 1986] and 1990s [e.g., Alfonso et al., 2000]), with little change in the overall interpretive recommendations offered through those resources across the decades despite accumulating contrary evidence (e.g., Watkins, 2000). The aforementioned minor change may then be due to a collective loss of interest in a specific set of practices rather than an accumulation of the evidence-base (Meehl, 1978). Instead, low-value practices seem to be reified and recycled (McGill et al., 2018).
Trainers should consider whether instructional materials have been responsive to the research literature when adopting course materials and are encouraged to give greater consideration to empirical resources (i.e., peer-reviewed articles) where countering evidence is presented and discussed. Continued reliance on conventional training resources will likely perpetuate low-value practices and a preference for assessment practices that have been empirically questioned and, in some cases, discredited in the literature (Truscott et al., 2004; Youngstrom, 2013). Given this predicament, in the spirit of Meichenbaum and Lilienfeld (2018), we present a provisional list of potential warning signs for hype in the clinical assessment literature as well as an annotated bibliography of seminal resources on these matters (https://osf.io/cs9jz/?view_only=7b59c13393c1440a954f0c8871ff5ab9) as a safeguard against the adoption and perpetuation of contraindicated practices.
Limitations and Future Research
The following limitations should be considered. First, selection of instructional materials may have been incomplete or biased in some way (e.g., overlooking texts from other fields, including clinical psychology or I/O psychology). While this may be true and future research should address any gaps in inclusion criteria, there was significant overlap between the instructional materials identified here and those identified as commonly required or recommended by instructors (cf. Miller et al., 2020). Second, a premise of this study was that several common interpretive practices (see Sotelo-Dynega & Dixon, 2014; Kranzler et al., 2020) lack adequate empirical support, and therefore qualify as low-value interpretive practices. While this premise is well supported (see Cohen, 1959; McGill et al., 2018; Watkins, 2000, 2003), different evidentiary criteria may result in the inclusion and exclusion of various practices. For example, some researchers may argue that simulation studies or factor analysis are poor evidence for utility and validity (e.g., McGrew, 2018). Third, guidance was not separated by referral concern (e.g., guidance for the identification of intellectual disability versus guidance for the identification of specific learning disability). While doing so may lead to greater clarity, instructional materials either contained or did not contain guidance on low-value practices, and so it was decided not to approach the task in this manner. Finally, while there is evidence that the identified instructional materials are used by trainers (Miller et al., 2020) and that trainers are teaching about low-value interpretive practices (Lockwood & Farmer, 2019), it is not clear whether trainers are providing this content because of specific state or district policy, or whether they are also including discussion of counterfactual evidence and caution. Future researchers may be interested in exploring the context in which these materials and strategies are taught in graduate coursework.
Conclusion
Instructional materials such as assessment-specific textbooks and technical manuals potentially serve as “pedagogic vehicles” (Kuhn, 1967, p. 137) for low-value practices such as cognitive profile analysis, the interpretation of ancillary scores, or other scores or comparisons. Given the potential influence of textbooks and graduate coursework on the long-term professional behavior of clinicians (Cook et al., 2009; Sotelo-Dynega & Dixon, 2014), trainers who teach cognitive assessment should be aware of this risk and wary about how such content is presented to students. However, if history provides any indication, it is likely that the influence of non-empirical resources will continue to countervail the influence of emerging evidence-based assessment movements in scientific psychology (e.g., Youngstrom, 2013). Lilienfeld et al. (2017) contend that the incorporation of scientific thinking into graduate training may help to reduce the scientist-practitioner gap, and thus trainers have a responsibility to foster scientific thinking. Accordingly, we encourage trainers to explicitly teach students to detect inflated claims in the assessment literature (see Lilienfeld et al., 2012; Meichenbaum & Lilienfeld, 2018) and to select material that incorporates the peer-reviewed literature, including peer-reviewed articles themselves, to inoculate against hype in the clinical assessment literature.
Acknowledgements
Special thanks to Ashley Hale for her assistance with formatting and copyediting.
Authors’ Note
Preliminary results were previously presented at the 2019 meeting of the Trainers of School Psychologists, Atlanta, GA.
Declaration of Conflicting Interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: R.L. Farmer contributed to one textbook (Kranzler & Floyd, 2020) and R.J. McGill served as a reviewer for that textbook and received a free copy of that textbook from the test publisher for those efforts. G.L. Canivez contributed to a second textbook (Canivez, 2013) reviewed in this manuscript.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
