Abstract
This article addresses conceptual and methodological shortcomings in the conduct and interpretation of intelligence test factor analytic research that appeared in Decker, Bridges, Luedke, and Eason’s (2020) article, “Dimensional Evaluation of Cognitive Measures: Methodological Confounds and Theoretical Concerns,” published online in advance in the Journal of Psychoeducational Assessment.
Decker, Bridges, Luedke, & Eason’s (2020) article “Dimensional Evaluation of Cognitive Measures: Methodological Confounds and Theoretical Concerns” was recently published online in the Journal of Psychoeducational Assessment. We read this article with great interest as it promised to present an applied demonstration of how two methodological approaches, the bifactor method (BF; see Brown, 2015), and the Schmid–Leiman procedure (SL; Schmid & Leiman, 1957), are problematic for structural validity research on intelligence tests. In our view, the article suffers from numerous conceptual and methodological shortcomings which we articulate below to better inform ongoing debates about these topics.
Conceptual Misunderstandings
Beginning in the abstract and continuing throughout the article, Decker et al. (2020) mischaracterized the BF and SL methods and misrepresented the results from studies using these procedures. These mischaracterizations are central to the philosophical narrative of the article and deserve correction. As one example, Decker et al. indicated that “subtest loadings on a general factor for a BF model (their Figure 2) are approximately equivalent to loadings for a unidimensional model (their Figure 1). This result occurs because a BF model estimates parameters for a general factor ‘as if’ it is a unidimensional model” (p. 19). This statement is not correct. Neither the BF method nor the SL procedure estimates parameters in this fashion. The SL procedure and BF modeling, or the results produced by these procedures, should not be construed as representing a unidimensional model of intelligence. Simply put, one cannot use a BF method, or its approximation (i.e., SL), without assuming a multidimensional structure (Bonifay, Lane, & Reise, 2017).
Additionally, Decker and colleagues incorrectly contended that the SL procedure “forces” indicators to load on a general factor. With the SL procedure, correlated group factors are prioritized in the first-order solution (with oblique rotation) and allowed to be freely estimated independent of the influence of g. The ability of the SL procedure to recover group factors is well established (Giordano & Waller, 2020). Decker and colleagues neither acknowledged nor discussed this mathematical capability. Brown (2015) and Wolff and Preising (2005) provided an accessible explication for applied users of how parameters are estimated using both procedures.
Further, Decker and colleagues’ literature review claimed to make the case that the SL and the BF are equivalent, with the SL characterized as an exploratory version of the BF procedure. This characterization is incorrect but understandable, given how loosely terminology has been used in the literature (Beaujean, 2015). While Reise (2012) referred to the SL procedure as an approximate exploratory BF model, the SL procedure technically represents a mathematical transformation of the higher-order model using an elegant procedure to apportion the variance to the higher- and lower-order factors (Carroll, 1993; Gorsuch, 1983; Loehlin & Beaujean, 2016; Schmid & Leiman, 1957; Wolff & Preising, 2005). Thus, the SL procedure is not a true BF procedure but rather a reparameterization of the higher-order model. The BF model and SL procedure originate from different assumptions and carry with them their own sets of strengths and limitations. Some of the researchers cited in Decker et al.’s Table 1 have clarified this distinction and moved away from labeling the SL as a BF procedure since it assumes a higher-order structure.
With this understanding, it is important to note that a true exploratory bifactor analysis (EBFA) procedure, which uses a BF rotation, was created by Jennrich and Bentler (2011). Whereas the SL procedure extracts p rotated oblique factors, applies a second-order analysis, and transforms the variance to be apportioned to the general and group factors, the EBFA procedure simply extracts p + 1 rotated factors and apportions variance to those dimensions. No acknowledgment or discussion of the EBFA procedure and how it differs from the SL method was provided in Decker and colleagues’ article.
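The variance apportionment performed by the SL procedure can be illustrated with a minimal sketch. The loading values below are hypothetical and are not taken from Decker et al. or from any WJ III analysis, and the function name schmid_leiman is ours. Assuming a first-order pattern matrix and a vector of second-order loadings obtained from a higher-order solution, the transformation might be written as:

```python
import numpy as np

def schmid_leiman(first_order, second_order):
    """Orthogonalize a higher-order solution (illustrative sketch).

    first_order  : (n_variables, p) pattern loadings on p oblique group factors
    second_order : (p,) loadings of the group factors on the general factor
    Returns (general_loadings, residualized_group_loadings).
    """
    first_order = np.asarray(first_order, dtype=float)
    second_order = np.asarray(second_order, dtype=float)
    # General-factor loading: first-order loading carried through the
    # second-order loading of the factor(s) the variable belongs to.
    general = first_order @ second_order
    # Group-factor loading: first-order loading scaled by the square root of
    # the group-factor variance NOT explained by the general factor.
    group = first_order * np.sqrt(1.0 - second_order**2)
    return general, group

# Hypothetical values: four variables, two group factors.
pattern = np.array([[0.8, 0.0],
                    [0.7, 0.0],
                    [0.0, 0.6],
                    [0.0, 0.5]])
gamma = np.array([0.9, 0.7])

g_loadings, group_loadings = schmid_leiman(pattern, gamma)
communalities = g_loadings**2 + (group_loadings**2).sum(axis=1)
print(np.round(g_loadings, 3))
print(np.round(group_loadings, 3))
print(np.round(communalities, 3))
```

In this sketch the communalities are unchanged by the transformation: the same explained variance is simply reassigned between the general and group factors, which is why the SL solution remains a reparameterization of the higher-order model rather than a separately estimated BF model.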
Framing of the Literature
Decker et al. cited several studies (e.g., Chen, West, & Sousa, 2006; Maydeu-Olivares & Coffman, 2006; Schmiedek & Li, 2004) to make the case that the BF model erroneously fits better than rival models in simulations where the model is implausible for the data. The authors also claimed that the SL procedure and BF model suffer from additional methodological issues such as proportionality constraints, failure to account for unmodeled dimensional complexity, inflation of g variance, and underestimation of variance attributable to group factors. In contrast, the article did not provide a discussion of the potential strengths and utility of SL and BF procedures. From their framing of the literature, Decker and colleagues appear to be opposed to the use of either the BF model or SL procedure for investigating the multidimensional structure of intelligence tests and for making interpretive recommendations for those measures. We respect Decker et al.’s right to maintain this position; however, their article provided a one-sided presentation to frame an argument that the SL/BF procedures are methodologically flawed.
While all modeling procedures have limitations, it is important to frame the limitations accurately and provide a balanced presentation of the literature and conclusions from research studies. For instance, Decker et al. cited Chen et al. (2006) as evidence that a BF model is problematic, but the conclusions from Chen et al. (2006) plainly provided the opposite conclusion as reflected in that article’s abstract: “The bifactor model allowed for easier interpretation of the relationship between the domain specific factors and external variables, over and above the general factor. Contrary to the literature, sufficient power existed to distinguish between bifactor and corresponding second-order models in one actual and one simulated example, given reasonable sample sizes. Advantages of bifactor models over second-order models are discussed” (p. 181).
Additionally, the proportionality constraint issue is only attributable to the SL procedure (and the higher-order model from which it is derived; Gignac, 2016). EBFA overcomes this issue, but it has been shown to “get stuck” in local minima, which can result in group factor collapse particularly with problematic indicators. These are all issues that several researchers on this commentary have uncovered and disclosed in the methodological literature (see Dombrowski, Beaujean, McGill, Benson, & Schneider, 2019).
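To see where the proportionality constraint arises, consider, as an illustrative assumption, an indicator i that loads only on group factor k with first-order loading \lambda_{ik}, where \gamma_k is the second-order loading of that group factor on g. Under the SL transformation:

```latex
\lambda_{i,g} = \lambda_{ik}\,\gamma_k, \qquad
\lambda_{i,k}^{*} = \lambda_{ik}\sqrt{1-\gamma_k^{2}}, \qquad
\frac{\lambda_{i,g}}{\lambda_{i,k}^{*}} = \frac{\gamma_k}{\sqrt{1-\gamma_k^{2}}}
```

The ratio does not depend on i, so every indicator of a given group factor has the same ratio of general to group loading. With a hypothetical \gamma_k = .90, for example, that ratio is approximately 2.06 for all indicators of the factor. A BF model that is estimated directly imposes no such restriction.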
Studies suggesting that model fit statistics are biased in favor of the BF model were cited throughout the article and leveraged as a central critique of the BF model (e.g., Mansolf & Reise, 2017; Murray & Johnson, 2013). Although this potential limitation is framed as a fatal flaw, some of the authors cited by Decker et al. to support this contention offer more circumspect conclusions. For example, Maydeu-Olivares and Coffman (2006) discussed limitations of the BF model yet noted that the BF model should be preferred “when researchers are interested in domain-specific factors over and above the general factor and, particularly, when researchers are interested in their differential predictive validity” (p. 359). As another example, Murray and Johnson (2013) expressed a preference for the BF model when interpreting domain-specific factors. This is the explicitly stated goal of most, if not all, of the studies listed in Decker et al.’s Table 1, as well as the interpretive strategies articulated in intelligence test technical and interpretation manuals. Finally, Decker et al. did not explicate the results from simulation studies that have countered their own claims that the BF model is biased (e.g., Chen et al., 2006; Morgan, Hodge, Wells, & Watkins, 2015). In fact, the Morgan et al. citation is referenced in a way (p. 2) that supports the Decker et al. position on these matters; however, our reading of the Morgan et al. simulation study does not support this contention. On the contrary, the Morgan et al. results clearly indicate that model fit indices are not biased in favor of the BF model when a BF model reflects the “true” model underlying the data.
Building a Narrative Around Interpretation of Research
Decker et al. reported a review of the factor analytic methods utilized by a particular group of researchers (see their Table 1). From this review, they made several unsubstantiated claims that are used to support their premises. For example, Decker and colleagues asserted that “no justification was given for the preference of a BF/SL model” (p. 4) for the Differential Ability Scales, Second Edition (DAS-II; Elliott, 2007) exploratory factor analysis (EFA) conducted by Dombrowski, McGill, Canivez, and Peterson (2019). This characterization is inaccurate. Dombrowski and colleagues provided ample justification for the SL procedures used in their study (pp. 92–93). Similarly, in a later section of the article, Decker et al. claimed that McGill, Dombrowski, and Canivez (2018) committed what they regard as “a clear methodological oversight” because they did “not take into consideration that 100% of the reviewed studies used BF/SL models” (p. 19). However, the authors failed to disclose that the table in question from that article was located in a section dedicated specifically to the discussion of BF modeling relative to the estimation of coefficient omega in intelligence test research.
Decker and colleagues then alleged that the studies they evaluated demonstrated evidence of methodological bias because all employed the SL procedure or BF model estimation. The only thing that Decker et al. (2020) have actually demonstrated is that researchers who are well versed in the use of appropriate factor analytic methods have properly utilized recommended procedures when conducting these analyses. For example, Keith and Reynolds (2018) described higher-order and BF models as well as the SL procedure in their tutorial on the use of confirmatory factor analysis (CFA) with intelligence tests. It is unclear what alternative procedures Decker et al. would prefer to have been employed other than to eschew use of SL/BF methods altogether.
We also draw attention to Decker and colleagues’ omission of research that has supported the retention of a higher-order (HO) model at the expense of the BF model (e.g., Dombrowski, McGill, & Morgan, 2019; McGill, 2020) and other research that has impartially presented both the HO and BF models to readers when they could not be distinguished statistically (e.g., Canivez, Watkins, & Dombrowski, 2017; Canivez, Watkins, Good, James, & James, 2017; Dombrowski, Golay, McGill, & Canivez, 2018; Strickland, Watkins, & Caterino, 2015). Decker et al. also overlooked studies that have acknowledged the existence of broad abilities beyond g (e.g., Benson, Beaujean, McGill, & Dombrowski, 2018). Results furnished in these studies cast substantial doubt on Decker et al.’s claims of systemic bias in the literature.
Based in part on their fundamental misunderstanding of the SL procedure and BF modeling, construing them as “equivalent to a unidimensional model of intelligence,” Decker et al. (p. 19) appear to have misinterpreted the findings reported in their Table 1 as evidence that this body of work has argued for a “unidimensional model of intelligence” (p. 4). This represents a logical fallacy in the form of a false equivalence. In fact, every one of the studies in Decker et al.’s Table 1 (and others published by some of the authors listed in that table) evaluated the tenability of a single-factor unidimensional g model, found it empirically lacking, and rejected it as inadequate. A review of Decker et al.’s Table 1 shows that each study supported the presence of 3–7 group factors (across multiple IQ tests), which is made clear in the studies themselves. The conclusion from these studies that primary emphasis should be placed on interpreting the Full Scale Intelligence Quotient (FSIQ) should not be construed as suggesting that an instrument is unidimensional or that broad abilities do not exist. None of the studies listed in Decker et al.’s Table 1 have ever made that contention. Rather, the following statement by Schneider and Kaufman (2016) summarizes well the philosophical position of most of the authors of the studies listed in Decker and colleagues’ Table 1: “Although no scholars believe that intelligence consists solely of the g factor, some believe that our intelligence tests are simply too crude to effectively isolate the smaller factors of ability” (p. 286).
It is important to highlight that the discussion of group factor interpretability involves more than just simple variance partitioning. The alignment of subtests with theoretically posited group factors (a requirement not discussed but tacitly acknowledged by Decker et al. when they disclosed that their EFA results suggested retention of a five-factor model, which diverges from the publisher theory) should also be considered when deciding to interpret indices derived from group factors. Additionally, metrics of interpretive relevance (e.g., omega-hierarchical [ωH] and omega-hierarchical subscale [ωHS] coefficients, construct reliability or replicability [H], and percentage of uncontaminated correlations) should be considered. These metrics offer an additional vantage from which to view whether an instrument’s group factors are empirically suitable for interpretation (Dombrowski, 2020; Reise, Bonifay, & Haviland, 2013; Rodriguez et al., 2016a, 2016b; Sellbom & Tellegen, 2019).
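For readers less familiar with these metrics, the following minimal sketch shows how ωH, ωHS, and H are commonly computed from a standardized orthogonal (BF or SL) solution. The loadings are hypothetical values chosen for illustration, not estimates from any intelligence test, and the function names are ours. The sketch assumes that each indicator loads on the general factor and on exactly one group factor.

```python
import numpy as np

def omega_h(general, group):
    """Omega-hierarchical: share of total-score variance due to the general factor."""
    general, group = np.asarray(general, float), np.asarray(group, float)
    unique = 1.0 - general**2 - (group**2).sum(axis=1)
    total = general.sum()**2 + (group.sum(axis=0)**2).sum() + unique.sum()
    return general.sum()**2 / total

def omega_hs(general, group, k):
    """Omega-hierarchical subscale: share of subscale-score variance due to
    group factor k after the general factor is accounted for."""
    general, group = np.asarray(general, float), np.asarray(group, float)
    rows = group[:, k] != 0  # indicators assigned to group factor k
    unique = 1.0 - general[rows]**2 - (group[rows]**2).sum(axis=1)
    total = (general[rows].sum()**2
             + (group[rows].sum(axis=0)**2).sum()
             + unique.sum())
    return group[rows, k].sum()**2 / total

def construct_h(loadings):
    """Construct replicability (H) for one factor's standardized loadings."""
    lam2 = np.asarray(loadings, float)**2
    return 1.0 / (1.0 + 1.0 / (lam2 / (1.0 - lam2)).sum())

# Hypothetical standardized loadings: six indicators, three group factors.
g = np.array([0.7, 0.7, 0.6, 0.6, 0.5, 0.5])
grp = np.array([[0.3, 0.0, 0.0],
                [0.3, 0.0, 0.0],
                [0.0, 0.4, 0.0],
                [0.0, 0.4, 0.0],
                [0.0, 0.0, 0.5],
                [0.0, 0.0, 0.5]])

print(round(omega_h(g, grp), 3))
print([round(omega_hs(g, grp, k), 3) for k in range(3)])
print(round(construct_h(g), 3))
```

With these illustrative values the general factor accounts for most of the reliable total-score variance while the first group factor contributes little unique variance to its subscale score, which is the kind of pattern that such metrics are intended to make visible.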
Questionable Methodological Procedures
Although the theoretical and conceptual errors within Decker et al.’s article are sufficient to give one pause, there are numerous methodological shortcomings (not all are presented in this commentary) that deserve highlighting as they were further used by Decker et al. to generate incorrect conclusions.
Varimax Rotation in EFA
Decker et al. did not disclose the rotation method applied in their EFAs; however, we used the summary data reported in their article to determine that an orthogonal rotation method, varimax, was employed to produce their findings. The use of varimax rotation was methodologically inappropriate, given the aims of the Decker et al. study as well as the nature of tests of intelligence (Gorsuch, 1983; Watkins, 2018). Jensen (1998) noted that an orthogonal rotation “expressly precludes the appearance of a g factor. With orthogonal rotation, the g variance remains in the factor matrix but is dispersed among all of the group (or primary) factors. This method of factor analysis is not appropriate for any domain of variables, such as mental abilities, in which substantial positive correlations among all the variables reveal a large general factor” (p. 73).
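The dispersion of general-factor variance under orthogonal rotation can be shown with a small numerical sketch. The unrotated loadings below are hypothetical, and a fixed 45-degree orthogonal rotation is used as a simple stand-in for varimax (it is not the varimax algorithm itself); the point it illustrates holds for any orthogonal rotation away from the unrotated solution.

```python
import numpy as np

# Hypothetical unrotated solution: column 0 is the first unrotated factor
# (uniformly high loadings, i.e., g); column 1 is a bipolar second factor.
unrotated = np.array([[0.7,  0.3],
                      [0.7,  0.3],
                      [0.7, -0.3],
                      [0.7, -0.3]])

# Fixed 45-degree orthogonal rotation (an assumption standing in for varimax).
theta = np.pi / 4
c, s = np.cos(theta), np.sin(theta)
rotation = np.array([[c,  s],
                     [-s, c]])
rotated = unrotated @ rotation

# Communalities are unchanged by an orthogonal rotation ...
print(np.round((unrotated**2).sum(axis=1), 3))
print(np.round((rotated**2).sum(axis=1), 3))

# ... but the variance carried by the first factor is redistributed, so the
# first rotated factor no longer has uniformly high (g-like) loadings.
print(np.round(rotated, 3))
print(round((unrotated[:, 0]**2).sum(), 3), round((rotated[:, 0]**2).sum(), 3))
```

In this sketch the first rotated factor loads strongly on only two of the four variables, so reading its loadings as g loadings would understate the general factor even though no variance has left the solution.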
Incorrect Interpretation of “g” Loadings
The most substantive flaw in the article is the misinterpretation or misrepresentation of the authors’ own results reported in their Tables 5 and 6. Decker et al. examined a series of increasingly complex first-order models ranging from one to eight factors (1F–8F). Consistent with established practice (e.g., Jensen, 1998), the 1F solution was used to obtain initial g loadings from the first unrotated factor. Then, only the loadings from the first rotated factor extracted for Models 2F–7F were reported in their Table 5 and erroneously interpreted as representing a general factor. However, once a first-order factor solution is rotated, the loadings from the first factor no longer represent g loadings unless the solution is inherently unidimensional, which the authors acknowledge is not the case. “When rotation occurs, the variance associated with the first factor seems to disappear, but in reality, it has become the dominant source of variance in the now rotated factors” (Carretta & Ree, 2001, p. 332).
Table 1. First Unrotated Factor Coefficients (g Loadings) Using Maximum Likelihood Estimation Across Various WJ III Cognitive Factor Extractions.
Note. *Improper solution because the extracted communality estimate for Spatial Relations was >1.00, indicating a Heywood case. A Heywood case warning was encountered when estimating Models 4F–7F.
It is not clear why Decker et al. overlooked the results of hundreds of studies spanning more than a century that consistently found a strong general factor, regardless of factor analytic methodology (e.g., Carroll, 1993; Jensen, 1998; Lubinski, 2004; Warne & Burningham, 2019). As summarized by Sternberg (2003), “the evidence in favor of a general factor of intelligence is, in one sense, overwhelming. . . . One would have to be blind or intransigent not to give this evidence its due” (p. 374).
Lack of Disclosure in EFA
In addition to not disclosing which rotation method was used in EFA, Decker et al. did not report factor loadings (coefficients) for the remaining extracted first-order factors in Models 2F–7F. This makes it difficult to determine whether the models are free from local identification issues, which is a critical component of model evaluation in EFA (Gorsuch, 1983; Watkins, 2018). For example, a Heywood case (Spatial Relations) was encountered in Model 6F (see Table 1). Additionally, extracted factors that contain too few measured variables for identification, measured variables that cross-load, or measured variables that migrate to theoretically different factors were apparently not considered. The authors did not report encountering these issues in their analyses. It is certainly possible, in the models that they examined, that the Heywood case evaded extraction, but it does suggest that there is instability in these analyses.
Lack of Disclosure in CFA
All of the models with group factors examined in the CFA analyses reported by Decker and colleagues are just-identified, which renders the models statistically indistinguishable. However, the degrees of freedom reported for their Model 5 are not consistent with what was expected (i.e., df = 70). Instead, the authors reported, “Aside from adding a degree of freedom to the analysis, the effect of the additional specification for the hierarchical model was likely to be minimal [emphasis added] since only one additional parameter was required” (p. 18). The specific parameter that was modified and the rationale for its specification were not disclosed, a clear violation of research reporting standards (Appelbaum et al., 2018).
Another major problem is apparent in Decker et al.’s Figure 2. There are no standardized path coefficients reported from Long-Term Storage and Retrieval (Glr) to Visual Auditory Learning or Retrieval Fluency, suggesting that these paths were near zero, and thus, Glr is not a viable group factor. Further, there is no indication whether the standardized path coefficients from Visual-Spatial Processing (Gv) to Spatial Relations (.10) and Picture Recognition (.10) were statistically significant, but even if so, they are trivial and indicate that Gv in this model is likely also untenable. Such omissions suggest disregard of local fit problems as an important aspect of judgment of model adequacy because CFA models should never be retained “solely on global fit testing” (Kline, 2016, p. 461). Recall that the EFA analyses reported by Decker et al. supported a five-factor model.
As it appears to have been necessary to modify the Woodcock-Johnson III Tests of Cognitive Abilities (WJ III Cognitive; Woodcock, McGrew, & Mather, 2001) theoretical model, we used the summary data reported by Decker and colleagues to examine the publisher-proposed, theoretically derived HO model. Specification of the HO model produced clear evidence of model misspecification in the form of a Heywood case (the second-order loading between Glr and g was 1.05 and Glr had a negative residual variance estimate [-.094]), indicating an impermissible solution. We can only speculate as to what constraint Decker et al. applied to Model 5 to rectify the offending estimate, as this was not disclosed, nor was the additional parameter made evident in their Figure 5. Thus, we examined the most plausible option and constrained the variance in Glr to 1.0 to restrict its ability to produce an out-of-bounds estimate. With the addition of this constraint (which added a degree of freedom), the resulting standardized loadings were identical to those reported in their Figure 5 and the global fit statistics were approximately the same as those reported in their Table 7. Ironically, the HO model appears to have required the use of a constraint to be identified, a major source of contention by Decker et al. with respect to the BF model.
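The degrees-of-freedom bookkeeping behind these statements can be sketched as follows. This is a hedged illustration rather than a reconstruction of the models actually estimated: it assumes 14 indicators and seven first-order factors with simple structure, one marker loading fixed per first-order factor, the variance of g fixed to 1.0, and no correlated residuals. None of these specification details are reported in this form in the text above, and the function name is ours.

```python
def higher_order_df(n_indicators, n_group_factors, extra_constraints=0):
    """Degrees of freedom for a simple-structure higher-order CFA model.

    Assumes (for illustration only) marker-variable scaling of the first-order
    factors and a general factor whose variance is fixed to 1.0.
    """
    # Unique variances and covariances among the observed indicators.
    moments = n_indicators * (n_indicators + 1) // 2
    free_parameters = (
        (n_indicators - n_group_factors)  # free first-order loadings
        + n_indicators                    # indicator residual variances
        + n_group_factors                 # second-order loadings on g
        + n_group_factors                 # first-order factor disturbances
        - extra_constraints               # e.g., one variance fixed to a value
    )
    return moments - free_parameters

# Assumed dimensions: 14 indicators, 7 first-order factors.
print(higher_order_df(14, 7))                       # unconstrained HO model
print(higher_order_df(14, 7, extra_constraints=1))  # one added constraint
```

Under these assumptions the unconstrained HO model yields the expected df = 70, and fixing a single variance adds one degree of freedom, which is consistent with the description of the constraint given above.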
Finally, it remains unclear what the results reported in their Table 8 represent. Based on Decker et al.’s description, these appear to be standardized loadings (coefficients) on the g factor based on their Models 1 and 2 that should correspond to the loading coefficients reported in their corresponding Figures 1 and 2. However, the loadings for Verbal Comprehension are absent from their Table 8 without explanation. Of even greater concern, none of the values reported in their Table 8 are consistent with the “g” loadings reported for the corresponding Models 1 and 2. As a result, it remains unclear how these results were obtained. Even so, the variance explained by g is 48.7% and 39.9%, respectively, for their Models 1 and 2. Thus, the BF model produced a general factor that is approximately 20% weaker than the unidimensional model. This loss of power is not trivial, illustrating that the models should not be regarded as equivalent. This is also evident in the fit statistics reported in their Table 7, which indicated that Model 1, in their analyses, is structurally deficient and should not be retained. Inadequacy of a g-only model (their Model 1) was also reported in every CFA study listed in Decker et al.’s Table 1 and in studies conspicuously absent from that systematic review (e.g., Canivez, Watkins, & Dombrowski, 2017; Canivez, Watkins, Good, et al., 2017; Canivez, Watkins, & McGill, 2019; Dombrowski, McGill, & Morgan, 2019; Fenollar-Cortes & Watkins, 2019; Lecerf & Canivez, 2018). In sum, it is unclear what the Decker et al. CFA results prove with respect to whether the BF model is biased or to the psychometric adequacy of the variance partitioning procedures that emanate from that model. It may be the case that “sometimes a method or concept is baffling not because it is profound but because it is wrong” (Sasso, 2001, p. 188).
Conclusion
The Decker et al. study is predicated on theoretical, philosophical, and statistical misunderstandings and contains serious methodological flaws. In our view, their conclusions are not supported by the analyses reported. In fact, the analyses were not designed to falsify their contention that the SL procedure and BF model are biased methodological choices. Accordingly, readers are strongly urged to take our rebuttal into account when evaluating the commentary and speculation stemming from their analyses and when citing Decker et al. in future discussions on these matters.
Nevertheless, the theoretical and empirical questions raised by Decker et al. remain important points for future research. Researchers (e.g., Horn, 1968; Johnson & Bouchard, 2005; Kovacs & Conway, 2016) have presented alternative views of the structure of human intelligence that should be investigated. Additionally, “a plethora of fundamental questions about cognitive abilities—their structure, sources, and meanings” (Carroll, 1998, p. 22) remain unanswered for even the most widely accepted models of intelligence.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
