Abstract
For over three decades confirmatory factor analysis (CFA) has been used to test the construct validity of models of posttraumatic stress disorder (PTSD). The four symptom dimensions of PTSD in the fifth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM–5) are based on CFA. Since the publication of DSM–5, the number of proposed factors has grown from four to seven. We review these models, focusing on (a) the number of symptoms per factor, indicating how well factors are identified; (b) correlations between factors, indicating how distinct they are; and (c) their external validation. Of the 27 CFAs published since 2013, almost all included factors composed of only two symptoms, and most included more than one such factor. High factor correlations were the norm. Two thirds of models provided external validation. Discussion concerns implications for PTSD’s measurement and construct validity and recommendations for improving CFA in the PTSD literature.
Confirmatory factor analysis (CFA) has been used to assess the construct validity of posttraumatic stress disorder (PTSD) since shortly after the codification of the diagnosis in the third edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM–III; American Psychiatric Association, 1980; e.g., Silver & Iacono, 1984). With the advent of the fourth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM–IV; American Psychiatric Association, 1994), CFA studies of PTSD became important in the way the field conceptualized the diagnosis and how clinicians assessed the presence or absence of the disorder (for a thorough review, see Elhai & Palmieri, 2011). Since the publication of the fifth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM–5; American Psychiatric Association, 2013), CFA models have become increasingly multidimensional. As the number of dimensions proliferates, so too does reliance on poor modeling practices, of which we focus on three: (a) underidentified factors, (b) highly correlated factors, and (c) use of fit statistics alone to select models.
Presenting the misuses of latent variable models in PTSD research may not strike readers as particularly novel. Indeed, almost half a century ago, David Lykken (1971) voiced skepticism of the uses of factor analysis to justify explanatory models in clinical psychology. More recently, many colleagues within our field have nodded in agreement when we have discussed the criticisms included here. And yet these practices persist not only in the analysis of PTSD but in other areas of clinical psychology as well. We hope that this review will be used by researchers, editors, and reviewers alike to help curtail practices that are problematic at best. To help promote good modeling practices, we first give a brief conceptual introduction to CFA and key issues related to poor modeling practices. Next, we describe trends in CFAs of PTSD before and after the publication of the DSM–5 (American Psychiatric Association, 2013). We then review the best-fitting CFA models of PTSD published since the publication of the DSM–5. Several reviews have drawn on pre-DSM–5 history to argue for specific models for DSM–5 PTSD (e.g., Armour, Müllerová, & Elhai, 2016; Elhai & Palmieri, 2011); we do not. Instead, we examine the extent to which the literature since DSM–5 follows good practices in its use of CFA and what this says about the measurement and construct validity of PTSD. We close with a list of recommendations for good practice in CFA modeling of PTSD.
A Brief Introduction to CFA
Within the broader family of generalized structural equation models, CFA is one of several reflective-indicator measurement approaches that include latent class analysis (LCA) and item response theory (IRT) models, among others. Sources such as Bartholomew, Knott, and Moustaki (2011); Kaplan (2008); and Skrondal and Rabe-Hesketh (2004) contain very thorough technical reviews of such models, including generalizations to binary and ordinal data and careful discussions of topics such as model estimation and identification. We refer readers to Kline (2015) and Brown (2015) for discussions written for applied researchers. For simplicity, we will describe these procedures as CFA inclusively to refer to both linear CFA and models such as ordinal CFA or the equivalent IRT models that are increasingly used to account for the ordinal indicators found in common practice.
In reflective indicator models, indicators reflect latent variables that cannot be measured directly. These latent constructs are assumed to be the cause of relationships between observed indicators. In clinical research, indicators are usually responses to survey or questionnaire items. For example, responses to PTSD items concerning recurrent thoughts of traumatic events, nightmares, flashbacks, and responses to trauma cues are all associated with one another, and this suggests that responses to these items reflect a singular, underlying latent psychological phenomenon—namely, reexperiencing. The association of a set of indicators with one another comprises their common covariance; statistically, this is represented as a factor. Were we to observe the factor directly and regress the items on it, the items would be independent. Just as each indicator has covariance that is shared with others, each indicator also has variance that is not shared with others—its unique variance. Indicators could also have systematic variance that is not primarily due to the substantively interesting general phenomenon, such as those often found when negatively worded items hang together more strongly than expected because of wording effects.
Indicators are associated with factors and have factor loadings representing the degree of this association. In most cases, standardized factor loadings range between −1 and 1 and can be interpreted as correlations between indicators and the factor, that is, the covariance shared among the other indicators. In purely descriptive modeling (e.g., exploratory factor analysis), it is common for indicators to load onto more than one factor, or cross-load; however, indicators will usually load on one factor substantially more than on others. The general preference in CFA modeling is for indicators not to cross-load, a point we will discuss in the following.
Each element of a model—for example, factor loadings, unique variances, correlations between factors (see the following)—is known as a parameter. Models with more indicators and more factors have quadratically more potential parameters, but CFA typically restricts the number of free parameters. Current guidelines suggest that CFA requires large sample sizes to generate stable parameter estimates, but large sample sizes are not always sufficient. Although psychometricians differ as to the number of participants needed to estimate each parameter well, all agree that the issue is not trivial (e.g., MacCallum, Browne, & Sugawara, 1996; Muthén & Muthén, 2002). Furthermore, as MacCallum, Widaman, Zhang, and Hong (1999) argue, even if the number of participants is large, the problem of poorly identified factors remains a concern and standard hypothesis tests become overpowered, meaning that a good-fitting model often appears to fit poorly for trivial reasons. Dillon, Kumar, and Mulani (1987) note that poorly identified and/or misspecified models are also much more likely to exhibit improper solutions, also known as Heywood cases, in which parameter estimates are illogical, such as a correlation with magnitude greater than 1.0 or a negative variance. As they note, “The problem can become more acute as the complexity of the model increases” (p. 128). Bartholomew, Knott, and Moustaki (2011) indicate that smaller samples exacerbate the problem of improper solutions.
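To make the parameter arithmetic concrete, the following minimal sketch counts free parameters for a simple-structure CFA. It rests on assumptions that are ours and not stated in the text: no cross-loadings, no correlated uniquenesses, no mean structure, and each factor's scale fixed in one of the usual ways. The function names are hypothetical.

```python
def count_free_parameters(n_indicators: int, n_factors: int) -> int:
    """Free parameters in a simple-structure CFA (illustrative assumptions).

    Assumes no cross-loadings, no correlated uniquenesses, no mean structure,
    and factor variances fixed to 1 (a unit-loading constraint gives the same
    total: one fewer loading and one more variance per factor).
    """
    loadings = n_indicators              # one loading per indicator
    unique_variances = n_indicators      # one unique variance per indicator
    factor_correlations = n_factors * (n_factors - 1) // 2
    return loadings + unique_variances + factor_correlations


def count_unrestricted_loadings(n_indicators: int, n_factors: int) -> int:
    """Loadings if every indicator were allowed to load on every factor."""
    return n_indicators * n_factors


# 20 indicators, as in the DSM-5 symptom set, with 2 to 7 factors
for k in (2, 4, 6, 7):
    print(k, count_free_parameters(20, k), count_unrestricted_loadings(20, k))
```

Under these assumptions, 20 indicators give 41 free parameters with two factors and 61 with seven, which shows how quickly the factor-correlation block grows as factors are added.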
Model Fit and Selection
In CFA, indicators are assigned to factors according to their conceptual content—namely, modeled—and CFA measures the fit of these assignments by examining the residual covariances between assignments and the data (e.g., participants’ responses). If residual covariances are large (or at least a few are large), the model is said not to fit the data well; if residuals are small, then the model has good fit. Because the number of covariances is large (e.g., for 20-item PTSD indicators, there are 190 covariances), various fit statistics have been developed to help researchers assess fit. Commonly used fit statistics include standardized root mean square residual (SRMR), Tucker-Lewis Index (TLI), Comparative Fit Index (CFI), and root mean square error of approximation (RMSEA). Although all are based primarily on the size of residual covariances, they are not always consistent with one another. For reviews of fit statistics, see Kline (2015) and Skrondal and Rabe-Hesketh (2004). RMSEA tends to be the strictest criterion. It has therefore become the decisive standard, with conventional benchmark criteria of 0.08 for acceptable and 0.05 for good fit. Unfortunately, as F. Chen, Curran, Bollen, Kirby, and Paxton (2008) noted, there is no clear theoretical support for the 0.08 and 0.05 conventions even in the linear model. Xia and Yang (2018) argue RMSEA is biased too low when weighted least squares is used with ordinal items. This is important to clinical psychology because indicators are typically ordinal. By contrast, Browne, MacCallum, Kim, Andersen, and Glaser (2002) show that even in the linear factor model, RMSEA is excessively sensitive to small discrepancies from otherwise good models, ironically most strongly when indicators have high reliability! Finally, McDonald (2010) notes that global fit statistics often mask important misfit in models that are localized, for instance, a poorly fitting structural model being swamped by a good fitting measurement model.
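As a rough illustration of two quantities mentioned above, the sketch below counts the distinct covariances among a set of indicators and computes an RMSEA point estimate from a model chi-square. The RMSEA formula used here is one common textbook version (with N − 1 in the denominator; some software uses N), and the chi-square, degrees of freedom, and sample size in the example call are hypothetical values chosen only for illustration.

```python
import math


def n_covariances(n_indicators: int) -> int:
    """Number of distinct covariances among the indicators."""
    return n_indicators * (n_indicators - 1) // 2


def rmsea(chi_square: float, df: int, n: int) -> float:
    """RMSEA point estimate: sqrt(max(chi2 - df, 0) / (df * (N - 1))).

    This is one common formula; it is an assumption of this sketch,
    not a formula given in the text.
    """
    if df <= 0 or n <= 1:
        raise ValueError("df must be positive and n must exceed 1")
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


print(n_covariances(20))                       # covariances among 20 items
print(round(rmsea(400.0, 149, 611), 4))        # hypothetical chi2, df, N
```

Because the estimate depends only on the chi-square, the degrees of freedom, and the sample size, it summarizes overall misfit and cannot show where in the model the misfit is located.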
Best fit is based on maximizing the amount of variance accounted for—regardless of whether the variance is due to the construct(s) of interest or random or systematic error. More complex models will necessarily account for more variance but potentially at the cost of modeling variation irrelevant to the phenomena under investigation (i.e., overfitting). Fit statistics cannot distinguish between relevant and irrelevant variance, so it is incumbent on researchers to select models using theory and external variables in addition to fit statistics.
Number of Indicators and Identifying Factors
As mentioned previously, factor loadings represent the degree of association that indicators have with the shared covariance among a larger group of indicators. In CFA, indicators are assigned a priori to individual factors to test these associations. Indicators with greater covariance load best onto the same factor. In general, more indicators per factor are preferred to identify strong factors (Marsh, Hau, Balla, & Grayson, 1998; Thurstone, 1948). However, mathematically it is possible that two similar indicators will suggest a factor on their own because they share a great deal of covariance that is not shared by other indicators. Consequently, they exhibit higher residual correlation than is consistent with the factor model. These two highly correlated indicators will have association remaining after accounting for factor loadings and thus may, together, look like a separate factor simply because their correlation with each other is stronger than their correlation with other items. There are a number of possible reasons why items may covary excessively, but our belief is that the most common reason is found in standard scale construction practices. Indeed, it would be odd if there were not two items correlated with one another more highly than with others, typically because of wording. This means that researchers using factor analysis will frequently identify two-item factors. But two items are insufficient to identify a factor on their own for conceptual and not purely statistical reasons.
Consider three items from the Posttraumatic Checklist–5 (PCL-5): “being ‘superalert’ or watchful or on guard,” “feeling jumpy or easily startled,” and “having difficulty concentrating” (Weathers et al., 2013). The responses to the first two items should be more highly associated with one another than with responses to the third. Does the stronger association between these two reflect a latent factor that the third does not? With only two items, it is impossible to say. Because among a given set of items two are likely to be more highly associated with one another than with others, two-item factors do not allow researchers to distinguish between theoretically versus empirically driven results. The existence of two-item factors suggests two possible situations: (a) A potential factor exists that would require more indicators to be identified on its own, or (b) two items are simply strongly correlated versions of the same nonlatent phenomenon because of similarity or causality between them. If the former situation were true of the three items from the PCL-5, this would mean separate factors might exist for physical and psychological hyperarousal. The latter condition, two items that are conceptually very close or causally related and thus strongly correlated, is referred to as a doublet (McDonald, 1999, 2004). If this situation were true of the PCL-5 items, the stronger relationship between the two physical hyperarousal items would reflect only their surface similarity. Doublets are evidence of empirical association but are silent about the conceptual status of the factor. Indeed, doublets are one of the most pervasive causes of improper solutions. Unfortunately, there is no easy way to distinguish statistically between doublets and two-indicator factors (which are simply poorly measured). Instead, they must be examined on substantive grounds, and more indicators need to be developed to measure them better.
Recognizing the problems of two-indicator factors, methodologists have set criteria for the number of indicators needed to identify factors (Marsh et al., 1998; Skrondal & Rabe-Hesketh, 2004; Thurstone, 1948). With only two indicators, a factor is underidentified. This means that a factor model cannot be estimated using just two indicators without imposing additional constraints. Three indicators are necessary for a factor to be just-identified, meaning that the factor model is estimable. However, in this case, it fits the data perfectly. Only when there are more than three indicators is a factor overidentified. All other things being equal, it is better to have more indicators per factor (Marsh et al., 2010) because only overidentified models are testable. Two-indicator factors are thus underidentified and cannot provide evidence that theoretically strong factors exist on their own. The overall model may be identified, but the part of it that is important to determining whether a particular factor is sensible or not is not.
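The counting argument behind these identification categories can be sketched as follows. The sketch considers a single factor in isolation with its variance fixed to 1 and no other constraints, which is our simplifying assumption; the counting rule is a necessary condition for identification, not a sufficient one, and the function names are hypothetical.

```python
def single_factor_df(n_indicators: int) -> int:
    """Degrees of freedom for one factor considered on its own.

    Observed information: variances and covariances of the indicators.
    Free parameters: one loading and one unique variance per indicator
    (factor variance fixed to 1; an assumption of this sketch).
    """
    observed = n_indicators * (n_indicators + 1) // 2
    free = 2 * n_indicators
    return observed - free


def identification_status(n_indicators: int) -> str:
    """Classify a stand-alone factor by the sign of its degrees of freedom."""
    df = single_factor_df(n_indicators)
    if df < 0:
        return "underidentified"
    if df == 0:
        return "just-identified"
    return "overidentified"


for p in (2, 3, 4, 5):
    print(p, single_factor_df(p), identification_status(p))
```

With two indicators there are three observed moments and four parameters, so the factor cannot be estimated without borrowing information from elsewhere in the model or adding constraints such as equal loadings.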
In exploratory research contexts, in which the purpose is often to provide conceptual signposts for phenomena that should be better differentiated in future research, methodologists use conventions to infer the validity of two-indicator factors. In multifactor models (like PTSD), one such convention is that a two-indicator factor is identified as long as it is sufficiently correlated with other factors. However, researchers will often need to use modifications such as equating loadings, which makes the model equivalent to a correlated uniqueness model. The number of such modifications necessary to generate an identified and/or good-fitting model clearly affects its generalizability—a considerable problem when such models are applied to clinical settings.
Finally, a factor defined by only two indicators is not likely to have good reliability. The factor analytic reliability coefficient, often called McDonald’s omega (McDonald, 1999), depends on both the loadings and unique variances. Much like other reliability coefficients, it measures the ratio of systematic variance associated with the factor to the total variance. All other things being equal, the reliability of a factor is larger when there are more indicators. Again, in more exploratory research contexts, this may not be enough of a problem to make a model suspect; however, when factors are used as the bases of high-stakes decisions, such as clinical diagnoses, reliability must be high (Bandalos, 2018). In sum, while two-indicator factors may have utility in exploratory research and arguably may be appropriate in other contexts in which structural equation modeling (SEM) is applied, clinicians need to be wary of making diagnostic decisions based on such factors.
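The dependence of omega on the number of indicators can be illustrated with a small sketch. It uses the standard expression for a unidimensional factor with uncorrelated unique variances, omega = (sum of loadings)² / ((sum of loadings)² + sum of unique variances). The standardized loading of .70 and the helper names are hypothetical, chosen only to show the pattern.

```python
def mcdonald_omega(loadings, unique_variances):
    """Omega for one factor: common variance over total variance.

    Assumes a unidimensional factor with uncorrelated unique variances.
    """
    common = sum(loadings) ** 2
    return common / (common + sum(unique_variances))


def omega_equal_loadings(loading: float, n_indicators: int) -> float:
    """Omega when all standardized indicators share the same loading."""
    loadings = [loading] * n_indicators
    uniques = [1.0 - loading ** 2] * n_indicators  # standardized indicators
    return mcdonald_omega(loadings, uniques)


# Same hypothetical loading (.70), different numbers of indicators
for p in (2, 3, 4, 6):
    print(p, round(omega_equal_loadings(0.70, p), 3))
```

Holding the loading constant, omega rises as indicators are added, so a two-indicator factor sits at the low end of what a given item quality can deliver.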
Distinguishing Between Factors
In addition to indicators’ covariation being the basis of factors, factors themselves typically covary with one another. The correlation of latent variables within clinical models is interpreted like any other correlation, and in the clinical psychology literature, these correlations are typically positive and strong. Depression factors tend to be correlated above .70 (Osman et al., 1997; Ward, 2006), with some studies reporting substantially higher values (e.g., Van Dam & Earleywine, 2011). One might expect factor correlations to be strong between dimensions of clinical phenomena relative to correlations between clinical phenomena and predictors because symptoms of emotional distress are often reported simultaneously. This may be due to transdiagnostic processes underlying all psychological disorders (Barlow, Allen, & Choate, 2004; Caspi et al., 2014), respondents’ inability to make clear distinctions between symptoms, or both.
CFA factors may be correlated for statistical reasons as well. As previously mentioned, CFA modelers tend to prefer that indicators not cross-load. One factor per indicator has always been the case in the PTSD literature, reflecting theoretically clean DSM symptom criteria. However, indicators can and often do predict significant variance in two or even more latent constructs. For example, responses to an item about trauma-related amnesia might indeed be associated with the negative alterations in cognitions and mood (NACM) factor in a DSM–5 four-factor model, but it might also explain some amount of covariance in an avoidance factor as well. Not modeling cross-loadings means that some amount of covariance between indicators and the factors to which they contribute less but still significant variance is suppressed. Suppressed covariance at the indicator level tends to leak into higher correlations between factors. Also, as noted by Dillon et al. (1987), failing to account for cross-loadings may also induce improper solutions.
The implications of high factor correlations are that large parts of supposedly distinct factors are accounted for by other factors. This in turn suggests that the latent constructs in question lack a considerable degree of construct validity. At extreme values, they represent what is known in the modeling literature as factor collapse. Factor collapse occurs when two factors are so highly correlated with one another that they are basically indistinguishable. This can manifest in a number of ways, depending on how the model has been parameterized. Factor correlations may get very high if the factor variances are set to 1.0; alternatively, if a unit loading anchoring constraint is used that allows factor variance to be estimated, factor variance may approach zero.
CFAs of PTSD Prior to DSM–5
Immediately prior to the publication of DSM–5 (American Psychiatric Association, 2013), one set of findings in the CFA literature was clear: the three-factor (3F) model of PTSD in the DSM–IV (American Psychiatric Association, 1994) consisting of reexperiencing, avoidance/numbing, and hyperarousal factors was statistically inferior to two different four-factor (4F) models. King, Leskin, King, and Weathers’s (1998) 4F emotional numbing model divided avoidance/numbing into separate avoidance and numbing factors and left reexperiencing and hyperarousal as is. Simms, Watson, and Doebbeling’s (2002) 4F dysphoria model similarly left reexperiencing intact and separated avoidance and numbing symptoms but combined the latter with three hyperarousal items for a general dysphoria factor, with the remaining symptoms comprising a leaner hyperarousal factor. Prior to DSM–5, there was also support for a five-factor (5F) dysphoric arousal model that divided hyperarousal into two factors (Elhai, Biehn, et al., 2011), but in general, 4F models were dominant. Four-factor models were so successful they became the basis for the proposed revision of PTSD for DSM–5 (Friedman, Resick, Bryant, & Brewin, 2011). The DSM–5 model reflected an amalgam of the King et al. (1998) and Simms et al. (2002) models. Reexperiencing symptoms remained unchanged, avoidance and numbing were split, numbing was broadened to encompass a wider range of dysphoria symptoms and relabeled NACM, and hyperarousal was left conceptually similar. In addition, the number of symptoms in the new third and fourth symptom criteria was increased by three; this is unrelated to CFA findings but could have an impact on the multidimensionality of future models.
Three trends in CFA modeling prior to DSM–5 (American Psychiatric Association, 2013) are notable. First, the symptoms of reexperiencing remained together in all models. Second, beginning with the King et al. (1998) 4F model, avoidance was modeled as a single factor using the two symptoms that reference avoiding. Third, CFA modeling following King et al. divided up what had been numbing and hyperarousal symptom clusters into various smaller factors. With the exception of trauma-related amnesia, none of these symptoms explicitly referenced trauma events.
CFAs of PTSD Since DSM–5
Whereas CFAs were already quite common in the PTSD literature prior to the publication of DSM–5 (American Psychiatric Association, 2013), since 2013, their rate of publication has increased. In addition to there being more CFA studies per year, the models proposed in these studies have included more factors, continuing the trend in increasing multidimensionality. Armour and colleagues’ review (Armour, Müllerová, et al., 2016) identified five-, six- (6F), and even seven-factor (7F) models that had been tested and supported in the literature by 2015. In addition to testing the 4F emotional numbing model (by using NACM in place of numbing), other models included a rearrangement of the DSM–IV 5F dysphoric arousal model in which one symptom of hyperarousal was incorporated into the dysphoric arousal factor (Demirchyan, Goenjian, & Khachadourian, 2015); a 6F anhedonia model that divided NACM into anhedonia and general negative affect (Tsai, Armour, Southwick, & Pietrzak, 2015); a 6F externalizing behavior model that divided the six DSM–5 hyperarousal items into three factors of two items each, externalizing behavior, anxious arousal, and dysphoric arousal (Tsai, Harpaz-Rotem, et al., 2015); and a 7F hybrid model that combined the 6F anhedonia and 6F externalizing behavior models (Armour et al., 2015).
In the current review, we examined patterns among best-fitting models of PTSD using DSM–5 (American Psychiatric Association, 2013) symptoms. We paid particular attention to the number of indicators per factor and correlations between factors because it was our hypothesis that the proliferation of factors observed since the publication of DSM–5 was premised on accepting underidentified and highly correlated factors. In addition, we examined whether and how researchers externally validated their preferred models.
Methods
We first searched for CFA studies of PTSD that used the DSM–5 (American Psychiatric Association, 2013) symptoms of PTSD and were published in English-language peer-reviewed journals. Literature searches were conducted in the PsycINFO and MEDLINE databases using the keywords posttraumatic stress disorder, PTSD, factor analysis, and latent structure. We also looked for additional references in a recent review of the latent structure of PTSD (Armour, Müllerová, et al., 2016). This process identified 40 research articles published prior to our search date (August 23, 2017).
Inclusion criteria were the use of measures covering the 20 DSM–5 (American Psychiatric Association, 2013) symptoms of PTSD, the application of CFA, and publication in a peer-reviewed journal since 2013. Once these criteria were applied to the initial sample of 40, the sample was reduced to 23 publications. One article reported CFA results (Yang et al., 2017) that had been published in another as the first wave of a longitudinal invariance study (Wang et al., 2017); only the findings from the latter were used in the current review. One article reported results for two independent samples (Sachser et al., 2017) and another for four independent samples (Cyniak-Cieciura, Staniaszek, Popiel, Pragłowska, & Zawadzki, 2017). Our final sample was 23 publications, representing 27 independent samples. Sample sizes averaged 610.52 (minimum 134, maximum 4,624).
Results
Table 1 presents the models that were judged as best fitting in the articles we reviewed (Table S1 in the Supplemental Material available online presents the data for this review). Of the 27 CFAs included, the most complex models were judged to fit best in 22 (81%) studies (18/23 publications, 78%), 4 reported other preferred models, and 1 (Forbes et al., 2015) did not identify a preferred model. Of the 22 studies fitting the most complex models, 19 tested the 7F hybrid model, and 16 found it best fitting. One (Konecky, Meyer, Kimbrel, & Morissette, 2016) attempted to test the 7F hybrid model and reported a Heywood case, resulting in fitting the next most complex model, the 6F anhedonia model. The 6F anhedonia model was the most complex model tested in two CFAs and fit best in those two and in a third in which it was judged superior to the 7F hybrid model. Either the 7F hybrid or the 6F anhedonia model fit best in 19 of the 27 samples (70%). The 5F anhedonia model fit best in one sample, the 4F emotional numbing model fit best in five samples, and for one sample, the best-fitting model was a novel two-factor (2F) model that combined reexperiencing and avoidance in a proposed PTSD factor and NACM and hyperarousal in a nonspecific dysphoria factor (Hunt, Chesney, Jorgensen, Schumann, & deRoon-Cassini, 2017). One study tested 4F and 2F models but reported equivalent fit across tested models and did not state a preferred model (Forbes et al., 2015). The number of parameters reported for best-fitting models across studies and samples ranged from 41 to 61.
Table 1. Models Tested in the PTSD Literature Since Publication of DSM–5
Note: PTSD = posttraumatic stress disorder; DSM–5 = fifth edition of the Diagnostic and Statistical Manual of Mental Disorders; Dys = dysphoria; R = reexperiencing; A = avoidance; NACM = negative alterations in cognition and mood; H = hyperarousal; DA = dysphoric arousal; AA = anxious arousal; AN = anhedonia; EB = externalizing behavior.
Of the four CFAs (15%) in which the most complex model was not judged as best fitting, two presented goodness-of-fit indices that were slightly better for the most complex model (the 7F hybrid) but chose other models: one on the basis of similarity of fit indices, parsimony, and factor correlations less than .85 (Carragher et al., 2016) and the other on the basis of similarity of fit indices and theoretical arguments (Hunt et al., 2017). High factor loadings were also a concern in the one study not explicitly stating which model fit best or was preferred by the authors (Forbes et al., 2015).
Of the best-fitting models, the 6F anhedonia model included two underidentified (i.e., two-item) factors. The 6F externalizing behavior and 7F hybrid models both included four underidentified factors. For six CFAs (23%), the 4F emotional numbing model with one underidentified factor (avoidance) fit best, and for one sample (and one study), the preferred model included no underidentified factors (Hunt et al., 2017).
Factor correlations for best-fitting models were reported in 11 studies (representing 11/27 CFAs because none of these studies reported CFAs for more than one sample). Correlations across all studies (n = 199) had a mean of .73 (SD = .12) and a median value of .73. The minimum value was .38 (anxious arousal with anhedonia in the 7F hybrid model; Mordeno, Nalipay, Sy, & Luzano, 2016) and the maximum .98 (negative affect with externalizing behavior in the 7F hybrid model; Zhou, Wu, & Zhen, 2017). Figure 1 presents the distribution of factor correlations. Notably, almost a third (n = 62, 31%) were r ≥ .80. Table 2 presents descriptive statistics for each factor (SDs were not included because of the small number of publications).

Figure 1. Correlations (n = 199) between factors in best-fitting models for the fifth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM–5; American Psychiatric Association, 2013) posttraumatic stress disorder (PTSD).
Table 2. Descriptive Statistics for Factor Correlations Presented in the Literature Since Publication of DSM–5
Note: DSM–5 = fifth edition of the Diagnostic and Statistical Manual of Mental Disorders; NACM = negative alterations in cognition and mood.
Seventeen CFAs (63%; in 13 publications) examined associations with external variables in attempts to validate models. (Because of the volume of this information, these data are not included in Table S1; they are available on request from the first author.) Seven of these used only one external variable, four only two, three used four, one used five, one used seven, and one used nine. There was great variety in the nature of variables used for validation. Most common were measures of clinical diagnoses (four CFAs reported associations with anxiety, four with depression, four with borderline personality disorder, one with panic disorder), followed by more limited discrete psychological phenomena (three reported associations with anger, two with aggressive behavior or hostility, two with suicidal ideation, and one each for fear, impulsivity, alcohol use, sleep disturbance, guilt, and perceived injustice), broader constructs related to well-being (two CFAs reported associations with quality of life, two with independent and interdependent construals of self, and one each with internalizing, externalizing, somatic symptoms, mental functioning, physical functioning, and resilience), and exposure to trauma events (four CFAs). Most associations were measured using Pearson correlation coefficients, with linear regression coefficients reported in two studies, odds ratios in one, and structural equation model coefficients in one. Effect sizes were also quite heterogeneous, with magnitudes ranging from below .10 to above .90. Given the variety of constructs and types of associations, we felt that an inferential statistical summary would be misleading.
Discussion
Our analysis of the PTSD CFA literature published since the advent of DSM–5 (American Psychiatric Association, 2013) leaves us with the disappointing conclusion that support for increasing multidimensionality has only been possible because researchers have, for the most part, not followed basic premises concerning identifying factors and correlation between factors. It appears that our field has focused almost entirely on the magnitude of fit statistics while ignoring several preconditions for using CFA to examine construct validity. Our review of factor models shows that in the four years following publication of the DSM–5, the number of underidentified factors in commonly published PTSD models increased from one (avoidance) to two (in the 6F anhedonia model) and four (in the 6F externalizing behavior and 7F hybrid models). Moreover, many of the two-indicator factors seem to be composed of doublets, meaning they may be modeling similarity of language in the items rather than presenting evidence of latent factors. We doubt that such linguistic similarities are of significant clinical interest. Even if they were of significant clinical interest, establishing them as evidence of strong factors would involve more indicators. Increasing multidimensionality has built on trends begun prior to DSM–5: reexperiencing has remained as it ever was, avoidance is still modeled using the two symptoms that explicitly reference avoidance, and increasing multidimensionality has been drawn from the remaining set of symptoms that do not explicitly reference trauma (with the exception of trauma-related amnesia). With a fixed number of items and a push for more and smaller factors, the relative relevance of each factor to the PTSD construct likely decreases—no matter how coherent factors may be statistically. With 20 items and six or seven factors, it seems likely that at least one of these factors will be less relevant to clinical diagnosis. 
Furthermore, using these multiple factors in a diagnosis is problematic. Lachenbruch (1988) studied the effect of multiple-hurdles decision making on the sensitivity and specificity of diagnosis. In particular, he noted that the effect was to suppress sensitivity and favor specificity. In the context of PTSD, the net effect would be to rule out cases that should not be ruled out. Moreover, the reliability of small factors cannot be sufficiently high to justify using them for high-stakes decisions such as diagnoses.
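The multiple-hurdles effect can be illustrated with a minimal sketch. It assumes, purely for illustration, that the hurdles are conditionally independent given true diagnostic status and that every hurdle must be passed for a positive diagnosis; the per-hurdle values of .90 and .80 are hypothetical and are not taken from Lachenbruch (1988) or from any study reviewed here.

```python
from math import prod

def conjunctive_rule(sensitivities, specificities):
    """Sensitivity and specificity when ALL hurdles must be passed.

    Assumes hurdles are conditionally independent given true status
    (an illustrative simplification, not an empirical claim).
    """
    # A true case is detected only if every hurdle detects it.
    sens = prod(sensitivities)
    # A non-case is wrongly flagged only if every hurdle wrongly flags it.
    spec = 1 - prod(1 - s for s in specificities)
    return sens, spec

# Hypothetical hurdles, each with sensitivity .90 and specificity .80.
for k in (1, 4, 7):
    sens, spec = conjunctive_rule([0.90] * k, [0.80] * k)
    print(f"{k} hurdles: sensitivity = {sens:.3f}, specificity = {spec:.5f}")
```

Under these assumptions, sensitivity falls with every added hurdle while specificity approaches 1, which is the pattern described above.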
In addition to modeling underidentified factors, correlations between factors in the recent PTSD CFA literature were high. Factor correlations in clinical psychology would be expected to be high because of the nature of psychological distress, but many studies reported correlations greater than .80. The implications of high correlations are that large parts of supposedly distinct factors are accounted for by other factors. In the current review, on average over half of the variance of one factor (as estimated by squaring the mean correlation) was accounted for by the variance of others, and for almost a third, this value was 64% or higher (i.e., r ≥ .80). These values suggest that the latent constructs they represent lack discriminant validity; at extreme values, they represent factor collapse. One need not be fluent in factor analysis concepts such as factor collapse and Heywood cases to understand that correlations between two constructs on the order of .8 and .9 suggest that the two constructs might not be meaningfully distinct from one another. It has been noted that the remaining gap between such values and 1.0 might well be accounted for by minor factors and other forms of small systematic variance (Podsakoff, MacKenzie, Lee, & Podsakoff, 2003).
It might be argued that small amounts of unique variance that display modest degrees of discriminant validity might be meaningful. However, our review found that this was rarely examined in tests of external validity. And even for cases in which external validity was examined, we maintain that high factor correlations remain problematic. If these factors were used as regressors in a structural regression model, they would represent substantial multicollinearity. Some basic calculations suggest that indicators of multicollinearity such as high variance inflation factors and a large condition number occur for factor correlations greater than .9, suggesting that this may be a useful line of demarcation, after which caution should be exercised. In the context of regression, highly associated factors would be likely to generate highly equivocal models that would often be subject to the usual ills of multicollinearity, such as suppression effects or unstable partial regression equations. It is difficult to say whether this would be ameliorated by using structural regression models (i.e., SEMs), but it seems unlikely.
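The kind of basic calculation referred to above can be sketched for the simplest case of two correlated factors used as predictors. This is an illustrative reconstruction under stated assumptions, not the calculation actually performed for this review: the condition number is taken here as the ratio of the largest to the smallest eigenvalue of the 2 × 2 correlation matrix, and the function name is hypothetical.

```python
def collinearity_indices(r):
    """Shared variance, VIF, and condition number for two predictors correlated at r.

    Assumes exactly two standardized predictors; with more factors the
    indices would generally be larger.
    """
    shared_variance = r ** 2             # proportion of variance in common
    vif = 1.0 / (1.0 - r ** 2)           # variance inflation factor
    # Eigenvalues of [[1, r], [r, 1]] are 1 + |r| and 1 - |r|.
    condition_number = (1.0 + abs(r)) / (1.0 - abs(r))
    return shared_variance, vif, condition_number

for r in (0.70, 0.80, 0.90, 0.95):
    shared, vif, cond = collinearity_indices(r)
    print(f"r = {r:.2f}: shared variance = {shared:.2f}, "
          f"VIF = {vif:.2f}, condition number = {cond:.1f}")
```

Both indices rise slowly at moderate correlations and steeply as r approaches 1, which is why a correlation in the neighborhood of .9 is a plausible point at which to start exercising caution.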
We should be clear that the problems of underidentified factors and high correlations have not been entirely ignored in the recent PTSD literature. Of the studies we reviewed, 11 mentioned that using two-item factors was a limitation. Most went on to explain that two items may underrepresent a construct and left their concern there. Some contended that large numbers of participants somehow lessen the impact of this issue. However, as cited in Marsh et al. (1998), that two-item factors are particularly problematic for small n studies does not mean that they are acceptable for studies that include many participants. In any case, simply mentioning underidentification of factors in a limitations section should not be sufficient. No good clinician would base a clinical judgment on only two points of similar data, and clinical researchers should not either. Similarly, several studies noted high correlations between factors as a limitation, and some even argued for less complex models as a result (e.g., Carragher et al., 2016). Still, the majority of studies we reviewed did not mention the magnitude of factor correlations as a consideration in assessing construct validity.
It might be tempting to dismiss these critiques by arguing that best fit implies that a model is best. However, best fit is based on the mathematical fit of what is presented, which, although necessary, is not sufficient to show construct validity. Fit statistics will indicate best fit for the model that accounts for the most residual covariation, irrespective of the number of indicators per factor. Models that include doublets are expected to fit best because they account for relative differences in covariation better than models that capture variation among more than two indicators. But two indicators do not allow us to conclude whether a supposed factor is evidence of anything more than just a high correlation. In other words, judging best models on statistical best fit alone will almost inevitably lead researchers to model more doublets.
Lest we be accused of throwing out the baby with the bath water, we wish to clarify that we do not think that all research on two-item factors is worthless or that all factor correlations reported in the literature suggest that PTSD is made up of only weak factors. Avoidance is usually measured with only two indicators, but we believe that there is a reason to suspect that this weakly measured factor may be a strong factor in disguise. Factor correlations with avoidance are notable for their relatively weaker magnitudes. For these reasons, we suspect that avoidance is a strong factor not measured well and represents its own clinical phenomenon. However, only after our field measures avoidance better—namely, includes more and more diverse indicators that increase its reliability—will this empirical question be answered. Until then, two-item measures of avoidance cannot be considered clinically reliable.
Our review did find a relative bright spot: Fully two thirds of the CFAs published since the DSM–5 (American Psychiatric Association, 2013) use external variables to validate factors. The variety represented among these reports was notable, ranging from correlations with related diagnoses (e.g., depression; Liu, Wang, Cao, Qing, & Armour, 2016) and larger clinical constructs (e.g., internalizing and externalizing disorders; Carragher et al., 2016) to other discrete clinical phenomena (e.g., anger and impulsivity; Armour, Contractor, Shea, Elhai, & Pietrzak, 2016) and exposure to trauma events (e.g., Wortmann et al., 2016). Although encouraging, we contend that this variety may evince a need to better distinguish between convergent validity and comorbidity. We encourage researchers to be more specific in their choices of external variables with which to validate the factors within their models and to justify them thoroughly (see Armour, Contractor, et al., 2016, as a good example). In addition, researchers should be explicit concerning the standards for effect sizes they use to support validation.
Another good practice documented in our review concerned sample sizes. With an increase in the number of factors, there is an accompanying increase in the number of parameters that must be estimated. Our review suggests that since the publication of DSM–5 (American Psychiatric Association, 2013), the number of parameters has increased by half, from 40 to 61. Proliferating factors increases the sample sizes needed to run CFA models. Our review found generally large sample sizes, suggesting that this may be less of a concern in the current literature than other issues. However, investigators would benefit from making use of these large sample sizes to perform cross-validation by examining models on random subsets of the data. This is a missed opportunity.
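One counting convention that reproduces the figures of 40 and 61 is sketched below. It rests on assumptions that are not stated in the text: that the comparison is between a four-factor model of the 17 DSM–IV symptoms and a seven-factor model of the 20 DSM–5 symptoms, that each item loads on a single factor, that factor variances are fixed to 1, that residuals are uncorrelated, and that item intercepts or thresholds are not counted. The function name is hypothetical.

```python
from math import comb

def cfa_free_parameters(n_items, n_factors):
    """Free parameters in a simple-structure CFA with correlated factors.

    Assumed convention: one loading and one residual variance per item,
    factor variances fixed to 1, all factor covariances free.
    """
    loadings = n_items
    residual_variances = n_items
    factor_covariances = comb(n_factors, 2)
    return loadings + residual_variances + factor_covariances

print(cfa_free_parameters(17, 4))  # 17 + 17 + 6 = 40
print(cfa_free_parameters(20, 7))  # 20 + 20 + 21 = 61
```

Under this convention most of the growth comes from the factor covariances, which rise from 6 to 21 as the number of factors goes from four to seven.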
Implications for models of PTSD
CFAs of PTSD symptoms published since DSM–5 (American Psychiatric Association, 2013) have generally come to the same conclusion: More complex models are preferred. But all of these complex models model underidentified factors, propose doublets as separate indicators, and ignore very large correlations between factors—problematic signs that all is not well with the models despite what fit statistics might imply. Errors in designing and interpreting factor models of PTSD are problematic enough; however, given the volume of this research and its impact on clinical nosology, the field has done more than simply do bad statistics. Indeed, during our review of the literature, we were struck by an odd historical paradox: Prior to DSM–5, researchers advocated more complex models (i.e., the 4F models) than was the clinical state of the art; then, once a more complex model was adopted, they began showing how that model was not adequate and began advocating even more complex models. We do not mean to imply that there exists some factor proliferation conspiracy in our field, only that with added items, researchers recognized that they might test more complex theoretical models, and because our field generally ignores the methodological errors we have documented, they were able to support these models. Factor proliferation for PTSD is similar to the proliferation in polythetism brought about by the addition of symptoms to PTSD in DSM–5. Increasing the number of symptoms from 17 to 20 has increased the possibilities for ways that individuals can be diagnosed with PTSD to 636,120 (Galatzer-Levy & Bryant, 2013; Olbert, Gala, & Tupler, 2014). Adding more symptoms, and particularly adding more symptoms for factors that overlap with other clinical diagnoses, seems to have led to an increase in conceptual multidimensionality.
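The figure of 636,120 can be reproduced by counting, for each symptom cluster, the symptom subsets that meet the minimum required and multiplying across clusters. The sketch below assumes the DSM–5 thresholds of at least one of five Criterion B symptoms, one of two Criterion C symptoms, two of seven Criterion D symptoms, and two of six Criterion E symptoms; the DSM–IV thresholds are included for comparison. Function names are illustrative only.

```python
from math import comb

def ways_to_meet(n_symptoms, n_required):
    """Number of symptom subsets containing at least n_required of n_symptoms."""
    return sum(comb(n_symptoms, k) for k in range(n_required, n_symptoms + 1))

def diagnostic_combinations(clusters):
    """Product of qualifying subsets across clusters of (symptoms, minimum required)."""
    total = 1
    for n_symptoms, n_required in clusters:
        total *= ways_to_meet(n_symptoms, n_required)
    return total

dsm_iv = [(5, 1), (7, 3), (5, 2)]         # Criteria B, C, D (17 symptoms)
dsm_5 = [(5, 1), (2, 1), (7, 2), (6, 2)]  # Criteria B, C, D, E (20 symptoms)

print(diagnostic_combinations(dsm_iv))  # 31 * 99 * 26 = 79794
print(diagnostic_combinations(dsm_5))   # 31 * 3 * 120 * 57 = 636120
```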
How can CFA modeling of PTSD be strengthened? We propose two solutions, one conservative, one more radical. The conservative solution is to bring PTSD CFA modeling into line with good latent variable modeling practices by improving our measures. To do so, we must unchain PTSD questionnaire development from a strict reading of the DSM. There is no inherent reason that PTSD questionnaires must have one item corresponding to one and only one symptom in the DSM. Indeed, empirically validated screening measures for many other psychological disorders are characterized by items that are not identical to their description in the DSM. Major depressive disorder (MDD) has 11 symptoms but is measured well by the 21 items of the Beck Depression Inventory–II (Osman et al., 1997; Ward, 2006), the 20 items of the Center for Epidemiological Studies-Depression Scale (Van Dam & Earleywine, 2011), and the 17 items of the Hamilton Depression Rating Scale (Williams, 2001). However, PTSD measurement has relied almost exclusively on a one-to-one symptom-to-item correspondence as a basis for developing survey instruments. Taking this critique seriously would suggest (a) leaving reexperiencing as is, (b) writing more avoidance items, and (c) perhaps reducing or otherwise limiting the number of indicators for other factors. There is one measure that already exists and meets these criteria. The Impact of Event Scale–Revised (IES-R; Weiss & Marmar, 1997) includes eight symptoms of intrusion, five symptoms of avoidance, three symptoms of numbing, and six of hyperarousal. Although not without its limitations (the primary being that it is based on DSM–IV [American Psychiatric Association, 1994] symptoms), the IES-R does serve as an example of what PTSD measurement can be.
Another measure not strictly following DSM item structure is the 35-item Mississippi scale for combat-related PTSD (Keane, Caddell, & Taylor, 1988; and its 39-item civilian version; Norris & Perilla, 1996), which assesses more associated features of PTSD, similarly to how the BDI is broader than a strict DSM-correspondent measure of depression. We note, however, that the Mississippi scale is based on DSM–III (American Psychiatric Association, 1980) and caution that it only includes one item for avoidance.
The second, more radical solution is to address a conceptual flaw in the PTSD diagnosis itself: Much of PTSD is not unique to PTSD. This solution results from a reading of the PTSD CFA literature that interprets the march toward ever more complicated models based on two-indicator factors as indicating a major conceptual problem in the field and with the application of CFA more broadly. In other words, that ever more complicated models have been adopted to explain PTSD evinces more than simply bad statistics but rather is a sign that researchers are finding it necessary to build more and more complex models to explain a weak unitary phenomenon. Historians of science have noted that introducing more complicated models is a necessary precursor to shifts in conceptual paradigms. In the words of T. S. Kuhn (1962), “Through this proliferation of divergent articulations, (more and more frequently they will come to be described as ad hoc adjustments), the rules of normal science become increasingly blurred” (p. 83). Similar problems exist in other areas of psychology.
That much of PTSD is not unique to PTSD is not new; indeed, it has been noted since the inception of the PTSD diagnosis. It is evinced in high comorbidity with depression and anxiety disorders, something that is obvious from comparing symptoms between diagnoses (e.g., NACM shares six symptoms with MDD). That numbing and dysphoria symptoms of PTSD are very similar to depression symptoms has been noted several times (Gros, Simms, & Acierno, 2010; Simms et al., 2002), and factor-analytic studies that have looked at both PTSD and depression symptoms have found that numbing and dysphoria symptoms of PTSD load most strongly with other depression symptoms rather than with PTSD symptoms (Elhai, Contractor, Palmieri, Forbes, & Richardson, 2011). This is not unique to PTSD: The Maslach Burnout Inventory (MBI; Maslach, Jackson, & Leiter, 1996) contains 22 items, of which 9 involve symptoms characteristic of depression, whereas the second edition of the Shirom-Melamed Burnout Measure contains 14 items, of which 11 involve symptoms characteristic of depression (Shirom & Melamed, 2006). In the PTSD literature, this overlap is among the symptoms that for 20 years now have been shifted around to form new factors. In other words, the PTSD CFA literature seems to suggest that PTSD is not one thing. A more careful and parsimonious interpretation of the PTSD CFA literature suggests that reexperiencing and avoidance are the only pieces of PTSD that meet standards of construct validity (although the latter is not measured well), and the rest is largely general negative affect of the sort typified by mood and anxiety disorders.
The more radical critique of the PTSD CFA literature holds that reexperiencing trauma events and going to great lengths to avoid reminders of them may be valid components of PTSD and that they are related to anxiety and depression. Good models of anxiety and depression and negative affect in general already exist, and if researchers want to model them, they should use those models rather than developing multiple models for constructs as if the constructs were unique to PTSD. By contrast, almost all the focus in modeling has been on carving up NACM and hyperarousal symptoms. Instead, we should focus our measurement energy on what makes PTSD PTSD. We found two examples of such research. Hunt et al. (2017) modeled a two-factor solution for DSM–5 (American Psychiatric Association, 2013) criteria in which the first factor comprised reexperiencing and avoidance and the second comprised the rest of the symptoms and found that the fit was statistically significant and not substantively inferior to the 7F hybrid model. Forbes et al. (2015) fit a similar two-factor solution and found that it fit well. Although the field need not limit itself to two-factor models, these examples do provide what we would consider more conceptually coherent models of PTSD.
Limitations
We do not claim to propose solutions for all of what ails PTSD research. There are several areas in which the PTSD literature could use improvement, and we do not pretend to have covered all problems. Nor do we wish to portray the PTSD CFA literature as worthless. As reported, two thirds of the articles we reviewed examined concurrent validity using variables conceptually related to PTSD. In addition, in the course of our review, we found several examples of innovative methodological approaches. C. M. Chen, Yoon, Harford, and Grant (2017) fit a bifactor model using exploratory structural equation modeling in which a general distress factor accounted for most of the variance and orthogonal factors were distinguished on top of that. Several studies examined changes in prevalence rates based on different models of PTSD (e.g., Cyniak-Cieciura et al., 2017). All of this is in line with good external validation techniques (Elhai & Palmieri, 2011), and we encourage their continued application in the literature. We limited our review to DSM–5 (American Psychiatric Association, 2013) models using 20 symptoms and focused on statistical issues related to CFA not because other areas are not important or notable but because we feel that the statistical limitations in the CFA literature are currently the most problematic for PTSD’s construct validity.
We also wish to make clear that we come to these critiques as researchers who have engaged in the very practices we decry. We have published using two-item factors, moved items from numbing and hyperarousal to support a novel factor, and published models with high factor correlations (Rasmussen, Smith, & Keller, 2007; Rasmussen, Verkuilen, Ho, & Fan, 2015). Even when we have noted the limitations of these practices (Rasmussen et al., 2015), we have still held our noses and followed the conventions of our field. We write this review understanding full well the pressures to publish such work and the dilemmas posed for psychologists by the strict reading of the PTSD diagnosis on the one hand and the multiple forms of negative affect on the other. Rather than a limitation per se, we hope that this history of engagement in the field lends our arguments some measure of credibility.
Recommendations
In closing we offer the following suggested guidelines for the PTSD CFA literature:
To be identified, factors must have three indicators, and ideally they should have more than three. Researchers should use at least three items to identify factors in models that they wish to test. Underidentified factors are not sufficient evidence of valid clinical phenomena.
Items should be written to avoid conceptual duplication and thus avoid doublets. If doublets occur, researchers should provide careful justification for using them.
Statistical fit should not be the only measure of models’ acceptability. Although some fit statistics penalize for number of parameters, none include standards for factor identification or factor correlations.
Factor correlations should be included in criteria for judging what is an acceptable model. We contend that a phenomenon that accounts for a large amount of the variance of another lacks sufficient construct validity without considerable external justification. While evidence suggesting a meaningful cutoff value that would invalidate a factor model does not exist, the construct validity of factor models with high factor correlations should be addressed by examining the highly correlated factors’ discriminant validity vis-à-vis clinically meaningful external variables.
PTSD researchers should develop measures that are not limited by the number and letter of DSM–5 (American Psychiatric Association, 2013) symptoms. Particular attention should be paid to writing items for avoidance, which up to now has been measured by only two indicators.
PTSD researchers should consider modeling NACM and hyperarousal using well-researched multidimensional models of depression and anxiety rather than treating them as areas of measurement unique to PTSD.
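The first guideline above, on the number of indicators per factor, follows from a simple counting argument that can be sketched as follows. The sketch considers a single factor in isolation and assumes that its variance is fixed to 1, so that the free parameters are one loading and one residual variance per indicator; within a larger model, a two-indicator factor may borrow information from its correlations with other factors, which this illustration ignores. The function name is hypothetical.

```python
def single_factor_df(n_indicators):
    """Degrees of freedom for one factor considered in isolation.

    Assumes factor variance fixed to 1: one loading and one residual
    variance are estimated per indicator.
    """
    # Unique variances and covariances among the indicators.
    unique_moments = n_indicators * (n_indicators + 1) // 2
    free_parameters = 2 * n_indicators
    return unique_moments - free_parameters

for p in (2, 3, 4, 5):
    df = single_factor_df(p)
    if df < 0:
        status = "underidentified"
    elif df == 0:
        status = "just-identified"
    else:
        status = "overidentified"
    print(f"{p} indicators: df = {df} ({status})")
```

With two indicators there are fewer observed moments than parameters; with three the factor is just-identified, and only with four or more does the factor become testable on its own.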
Supplemental Material
Supplemental material, Rasmussen_Supplemental_Table for When Did Posttraumatic Stress Disorder Get So Many Factors? Confirmatory Factor Models Since DSM–5 by Andrew Rasmussen, Jay Verkuilen, Nuwan Jayawickreme, Zebing Wu and Sydne T. McCluskey in Clinical Psychological Science
Footnotes
Action Editor
Scott O. Lilienfeld served as action editor for this article.
Author Contributions
A. Rasmussen, J. Verkuilen, and N. Jayawickreme developed the study concept and contributed to the study design. A. Rasmussen, N. Jayawickreme, and Z. Wu contributed to data collection and performed the data analysis. A. Rasmussen, J. Verkuilen, N. Jayawickreme, and S. T. McCluskey were responsible for interpretation and drafted the manuscript. Z. Wu was responsible for edits and formatting references. All the authors approved the final manuscript for submission.
Declaration of Conflicting Interests
The author(s) declared that there were no conflicts of interest with respect to the authorship or the publication of this article.
