Abstract
In psychiatry, severity of mental health conditions and their change over time are usually measured via sum scores of items on psychometric scales. However, inferences from such scores can be biased if psychometric properties such as unidimensionality and temporal measurement invariance for instruments are not met. Here, we aimed to evaluate these properties for common measures of depression (Patient Health Questionnaire–9) and anxiety (Generalized Anxiety Disorder Assessment–7) in a large clinical sample (N = 22,362) undergoing psychotherapy. In addition, we tested consistency in dimensionality results across different methods (parallel analysis, factor analysis, explained common variance, the partial credit model, and the Mokken model). Results showed that while both Patient Health Questionnaire–9 and Generalized Anxiety Disorder Assessment–7 are multidimensional instruments with highly correlated factors, there is justification for sum scores as measures of severity. Temporal measurement invariance across 10 therapy sessions was evaluated. Strict temporal measurement invariance was established in both scales, allowing researchers to compare sum scores as severity measures across time.
The assessment of mental health conditions, whether for the purpose of research, screening, diagnostics, or outcome evaluation in therapy, plays a crucial role in psychological and psychiatric research, as well as in clinical practice. Despite progress in recent years, mental health research still lacks biological markers (Prata et al., 2014; Venkatasubramanian & Keshavan, 2016), and relies largely on questionnaires and scales assessing subjectively rated somatic and psychological symptoms which are hypothesized to be related to candidate diagnostic syndromes (Kapur et al., 2012). Therefore, it is of utmost importance that the measurement indicators used by clinicians to determine whether someone needs help, benefits from therapy, or has progressed to recovery are psychometrically valid and reliable. If they are not, measurement indicators do not reflect the measured construct and the true progress of the patient. This may lead to patients staying in therapy for an unnecessarily long time and incurring extra cost, or being discharged from clinical services before true recovery is reached. This calls for careful assessment of the psychometric properties of popular scales.
Both unidimensionality and temporal measurement invariance (hereafter TMI) are critical psychometric properties for scales which are used for assessment of mental health in epidemiological and clinical research as well as in therapeutic practice. Particularly in clinical settings, measurement tools for mental health conditions are often used over time to monitor individual improvement and recovery. Simple sum scores (whether for the total scale or for subscales) are utilized for simplicity and convenience. Unidimensionality is a necessary (yet not sufficient) condition for the meaningful interpretation of sum scores (Heene et al., 2016) and TMI is an additional condition for the meaningful interpretation of sum score changes over time.
Fried et al. (2016) investigated unidimensionality and TMI in four common scales for depression (Hamilton Rating Scale for Depression, Quick Inventory of Depressive Symptoms, and two versions of the Inventory of Depressive Symptoms [clinical and self-rated]) which routinely use sum scores as a summary statistic in research and clinical practice. They found that neither property held in any of the scales, which challenges “the interpretation of sum scores and their changes as reflecting one underlying construct” (p. 2). Here, our primary aim is to replicate and extend this work by investigating dimensionality and TMI (a) for different measurement instruments, (b) for depression as well as for anxiety, (c) in a larger sample, (d) using 10 time points (rather than 2 as in Fried et al. [2016]), and (e) using a more extensive set of methods to explore the issue of unidimensionality versus multidimensionality of the scales from different perspectives.
In this study, we analyzed two patient reported outcome measures routinely used to monitor depression and anxiety therapy outcomes in a major U.K. primary mental health service: the Patient Health Questionnaire–9 (PHQ-9; Kroenke et al., 2001) and the Generalized Anxiety Disorder Assessment–7 (GAD-7; Spitzer et al., 2006). We focused on the following goals:
First, we tested the dimensionality of the item sets comprising the PHQ-9 and GAD-7, treating all responses as ordered categories (ordinal data). In addition, we evaluated whether different psychometric techniques, when applied to the same data set, provided consistent answers. Dimensionality refers to the number of latent variables that can be estimated from the data and is thus closely related to the scoring of the questionnaire. Indeed, unidimensionality of the instrument (i.e., a single latent variable) is one of the requirements for the justification of using sum scores (the total of the item scores) as summary statistics. This is because, simply put, unidimensionality assures that a single score is a defensible way of scoring each individual (Zwitser & Maris, 2016). It is, however, not a sufficient condition as it does not say what mathematical form such a score should take, that is, how such a score should be generated. More stringent psychometric requirements may apply to justify using sum scores and they depend on the psychometric model. We discuss sufficient conditions and their evaluation within factor analytic and item response theory (IRT) frameworks in the appendix. When an instrument measures multiple constructs, scoring each construct separately (i.e., making sum scores for subscales) may provide more useful and psychometrically sound statistics (Smith et al., 2009). However, in both research and clinical practice, sum scores are frequently used without strong empirical evidence for the unidimensionality of the instrument. For example, the Hamilton Rating Scale for Depression (Hamilton, 1960), one of the most commonly used depression measures in clinical practice, is often scored using a sum score of 17 (out of 21) items despite considerable evidence indicating its multidimensionality (Bagby et al., 2004; Hamilton, 1967; Shafer, 2006).
Hamilton himself recommended scoring dimensions separately instead of using a “total crude score,” yet these recommendations are regularly ignored. This might also be the case for other questionnaires with a potentially multidimensional structure where the existence of separate constructs is ignored, and unidimensionality is effectively “assumed.” In addition, there is sometimes considerable heterogeneity between studies evaluating dimensionality for the same instrument. For example, PHQ-9 and GAD-7 have been investigated by different authors and found to be unidimensional by some (e.g., Gonzalez-Blanch et al., 2018; Lowe et al., 2008) but multidimensional by others (e.g., Beard & Bjorgvinsson, 2014; Elhai et al., 2012).
Second, we tested TMI in the PHQ-9 and GAD-7. TMI refers to the degree to which construct validity of the instrument stays stable over time and is thus closely related to the fairness of temporal comparisons of scores. If TMI holds, changes in the sum score of a given sample represent actual differences in the construct measured through the rating scale (Fried et al., 2016). If TMI does not hold, observed differences in sum scores over time do not necessarily reflect (and cannot be fully attributed to) temporal changes of the latent variable. We provide a TMI investigation, comparing PHQ-9 and GAD-7 across 10 time points.
Apart from extending the work of Fried et al. (2016), this study has three additional aims. The first is to investigate whether various methods for dimensionality assessment provide consistent outcomes when the results of their analyses are compared. The second is to argue and showcase that multidimensional scales may still be usefully summarized using a sum score. The third is to illustrate a number of different psychometric techniques that can be used for the assessment of dimensionality. We provide statistical code to implement each method and synthetic data. We hope this will enable readers to adopt our examples, explore these methods, and conduct sets of evaluations on their own data.
Method
Setting
The Improving Access to Psychological Therapies (IAPT) program in England began in 2008 with a direct objective to improve access to evidence-based psychological treatment for common mental disorders such as anxiety and depression. The program has continued to expand over time and currently assesses over 1.6 million people with common mental disorders annually, delivering therapy to approximately 1.06 million people. It aims to increase public access to psychological therapies approved by the National Institute for Health and Care Excellence through offering flexible referral routes (including self-referral and stepped care pathways). Accordingly, the IAPT program provides low- (Step 2) or high-intensity (Step 3) treatment to people aged 16+ years. Low-intensity IAPT approaches include guided self-help, psychoeducation, computerized cognitive–behavioral therapy, behavioral activation, and structured group physical activity programs (Clark, 2018). In high-intensity IAPT services, face-to-face cognitive behavioral therapy is the predominant approach, although there is a wider range of recommended treatments (e.g., eye movement desensitization and reprocessing, interpersonal psychotherapy, counselling for depression, compassion-focused therapy, and integrative counselling). In high-intensity IAPT, patients receive seven sessions on average over a period of 3 to 4 months. Nationally, recovery rates exceed 52%, about a quarter of patients (25.7%) do not improve, and a small percentage (5.8%) deteriorate. Dropout rates are relatively high (approximately 46%).
Primary Measures: PHQ-9 and GAD-7
At each therapy session, IAPT therapists routinely assess depression and anxiety symptomatology using the nine-item PHQ-9 (Kroenke et al., 2001) and the seven-item GAD-7 Questionnaire (Spitzer et al., 2006). Both scales were adopted by the IAPT program nationally because of their sound validity (Cameron et al., 2008; Maroufizadeh et al., 2019; Spitzer et al., 2006; Titov et al., 2011), reliability (Johnson et al., 2019; Maroufizadeh et al., 2019), sensitivity and specificity (Levis et al., 2019; Spitzer et al., 2006) and brevity. They are officially used to monitor recovery rates across all IAPT services. Total scores on both instruments are computed as a sum score of items (response categories are identical for both instruments: 0 = not at all; 1 = several days; 2 = more than half the days; 3 = nearly every day). Thus, PHQ-9 scores can range from 0 to 27, where scores of 5, 10, 15, and 20 represent cutpoints for mild, moderate, moderately severe, and severe depression, respectively. GAD-7 scores can range between 0 and 21. Scores of 5, 10, and 15 represent cutoff points for mild, moderate, and severe anxiety, respectively. In IAPT, individuals are described as at “caseness” if they score at or above the clinical cutoff for depression (PHQ-9 ≥ 10; Manea et al., 2012) or anxiety (GAD-7 ≥ 8) and are in recovery if they score below these thresholds for both measures. Here, we have analyzed the PHQ-9 and GAD-7 data from the first 10 therapy appointments.
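The scoring rules described in this section can be illustrated with a minimal Python sketch. The function and constant names are hypothetical, and the label "minimal" for scores below 5 is an assumption added for completeness; the cutpoints and caseness thresholds are those quoted in the text.

```python
# Severity bands as (lower bound, label), highest band first.
# Cutpoints follow the text; the "minimal" label is an assumption.
PHQ9_BANDS = [(20, "severe"), (15, "moderately severe"),
              (10, "moderate"), (5, "mild"), (0, "minimal")]
GAD7_BANDS = [(15, "severe"), (10, "moderate"), (5, "mild"), (0, "minimal")]
PHQ9_CASENESS = 10  # PHQ-9 >= 10
GAD7_CASENESS = 8   # GAD-7 >= 8


def sum_score(responses, n_items):
    """Sum item responses after checking item count and the 0-3 coding."""
    if len(responses) != n_items:
        raise ValueError(f"expected {n_items} items, got {len(responses)}")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("responses must be coded 0, 1, 2, or 3")
    return sum(responses)


def severity(score, bands):
    """Return the label of the highest band whose lower bound is reached."""
    for lower, label in bands:
        if score >= lower:
            return label
    raise ValueError("score must be non-negative")


def at_caseness(phq9_items, gad7_items):
    """Caseness: at or above the cutoff on either measure."""
    return (sum_score(phq9_items, 9) >= PHQ9_CASENESS
            or sum_score(gad7_items, 7) >= GAD7_CASENESS)
```

Under this reading, recovery corresponds to `at_caseness` returning `False`, that is, both sum scores falling below their thresholds.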
Participants
We included all IAPT patients from two trusts (Cambridge and Peterborough Foundation Trust and Sussex Partnership NHS Foundation Trust) who received services between February and December 2018. Data from 22,362 individuals were available for the first therapy session, of whom 66.4% were women, 33.3% were men, and 0.3% had missing data on gender. Mean age of the sample was 40.1 years (SD = 15.4 years). Most individuals in the sample were White (88.2%) and the remainder were divided into four ethnicity categories (1.1% were Indian, 0.8% Asian, 0.7% Black, and 2.4% stated mixed or other ethnicity background). Information on ethnicity for 6.8% of patients was missing. Average patient severity at the start of therapy was moderate, with a sum score mean of 13.6 for PHQ-9 (SD = 6.28) and 12.6 (SD = 5.3) for GAD-7. Histograms of sum scores for both measures are provided in the online Supplementary Figures S1 and S2.
The sample size decreased considerably as available therapy session data increased, due to both dropout and discharge of patients. Yet a subsample of 6,554 individuals had PHQ-9 and GAD-7 scores for 10 therapy sessions. Sample sizes, means, and standard deviations for PHQ-9 and GAD-7 total scores for each therapy appointment are available in Figure 1.

Figure 1. Means and standard deviations for PHQ-9 and GAD-7 sum scores across 10 therapy appointments.
Statistical Analysis
For the assessment of dimensionality, we examined the number of factors needed to describe each questionnaire at each therapy session. Several psychometric approaches were used to test dimensionality, including (a) parallel analysis (PA; Horn, 1965), (b) exploratory factor analysis (EFA), (c) confirmatory factor analysis (CFA), (d) a parametric (Rasch) IRT model, (e) a nonparametric IRT (Mokken) model, and (f) explained common variance (ECV). It is important to note that, for the sake of brevity and clarity, we only report outcomes of analyses relevant for dimensionality assessment. Thus, some typical or recommended outcomes of these psychometric techniques are missing. This note is specifically relevant for the partial credit model (PCM) and the Mokken model.
Confirmatory Factor Analysis
We first assessed the fit of a one-factor model at each measurement point (therapy appointment) to evaluate whether unidimensionality can be justified using a confirmatory approach. The CFA model fit was considered good if the root mean square error of approximation (RMSEA) was 0.06 or lower, the standardized root mean squared residual (SRMR) was 0.08 or lower, and the comparative fit index (CFI) was 0.95 or higher (Hu & Bentler, 1999). We treated the items as ordinal and used the mean- and variance-adjusted weighted least squares (WLSMV) estimator. We used Mplus software (L. K. Muthén & Muthén, 1998-2019) to estimate CFA models.
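To make the fit criteria concrete, the following Python sketch computes RMSEA and CFI from chi-square statistics using common textbook formulas and then applies the three cutoffs named above. It is an illustration only: the function names are hypothetical, software packages differ in details (for example, whether N or N − 1 is used, and how WLSMV statistics are scaled), and the values produced here should not be expected to match Mplus output.

```python
import math


def rmsea(chi2, df, n):
    """Textbook RMSEA: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_base, df_base):
    """CFI: 1 minus the ratio of model to baseline noncentrality."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model, 0.0)
    return 1.0 if d_base == 0.0 else 1.0 - d_model / d_base


def good_fit(rmsea_value, srmr_value, cfi_value):
    """Apply the cutoffs used in the text: RMSEA <= 0.06, SRMR <= 0.08, CFI >= 0.95."""
    return rmsea_value <= 0.06 and srmr_value <= 0.08 and cfi_value >= 0.95
```

Because the three indices are checked jointly, a model with good CFI and SRMR but an RMSEA above 0.06 would be flagged as not meeting the criteria, for example `good_fit(0.09, 0.03, 0.98)` returns `False`.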
Parallel Analysis and Exploratory Factor Analysis
If the unidimensional CFA models did not fit the data, we used PA to determine the number of factors. To compare results with Fried et al. (2016), we mimicked their PA settings: we compared the observed eigenvalues with eigenvalues of randomly drawn data (50 parallel data sets for each analysis, using the 95th percentile of the random eigenvalues) and extracted factors whose eigenvalues exceeded the randomly generated ones. We used the function fa.parallel from the R package psych (Revelle, 2018). Using EFA (in Mplus) with the WLSMV estimator and oblimin factor rotation, we assessed the fit of models with two to five factors (note that a one-factor model was tested using CFA). The most parsimonious model that met the same fit criteria as described above for the CFA model was then selected.
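The logic of parallel analysis can be sketched in a few lines of standard-library Python. This is a simplified illustration and not a reimplementation of fa.parallel: it uses Pearson correlations and principal-component eigenvalues of continuous data, whereas the analyses in this article were run in R on ordinal items. All function names are hypothetical, and the eigenvalue routine is a basic cyclic Jacobi method chosen only to avoid external dependencies.

```python
import math
import random


def correlation_matrix(data):
    """Pearson correlations between the columns of a list-of-rows data set."""
    n = len(data)
    cols = list(zip(*data))
    centered = [[v - sum(c) / n for v in c] for c in cols]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    p = len(cols)
    return [[sum(a * b for a, b in zip(centered[i], centered[j]))
             / (norms[i] * norms[j]) for j in range(p)] for i in range(p)]


def eigenvalues(matrix, sweeps=50, tol=1e-18):
    """Eigenvalues of a symmetric matrix (cyclic Jacobi), largest first."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes the (p, q) element.
                theta = (math.pi / 4 if diff == 0.0
                         else 0.5 * math.atan(2 * a[p][q] / diff))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def parallel_analysis(data, n_sets=50, percentile=95, seed=0):
    """Count leading observed eigenvalues exceeding the random-data percentile."""
    rng = random.Random(seed)
    n, p = len(data), len(data[0])
    observed = eigenvalues(correlation_matrix(data))
    simulated = []
    for _ in range(n_sets):
        noise = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
        simulated.append(eigenvalues(correlation_matrix(noise)))
    rank = min(n_sets - 1, max(0, math.ceil(percentile / 100 * n_sets) - 1))
    thresholds = [sorted(s[j] for s in simulated)[rank] for j in range(p)]
    n_factors = 0
    for obs, thr in zip(observed, thresholds):
        if obs <= thr:
            break
        n_factors += 1
    return n_factors, observed, thresholds
```

The defaults of 50 random data sets and the 95th percentile mirror the settings described above; the stopping rule (retain factors until the first eigenvalue that fails to exceed its threshold) is one common convention and is an assumption here.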
Partial Credit Model
A PCM (Masters, 1982) is a model for polytomous item responses from the family of Rasch models and therefore shares the distinguishing characteristics of that family: separable person and item parameters, raw scores as sufficient statistics (i.e., the sum score carries all the information about the measured attribute of the respondent), and, hence, conjoint item score additivity (Masters & Wright, 1997). A good fit of the data to the Rasch model provides stringent support for the existence of a single, quantitative, and unidimensional psychological variable underlying the scale items (Glas & Verhelst, 1995; Heene et al., 2016). We therefore conclude unidimensionality when all items fit the PCM. Fit is evaluated using indices such as outfit and infit. These statistics are based on standardized residuals, which are the standardized differences between the observations and their expected values according to the Rasch model. Their sum of squares approximates a χ² distribution and the outfit is simply the ratio of the χ² and its degrees of freedom (Wright & Masters, 1990). Infit is an information-weighted form of outfit. The weighting reduces the influence of less informative, low variance, off-target responses. Outfit and infit have an expected value of 1.0 and range from 0 to infinity. Values larger than 1.0 indicate unmodeled noise on a ratio scale (e.g., 1.1 indicates 10% excess noise). Values less than 1.0 indicate overfit of the data to the model, that is, too predictable observations (Linacre, 2002). Rating scale items (such as those of the PHQ-9 and GAD-7) have an acceptable fit when these indices range between 0.6 and 1.4 (Wright & Linacre, 1994). For this analysis we used the R package eRm (Mair & Hatzinger, 2007).
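The definitions of outfit and infit given above can be illustrated with a short Python sketch for a single item. It assumes that person parameters (`thetas`) and item thresholds are already known, whereas in practice they are estimated (here, with eRm); all names are hypothetical and the example is not the eRm implementation.

```python
import math


def pcm_probabilities(theta, thresholds):
    """Category probabilities under the PCM for one person and one item.

    The exponent for category k is the sum of (theta - delta_j) over the
    first k thresholds; category 0 has exponent 0.
    """
    exponents = [0.0]
    for delta in thresholds:
        exponents.append(exponents[-1] + (theta - delta))
    top = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def expected_and_variance(theta, thresholds):
    """Model-implied mean and variance of the item score."""
    probs = pcm_probabilities(theta, thresholds)
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var


def item_fit(observed, thetas, thresholds):
    """Return (infit, outfit) mean squares for one item."""
    sq_resid, variances = [], []
    for x, theta in zip(observed, thetas):
        mean, var = expected_and_variance(theta, thresholds)
        sq_resid.append((x - mean) ** 2)
        variances.append(var)
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(sq_resid, variances)) / len(observed)
    # Infit: squared residuals weighted by the model variance (information).
    infit = sum(sq_resid) / sum(variances)
    return infit, outfit


def acceptable(mean_square, lower=0.6, upper=1.4):
    """Range for rating scale items quoted in the text."""
    return lower <= mean_square <= upper
```

For a four-category item with all thresholds at 0 and a person at theta = 0, every category has probability 0.25, so the expected score is 1.5 with variance 1.25; two such persons responding 0 and 3 would give infit and outfit of 1.8, outside the acceptable range.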
Mokken Model
We also empirically assessed the questionnaire dimensionality within the framework of Mokken models (Mokken, 1971) using the R package mokken (van der Ark, 2012). Mokken models are often seen as a nonparametric version of Rasch models (Stochl et al., 2012). For this, we used cutoffs for Loevinger’s (1947) item scalability coefficients, which, following recommendations, were increased from 0.3 up to 0.45 in 0.05 increments (Stochl et al., 2012). Note that we did not aim to evaluate other constituting properties of the Mokken models (monotonicity and nonintersection of item response functions, local independence) but simply used this approach as an automated engine to explore how it would build unidimensional (sub)scales of the instrument (Gillespie et al., 1987; van der Ark, 2012). Unidimensionality was concluded if the engine extracted only a single Mokken scale and, at the same time, all items from the corresponding instrument were included in this scale.
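As an illustration of the scalability coefficients mentioned above, the sketch below computes Loevinger's H for a scale and Hi for each item as ratios of observed covariances to the maximum covariances attainable given the item marginals (obtained here by sorting both items). This is a minimal sketch under that standard definition, with hypothetical function names; it does not implement the automated item selection procedure of the mokken package, and it assumes every item has some variance.

```python
def _cov(x, y):
    """Population covariance of two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n


def _max_cov(x, y):
    """Largest covariance attainable with the same marginal distributions."""
    return _cov(sorted(x), sorted(y))


def scalability(data):
    """Return (H, [H_i]) for data given as a list of respondents' item scores."""
    items = list(zip(*data))
    k = len(items)
    cov = [[_cov(items[i], items[j]) for j in range(k)] for i in range(k)]
    cmax = [[_max_cov(items[i], items[j]) for j in range(k)] for i in range(k)]
    item_h = []
    for i in range(k):
        num = sum(cov[i][j] for j in range(k) if j != i)
        den = sum(cmax[i][j] for j in range(k) if j != i)
        item_h.append(num / den)
    num = sum(cov[i][j] for i in range(k) for j in range(i + 1, k))
    den = sum(cmax[i][j] for i in range(k) for j in range(i + 1, k))
    return num / den, item_h
```

With these values in hand, each Hi can be compared against a chosen lower bound (0.3 up to 0.45 in the procedure described above) to see whether an item would remain in the scale.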
Omega Hierarchical (ωH) and Explained Common Variance
Hierarchical omega (ωH) is the coefficient proposed by McDonald (1999) which estimates the proportion of variance in total scores that can be attributed to a single general factor. Hierarchical omega can also be interpreted as the reliability coefficient (the larger the coefficient, the more accurately one can predict an individual’s relative standing on the latent variable common to all the scale’s indicators based on their observed scale score) and as the generalizability coefficient (square of the correlation between the scale score and the latent variable common to all the indicators; Revelle, 2018). To calculate ωH we used a function in the R package psych (Revelle, 2018) which estimates a factor model with oblique factor rotation, performs the Schmid–Leiman transformation to find general factor loadings, and then calculates the index itself. The ECV is an index similar to ωH in terms of interpretation, but superior to ωH as an index of unidimensionality as it utilizes only the reliable variance of the sum scores (P. M. Bentler, 2009; Reise et al., 2010; Ten Berge & Sočan, 2004). ECV was computed based on the formula provided by Reise et al. (2010). Both ωH and ECV were used to evaluate the extent to which scores reflect a single latent variable even when the data are multidimensional, that is, in the presence of several highly related subdimensions. Hence, even if the questionnaires are multidimensional, sum scores may be justified if the percentage of ECV is high.
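The two indices can be illustrated with a small Python sketch that starts from standardized loadings of an orthogonal general-plus-group-factor solution, such as the output of a Schmid–Leiman transformation. The loadings below are invented for illustration, the function names are hypothetical, and the formulas are the commonly used definitions rather than a reproduction of the psych package code.

```python
import math


def ecv(general, groups):
    """Explained common variance: general share of all squared loadings."""
    general_part = sum(l ** 2 for l in general)
    group_part = sum(l ** 2 for factor in groups for l in factor)
    return general_part / (general_part + group_part)


def omega_hierarchical(general, groups):
    """Share of sum score variance due to the general factor.

    Assumes standardized items, so each uniqueness is 1 minus the communality.
    """
    uniquenesses = []
    for i, g in enumerate(general):
        communality = g ** 2 + sum(factor[i] ** 2 for factor in groups)
        uniquenesses.append(1.0 - communality)
    general_var = sum(general) ** 2
    group_var = sum(sum(factor) ** 2 for factor in groups)
    total_var = general_var + group_var + sum(uniquenesses)
    return general_var / total_var


# Hypothetical four-item example: one general factor, two group factors.
general_loadings = [0.7, 0.7, 0.7, 0.7]
group_loadings = [[0.3, 0.3, 0.0, 0.0],
                  [0.0, 0.0, 0.3, 0.3]]
omega_h = omega_hierarchical(general_loadings, group_loadings)
ecv_value = ecv(general_loadings, group_loadings)
# Correlation between the sum score and the general factor.
score_factor_correlation = math.sqrt(omega_h)
```

In this invented example the general factor accounts for about 84% of the common variance and about 77% of the sum score variance, which shows how a clearly multidimensional loading pattern can still be dominated by one factor.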
Temporal Measurement Invariance
The assessment of TMI was conducted as an iterative process during which we increased equality constraints on the most parsimonious well-fitting factor structure for both instruments obtained from EFA, correspondingly testing configural (Model 1 [M1]), weak (Model 2 [M2]), strong (Model 3 [M3]), or strict (Model 4 [M4]) invariance. As a first step, a configural invariance model M1 was fit to the data of all measurement points per instrument; the model imposes no equality constraints on the parameters, and only restricts the number of factors to be equal across time. In the next step, the weak factorial invariance model M2 was estimated; M2 constrains item loadings to be equal across time. The strong factorial invariance model M3 additionally constrains thresholds to be equal across time, and the strict invariance model M4 forces all residual variances to be equal on top of all previous constraints. Once estimated, each model is compared with the previous one with respect to the fit to the data. If introducing equality constraints decreases the fit significantly, measurement invariance is rejected. TMI can be established only if M4 is not rejected (Meredith, 1993). We refer the reader to B. Muthén and Asparouhov (2013) for a thorough description of these constraints within Mplus, and Millsap (2011) for a general overview and interpretation of TMI models.
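The stepwise comparison of M1 to M4 can be sketched as a simple decision rule. The Python example below walks up the invariance ladder and stops at the first step where CFI drops by more than a tolerance. The tolerance of 0.01 is a widely used rule of thumb associated with Cheung and Rensvold (2002) and is an assumption here, not a criterion stated in this section; the function name and the CFI values in the usage note are hypothetical.

```python
LEVELS = ["configural", "weak", "strong", "strict"]  # M1, M2, M3, M4


def highest_invariance(cfi_by_level, max_cfi_drop=0.01):
    """Return the most constrained level that is not rejected.

    Each model is compared with the previous, less constrained one; a drop
    in CFI larger than max_cfi_drop (assumed cutoff) rejects that step.
    """
    reached = LEVELS[0]
    for previous, current in zip(LEVELS, LEVELS[1:]):
        if cfi_by_level[previous] - cfi_by_level[current] > max_cfi_drop:
            break
        reached = current
    return reached
```

For instance, with invented CFI values of 0.980, 0.981, 0.982, and 0.981 for M1 to M4 the function returns "strict", whereas a fall from 0.979 to 0.950 between the weak and strong models would stop the ladder at "weak".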
Code Availability
To help the reader conduct our analyses on their own data, we provide the analysis code at https://osf.io/r2e63/.
Data Availability
Data were made available for analysis as part of an exploratory evaluation project (forming part of a National Institute for Health Research (NIHR) program grant for applied research number PG-0616-20003); due to the confidentiality and protection of the original dataset we were not allowed to provide the data. However, we created synthetic data with almost identical descriptive statistics, distributional properties, and covariances/correlations using the R package synthpop (Nowok et al., 2016). The synthetic data can be used to mimic the analyses carried out in this article and are available online at https://osf.io/r2e63/.
Results
Description of PHQ-9 and GAD-7 Sum Scores by Cumulative Appointments
We show means and standard deviations for the PHQ-9 and the GAD-7 sum scores in Figure 1. Those scores suggest that patients improve over time in both depression and anxiety, and the heterogeneity of the sum scores is similar across appointments (the variances appear not to vary). The distribution of sum scores is depicted in the online Supplementary Figures S1 and S2.
Assessment of Dimensionality
Confirmatory Factor Analysis
Fit indices for unidimensional models for PHQ-9 and GAD-7 across therapy sessions are reported in Table 1. For both instruments, goodness-of-fit of the one-factor model varied per fit index and provided a somewhat conflicting message. The CFI, which compares the one-factor model with estimated factor loadings and factor variance constrained to 1 to the null model (i.e., the model where all factor loadings equal 1 and variance of the factor is set to 0), showed an acceptable fit regardless of the time point. Similarly, the standardized root mean squared residual (SRMR), the fit index evaluating the size of residual correlations, showed good fit across time points. On the other hand, RMSEA values showed a consistently poor fit for both the PHQ-9 and the GAD-7. There is no clear explanation for the inconsistency between RMSEA and the other indices; it may stem from the nonlinear interplay between fit of the baseline model and degrees of freedom of the model (Lai & Green, 2016).
Table 1. Fit Indices of CFA.
Note. CFA = confirmatory factor analysis; PHQ-9 = Patient Health Questionnaire–9; GAD-7 = Generalized Anxiety Disorder assessment–7; CFI = comparative fit index; RMSEA = root mean square error of approximation; CI = confidence interval; SRMR = standardized root mean squared residual.
p < .001
Parallel Analysis and Exploratory Factor Analysis
PA suggested that both instruments have a multidimensional structure, although one dominant factor emerged for both instruments at all time points. For the PHQ-9, four factors were extracted, with the exception of the ninth appointment, for which three factors described the data best. For the GAD-7, two factors were extracted for 8 out of 10 time points and three factors were identified at Appointments 1 and 7.
The EFA analyses showed consistent results across time. The minimal number of factors to achieve good fit (i.e., a model having CFI over 0.95 and, at the same time, RMSEA below 0.06) was 3 for the PHQ-9 and 2 for GAD-7. The factorial structure (outlined in a note under Table 2) was stable across time for both instruments. These findings are presented in Table 2.
Table 2. Fit Indices of the EFA Models Which Satisfy Close Fit (CFI > 0.95 and RMSEA < 0.06) Across 10 Therapy Appointments.
Note. EFA = exploratory factor analysis; CFI = comparative fit index; RMSEA = root mean square error of approximation; PHQ-9 = Patient Health Questionnaire–9; GAD-7 = Generalized Anxiety Disorder assessment-7; PA = parallel analysis.
(a) PHQ-9: Factor 1: “Interest,” “Hopeless,” “Feeling Bad,” “Hurt”; Factor 2: “Asleep,” “Tired,” “Appetite”; Factor 3: “Concentrate,” “Moving.” M (SD) factor correlations: Factors 1 and 2 = 0.799 (0.034); Factors 1 and 3 = 0.819 (0.029); Factors 2 and 3 = 0.778 (0.045).
(b) GAD-7: Factor 1: “Nervous,” “Cannot Control Worry,” “Worry Too Much,” “Afraid”; Factor 2: “Trouble Relax,” “Restless,” “Annoyed.” M (SD) factor correlation = 0.731 (0.027).
Please see the online Supplementary Table S1 for full item wording.
Partial Credit Model
The item fit for the PCM is presented in Table 3. Both infit and outfit were in the range for an acceptable item fit across all time points (0.6-1.4). This indicates that all items fit the PCM, which supports a unidimensional factorial structure for both scales.
Table 3. Item Fit Indices of the PCM Models Across 10 Therapy Appointments.
Note. PCM = partial credit model; PHQ-9 = Patient Health Questionnaire–9; GAD-7 = Generalized Anxiety Disorder assessment–7.
Mokken Model
Table 4 shows abridged results of fitting a Mokken model across therapy appointments. For both instruments, a single Mokken scale was extracted based on the recommended threshold of 0.3 for Loevinger’s item scalability coefficient (Hi) (Loevinger, 1947; Mokken, 1971). No items were excluded. We gradually increased the cutoff in line with recommendations up to 0.45 (Stochl et al., 2012), but the results did not change. This provides empirical justification for the unidimensionality of both instruments. In addition, the scalability coefficient H (a measure of strength of the extracted unidimensional scale) was over 0.5 (with the exception of session 1 for PHQ-9, where H = 0.482), which is indicative of “strong homogeneity/unidimensionality” of the extracted scale (Sijtsma & Molenaar, 2002; Stochl et al., 2012).
Table 4. Results of Mokken Automatic Item Selection Procedure Across 10 Therapy Appointments.
Note. A single Mokken scale was always found and no items were excluded. PHQ-9 = Patient Health Questionnaire–9; GAD-7 = Generalized Anxiety Disorder assessment–7; H = scale scalability coefficient; SE = standard error; Hi = item scalability coefficient.
Hierarchical Omega and Explained Common Variance
Based on the hierarchical omega and ECV values in Figure 2, we can conclude that across appointments, 79% to 86% of the variance (73% to 80% of reliable variance) of the sum score of PHQ-9 and 76% to 85% of the variance (74% to 84% of reliable variance) of the sum score of GAD-7 is attributable to variance on the corresponding general factor. Interpretations of ωH allow for two additional conclusions: (a) reliability of both instruments is satisfactory and (b) correlation between sum score and the corresponding general latent variable lies between 0.89 and 0.93 for PHQ-9 and between 0.87 and 0.92 for GAD-7 (computed as square roots of the ωH).
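The correlations quoted above follow directly from the reported ωH ranges, because the correlation between the sum score and the general factor is the square root of ωH. A minimal Python check, using only the rounded ωH bounds given in the text (the variable and function names are illustrative):

```python
import math

# Lower and upper bounds of omega hierarchical across appointments (from the text).
omega_h_ranges = {"PHQ-9": (0.79, 0.86), "GAD-7": (0.76, 0.85)}


def implied_correlations(ranges):
    """Square roots of the omega-hierarchical bounds, rounded to two decimals."""
    return {name: tuple(round(math.sqrt(value), 2) for value in bounds)
            for name, bounds in ranges.items()}


correlations = implied_correlations(omega_h_ranges)
# correlations == {"PHQ-9": (0.89, 0.93), "GAD-7": (0.87, 0.92)}
```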

Figure 2. Omega hierarchical and explained common variance across 10 therapy appointments.
Assessment of Temporal Measurement Invariance
Fit indices of models with constraints specific to each level of TMI are presented in Table 5. Note that TMI constraints are imposed on the most parsimonious well-fitting factor structure derived from the EFA models (three factors for PHQ-9 and two factors for GAD-7). Results are similar for both instruments. Chi-square values suggest significant differences across TMI models, but this finding is expected in large samples regardless of true model differences. All other fit indices suggest negligible differences in fit between configural, weak, strong, and strict invariance models. The fact that the strict invariance models do not fit worse compared with corresponding configural models supports the notion that TMI holds for both the PHQ-9 and the GAD-7. Interestingly, RMSEA and CFI show marginal superiority for more constrained models.
Table 5. Temporal Measurement Invariance Across 10 Therapy Appointments.
Note. PHQ-9 = Patient Health Questionnaire–9; GAD-7 = Generalized Anxiety Disorder assessment-7; CFI = comparative fit index; TLI = Tucker–Lewis index; RMSEA = root mean square error of approximation; SRMR = standardized root mean squared residual.
Discussion
Recently, the concern has been raised that measurement of depression over time is problematic due to violations of psychometric properties that permit usage of sum scores as suitable summary statistics (Fried, 2017; Fried et al., 2016; Shafer, 2006). Such concern is particularly relevant to mental health research as well as clinical practice in which sum scores are often used to monitor change of both depression and anxiety over time. This study aimed to investigate the dimensionality and TMI for two widely used depression and anxiety scales routinely used to monitor therapy outcomes in primary mental health services in the United Kingdom.
Dimensionality
Three of the five applied approaches (PA, EFA, and CFA) suggested a multidimensional structure of both scales. Parametric (PCM) and nonparametric (Mokken model) IRT approaches, however, supported a unidimensional structure. These results do not need to be seen as conflicting. In our interpretation of the models, there is evidence for multidimensionality in both scales, but these dimensions are highly correlated. The ECV and hierarchical omega coefficients, which were derived from a bifactor model framework, suggested that the structure of both scales is dominated by a strong general factor capturing around 80% of the variance of all items. Therefore, we argue that the main finding supports the use of sum scores as a suitable summary statistic for both the PHQ-9 and the GAD-7.
In the literature, factor structures reported for these instruments are inconsistent. For PHQ-9, previous studies reported unidimensional (Gonzalez-Blanch et al., 2018; Keum et al., 2018) as well as two-dimensional structures (Chilcot et al., 2013; Elhai et al., 2012; Guo et al., 2017; Krause et al., 2010; Richardson & Richards, 2008), consisting of somatic and affective factors. Reported GAD-7 structures include unidimensional (Lowe et al., 2008; Sousa et al., 2015), modified unidimensional (Bartolo et al., 2017; Johnson et al., 2019; Lee & Kim, 2019), or two factors (Beard & Bjorgvinsson, 2014; Kertz et al., 2013). We believe that this inconsistency may stem from the methodological plurality in the literature, where different methods support different conclusions—very similar to our own investigation.
Temporal Measurement Invariance
We compared the fit of increasingly constrained models to evaluate TMI of PHQ-9 and GAD-7. For the configural model, we used the most parsimonious, well-fitting factor structure derived from the EFA models (three factors for PHQ-9 and two factors for GAD-7). The fit was similar across configural, weak, strong, and strict invariance models. Chi-square differences between models were found, but ignored due to our extremely large sample size, in which case the use of this statistic is not recommended (P. M. Bentler, 1990). CFI, Tucker–Lewis index, and RMSEA indices even showed a slightly superior fit for more constrained models. These results suggest that measurement invariance holds and provide empirical justification for the comparability of scores across time (Cheung & Rensvold, 2002).
Previously, TMI was supported for a two-dimensional PHQ-9 solution (Elhai et al., 2012; Guo et al., 2017). The results of previous studies in which PHQ-9 was treated as a unidimensional measure are both positive (Gonzalez-Blanch et al., 2018) and negative (Downey et al., 2016) with regard to TMI. Studies for GAD-7 are scarce but homogeneous in support of TMI (Mewton et al., 2014; Naragon-Gainey et al., 2014).
Strengths and Limitations
This study benefits from a primary care sample that is not only large but is also fairly representative of the clinical population seeking psychological therapies (Knight et al., 2020). However, the average number of therapy sessions was eight in our sample, compared to seven in the general IAPT sample. This may indicate that our sample is a little less treatment responsive than the “general” IAPT population.
Our study has several limitations. First, the sample shows notable attrition due to dropout from therapy or discharge of individuals once they reach recovery; only about 30% of the original sample seen at baseline had 10 or more appointments. This is expected because the average number of appointments in IAPT is seven (NHS Digital, 2020). In our sample, 55.5% of cases completed scheduled treatment and 22.11% dropped out before their treatment was finished; the end-of-care reason was unknown for 15.3% of cases, and the remaining cases were discontinued for various reasons (e.g., discharge to secondary care). Arguably, the subsample of individuals with a large number of appointments is structurally different from the original sample, as it consists of individuals who require more treatment. However, we do not necessarily see such structural differences as a limitation. For example, the fact that the dimensionality and factorial structure are the same across appointments (and thus potentially across structurally/qualitatively different subsamples) may indicate measurement invariance across classes of individuals who respond differentially to IAPT therapy. We suggest that conjectures regarding subgroup invariance be evaluated further in future studies.
Second, a potential constraint is that patients were allocated to therapies of different intensity: less severe cases were allocated to low-intensity therapy (46.6% of our sample) and more complex/severe cases (53.4%) to high-intensity therapy; this was not taken into account in our analyses. Therefore, we cannot be sure that we would have found unidimensionality and TMI had we tested the models separately per treatment arm. On the other hand, this can also be seen as an advantage because it suggests that the unidimensionality and the TMI of the PHQ-9 and the GAD-7 hold up in a natural setting in which patients receiving various treatment interventions are combined into a single sample.
Third, we did not evaluate the meaningfulness of sum scores nor the validity of the studied scales from a content validity perspective. Indeed, the item coverage of the PHQ-9 and the GAD-7 may not be ideal. Thus, although the measures seem to be fairly unidimensional and invariant, they may not evaluate the disorders in their full breadth. However, this limitation is not specific to the measures scrutinized here, and applies across mental health measures (Fried, 2017).
Fourth, as indicated above, temporal invariance does not imply subgroup measurement invariance, which we decided not to investigate in this study. In other words, even if PHQ-9 and GAD-7 scores adequately reflect within-individual changes of the disorder, such scores may not provide a fair comparison across subgroups such as gender or ethnicity.
Finally, a technical limitation is that Mplus does not provide robust maximum likelihood estimation to estimate likelihood-based fit indices such as the Akaike information criterion and the Bayesian information criterion, which would provide a more straightforward comparison of TMI models.
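For readers unfamiliar with these criteria, the short Python sketch below shows how the Akaike and Bayesian information criteria are derived from a model's maximized log-likelihood, which is why they require a likelihood-based estimator. The log-likelihoods, parameter counts, and sample size are invented for illustration and do not come from this study. The function names are hypothetical.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (smaller is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L (smaller is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Invented values for two nested invariance models (illustration only).
n_obs = 1000
less_constrained = {"loglik": -12000.0, "k": 60}
more_constrained = {"loglik": -12010.0, "k": 45}

for name, model in (("less constrained", less_constrained),
                    ("more constrained", more_constrained)):
    print(name,
          aic(model["loglik"], model["k"]),
          bic(model["loglik"], model["k"], n_obs))
```

With such criteria, the model with the smaller value is preferred, so competing invariance models can be ranked directly without relying on a chi-square difference test.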
Conclusion
Our results show that both the PHQ-9 and the GAD-7 can be considered multidimensional measures, but each has a strong general factor that explains around 80% of the variance of the unweighted sum score of its items. Hence, we propose that using sum scores for either scale is acceptable. In addition, TMI appears to hold for both scales. This supports the conjecture that meaningful comparisons of sum scores of the PHQ-9 and the GAD-7 over time are justified, which is crucial for longitudinal research as well as for monitoring outcomes in clinical practice.
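One common way to quantify the share of sum-score variance attributable to a general factor is omega hierarchical, computed from the standardized loadings of a bifactor model. The Python sketch below illustrates that calculation. Whether the figure quoted above was obtained in exactly this way is an assumption, and the loadings are invented for a hypothetical seven-item scale with two specific factors; they are not estimates from this study.

```python
def omega_hierarchical(general, specific_groups, residual_variances):
    """Proportion of unit-weighted sum-score variance due to the general factor.

    general            -- standardized general-factor loadings, one per item
    specific_groups    -- list of lists of specific-factor loadings, one list per factor
    residual_variances -- unique (residual) variance of each item
    """
    general_part = sum(general) ** 2
    specific_part = sum(sum(group) ** 2 for group in specific_groups)
    error_part = sum(residual_variances)
    return general_part / (general_part + specific_part + error_part)


# Invented loadings (illustration only): 7 items, two orthogonal specific factors.
general = [0.7] * 7
specific_groups = [[0.3] * 4, [0.3] * 3]
specific_flat = [s for group in specific_groups for s in group]
# With standardized items and orthogonal factors, residual = 1 - g^2 - s^2.
residuals = [1 - g ** 2 - s ** 2 for g, s in zip(general, specific_flat)]

omega_h = omega_hierarchical(general, specific_groups, residuals)
print(round(omega_h, 3))  # roughly 0.82 with these invented loadings
```

A value near .80 means that most of the reliable variance in the sum score reflects the general factor, which is the usual argument for treating the sum score as an essentially unidimensional severity measure.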
Supplemental Material
sj-pdf-1-asm-10.1177_1073191120976863, sj-pdf-2-asm-10.1177_1073191120976863, and sj-pdf-3-asm-10.1177_1073191120976863 – Supplemental material for On Dimensionality, Measurement Invariance, and Suitability of Sum Scores for the PHQ-9 and the GAD-7 by Jan Stochl, Eiko I. Fried, Jessica Fritz, Tim J. Croudace, Debra A. Russo, Clare Knight, Peter B. Jones and Jesus Perez in Assessment
Footnotes
Acknowledgements
We are extremely grateful to the IAPT teams who participated in this study and provided access to the required data. In addition, we thank the Norwich Clinical Trials Unit (NCTU) for its support managing exports of data.
Author Contributions
JS, JF, and EF designed the analysis plan and wrote the first draft. JS carried out data analysis and interpretation and had full access to all the data in the study. All other authors contributed to subsequent versions of the article. JS takes responsibility for the integrity of the data and the accuracy of the data analysis.
Declaration of Conflicting Interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: JS discloses consultancy for IESO digital health. The remaining authors have no conflicts of interest.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This article presents independent research funded by the National Institute for Health Research (NIHR) under its Programme Grants for Applied Research Programme (Reference Number RP-PG-0616-20003). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health. PBJ, JP, and JS received support from the NIHR Applied Research Collaboration (ARC) East of England (NIHR200177). JF is funded by the Medical Research Council Doctoral Training/Sackler Fund and the Pinsent Darwin Fund.
Supplemental Material
Supplemental material for this article is available online.
