Abstract
The Inventory of Complicated Grief (ICG) is a commonly used self-report measure in psycho-oncology, best supportive care, and palliative medicine. However, existing validation studies yielded conflicting results regarding the structural validity. This study provides a psychometric review and conceptual replication of the ICG latent structure to test the hypothesis that existing studies overfit unreliable sources of variance, which overshadow the unidimensionality of the ICG. All proposed latent models identified in the psychometric review were tested in a series of confirmatory and exploratory structural equation models. Specifically, at least five to six latent intercorrelated factors were necessary to reach acceptable model fit. However, a general CG factor accounted for most variance and ICG sum scores showed predictable associations with anxiety and depressive symptoms, which suggests that the ICG is essentially unidimensional. There are indications that other measures of pathological grief show similar inconsistencies. Overall, potentially emerging subfacets of the ICG should not be interpreted as distinct “symptom clusters.” If time constraints are an issue as is often the case in clinical research, complicated grief may just be measured by a reduced item set without a significant loss of information or complexity.
Losing someone is an exceptionally distressing event (Shear, 2015) and will usually cause a limited period of grief. In some cases, however, grieving is complicated by emotional, cognitive, and behavioral maladaptations (Prigerson, Maciejewski, et al., 1995). Often resulting in severely impaired psychosocial functioning (Shear, 2015), these maladaptations serve as the diagnostic rationale for Complicated Grief (CG) (Maciejewski et al., 2016; Mauro et al., 2019; Simon et al., 2020). With a prevalence of up to 11% in the general population, CG poses a substantial disease burden on both the individual and society (Lundorff et al., 2017). One of the most commonly applied self-report measures for CG (Treml et al., 2020) is the Inventory of Complicated Grief (ICG) (Prigerson, Maciejewski, et al., 1995). Originally, CG as measured by the ICG was conceptualized as a unidimensional construct (Jacobs et al., 2000; Prigerson, Maciejewski, et al., 1995). Although several validation studies supported the unidimensionality of the ICG (Carmassi et al., 2014; Lumbeck et al., 2012), some studies reported more complex solutions (with up to six latent factors; Fisher et al., 2017; Simon et al., 2011), and a recent review concluded that “results varied regarding the factor structure” (Treml et al., 2020, p. 424). Although Treml et al. (2020) provide a general overview of measurement instruments for CG, an in-depth review of the ICG, by far the most frequently used one among clinicians, against established quality criteria (e.g., the COnsensus-based Standards for the selection of health status Measurement INstruments [COSMIN] guidelines) is lacking. Without such a review, the underlying causes of the inconsistent factor structure and the extent to which the interpretability of ICG scores is threatened remain unknown. Specifically, existing validation studies and the recent review by Treml et al. (2020) have either not addressed potential causes for insufficient factorial validity (e.g., Li & Prigerson, 2016; Treml et al., 2020) or have reasoned a posteriori that characteristics of the population under study determined the factorial structure of the ICG (e.g., Lifshitz et al., 2022). 
From a clinical and theoretical perspective, the assumption that different (statistical) study populations produce different latent signatures of CG seems plausible and alluring. However, existing studies investigating the factor structure of the ICG did not specify hypotheses on how population effects influence factorial validity but only reasoned such effects a posteriori, which weakens the generalizability of such claims (Deffner et al., 2021). Epistemologically, the finding that the factorial structure of the ICG varies as the population under study varies is a mere correlational finding and as such only a sufficient condition for a causal population effect. The necessary condition for such a causal effect, that is, that keeping the population under study constant does not affect the factorial structure of the ICG, has not yet been tested. If it turns out, as we will show below, that this necessary condition is not met, a different explanation for the insufficient structural validity will be needed. Here, we argue that existing studies on structural validity overfitted unsystematic variance in latent measurement models and underestimated the extent to which a general CG factor accounts for reliable variance (Reise, Bonifay, et al., 2013; Rodriguez et al., 2016). Whereas a lack of unidimensionality would pose a conceptual threat to CG, in practice, ICG total scores might be readily interpretable and might reliably predict other constructs (Luo & Al-Harbi, 2016; Reise et al., 2007; Reise, Bonifay, et al., 2013; Rodriguez et al., 2016). 
The essential unidimensionality of the ICG would offer a simple explanation for the disagreement on the number and content of ICG subfacets in the existing literature without assuming that a potentially unlimited number of population-specific variables have strong influences on the factorial structure of the ICG. In addition, clinical researchers or clinicians could interpret ICG sum scores without having to assume that the expression of CG drastically varies from one patient population to another.
This systematic psychometric review and conceptual replication study therefore (a) systematically reviews and benchmarks the measurement properties of the ICG with a focus on structural validity against well-established and easily interpretable quality criteria, that is, the COSMIN (Mokkink, de Vet, et al., 2018) to gain a more detailed psychometric assessment of the ICG than the existing review by Treml et al. (2020) offered; (b) evaluates and compares proposed measurement models for the ICG and assesses the degree of multidimensionality in ICG items; and (c) evaluates whether the ICG is essentially unidimensional. The overarching aim is to provide a psychometric synopsis of the ICG relevant both to clinical practice and theoretical research.
Method
Study Design and Recruitment
First, we conducted a systematic review of studies reporting measurement properties of the ICG following the COSMIN standards (Mokkink, Prinsen, et al., 2018). Second, to assess multidimensionality, we conducted confirmatory factor analysis (CFA)/bifactor-CFA and exploratory structural equation modeling (ESEM)/bifactor-ESEM analyses to disentangle different sources of construct variance (Morin, Arens, & Marsh, 2016; Morin, Arens, Tran, et al., 2016) in two independent samples of bereaved caregivers of people with cancer in Germany (Sample 1 and replication Sample 2). Third, we assessed whether ICG scores are essentially unidimensional (Reise, Bonifay, et al., 2013; Rodriguez et al., 2016) and provided further evidence for some of the psychometric properties of the ICG. The study procedures for both samples were approved by the Institutional Review Board of the Medical Faculty of Heidelberg University. We report all data exclusions, all manipulations, and all measures in the study.
Sampling Procedure for Sample 1
For Sample 1, we identified bereaved caregivers of cancer patients from the Cancer Registry of the National Center for Tumor Diseases Heidelberg (NCT) and administered paper-and-pencil surveys via mail. Tumor types were restricted to colon, pancreatic, lung, breast, and prostate cancer. All adult caregivers who were able to give informed consent were eligible. Exclusion criteria were the patient’s death having occurred less than 6 months earlier and enrolment of the bereaved caregiver in a cancer treatment trial. A more detailed description of the sampling procedure can be found in Tönnies et al. (2021).
Sampling Procedure for Sample 2
For Sample 2, we identified 21 German grief support websites and online groups in a systematic online search and invited caregivers of deceased individuals to fill in an online survey. We also invited participants to take part in a follow-up survey. A detailed description of the sampling procedure can be found in Haun et al. (2019).
Participants in Sample 1
Of 1,138 patients identified in the Cancer Registry of the NCT, 646 bereaved caregivers were identified as potentially eligible. In total, 298 caregivers participated in the survey (response rate: 46.1%). After excluding multivariate outliers on ICG items (Mahalanobis D²) and cases with missing data for the ICG total score, 259 datasets were analyzed (cf. Tönnies et al., 2021). Table 1 shows the participant characteristics. Histograms of ICG scores, along with estimated polychoric correlations between ICG items, can be found in Online Appendix C.
Table 1. Sample Characteristics for Both Samples.
Note. M = mean; SD = standard deviation; ICG = Inventory of Complicated Grief; PHQ-9 = Patient Health Questionnaire; GAD-7 = Generalized Anxiety Disorder Scale.
For metric outcomes, independent t tests for unequal variances were conducted; for categorical outcomes, χ² tests were conducted.
Variables containing missing data do not add to the total sample size.
Participants in Sample 2
In total, 559 individuals were eligible, of whom 365 participated in the survey (response rate: 65.3%). After excluding multivariate outliers on ICG items (Mahalanobis D²) and two cases with missing data on all ICG variables, 359 datasets were analyzed in this replication study (cf. Haun et al., 2019). Table 1 shows the participant characteristics. Sixty participants filled in the retest survey. The median time interval between test and retest was 12 weeks. Histograms of ICG scores, along with estimated polychoric correlations between ICG items, can be found in Online Appendix C.
Measures for Sample 1 and Sample 2
Inventory of Complicated Grief
The ICG consists of 19 items (Lumbeck et al., 2012). Individuals describe the currently experienced emotional, cognitive, and behavioral states on a 5-point unipolar frequency scale from “0” (never) to “4” (always).
Patient Health Questionnaire–9
The PHQ-9 is a widely used brief depression severity measure with high validity and reliability which scores each of the nine Diagnostic and Statistical Manual of Mental Disorders (4th ed.; DSM-IV; American Psychiatric Association, 1994) criteria as “0” (not at all) to “3” (nearly every day) (Kroenke et al., 2001).
Generalized Anxiety Disorder–7
The seven-item Generalized Anxiety Disorder–7 (GAD-7) is a valid self-report anxiety questionnaire with unidimensional structure and good internal consistency (Hinz et al., 2017). On a 4-point unipolar frequency scale, individuals indicate how often they have experienced symptoms of generalized anxiety during the last 2 weeks.
Part 1: Psychometric Review
We conducted a review of studies reporting any measurement property of the ICG comprised in the COSMIN standards except content validity (Mokkink, de Vet, et al., 2018). Four electronic databases (MEDLINE, PsycInfo, Web of Science, CINAHL) were searched until February 11, 2022 using instrument name and measurement properties as search terms (Terwee et al., 2009) (Online Appendix A). All studies reporting any COSMIN measurement property in any study population were included. We excluded studies that merely reported correlations of ICG scores with other constructs or studies that reported measurement properties of instruments adapted from the original ICG (e.g., the ICG-revised, O’Connor et al., 2010). Two reviewers (AS and MWH) independently screened titles and abstracts of retrieved articles for relevance. Evidence for measurement properties was retrieved independently by two reviewers (AS and JT) and assessed using the COSMIN criteria. For individual studies, evidence on each measurement property is rated from “inadequate,” “doubtful,” “adequate” to “very good.” Evidence on measurement properties is then synthesized across studies as sufficient (+), insufficient (−), inconsistent (+/−), or indeterminate (?). The quality of evidence of each measurement property was rated “high,” “moderate,” “low,” or “very low” according to the Grading of Recommendations Assessment, Development, and Evaluation approach (Mokkink, Prinsen, et al., 2018; Terwee et al., 2012). For a more detailed description, see Online Appendix A.
Part 2: Multidimensionality
To assess the sources of multidimensionality of the ICG, we conducted a series of (Bifactor-) CFA and (Bifactor-) ESEM analyses (Morin, Arens, & Marsh, 2016; Morin, Arens, Tran, et al., 2016) for all latent models of the ICG identified in the psychometric review in both Sample 1 and Sample 2. Analyses were carried out using Mplus v8.4 (Muthén & Muthén, 2019). First, we conducted CFA and ESEM analyses with oblique target rotation. Second, we estimated orthogonal Bifactor-CFA and orthogonal Bifactor-ESEM models with target rotation in both samples to disentangle the amount of variance that is explained by one general CG factor and by orthogonal group factors. Group factors correspond to the proposed latent factors of the ICG. For target rotation, we considered the target to be reached if items loaded most highly on the specified factor (structural replicability, Osborne & Fitzpatrick, 2012; see also Litalien et al., 2017). To account for the Likert-type scale of the ICG, all parameters were estimated using Weighted-Least-Squares estimation with mean and variance adjustment (WLSMV). For model evaluation, we inspected the statistical significance of model parameters along with goodness-of-fit indices (χ², comparative fit index [CFI], standardized root-mean-square residual [SRMR], and root-mean-square error of approximation [RMSEA]). Because commonly applied criteria to evaluate model fit (e.g., Hu & Bentler, 1999) are based on maximum-likelihood (ML) estimated models and are biased (i.e., increased probability of accepting misspecified models) when applied to WLSMV estimated models (Nye & Drasgow, 2011; Shi & Maydeu-Olivares, 2020; Shi et al., 2020; Xia & Yang, 2019), we used Monte-Carlo simulations (see Online Appendix D for details) in R 4.0.2 (R Core Team, 2020) to obtain approximate sampling distributions of model fit indices for correctly specified and misspecified models. 
In particular, we repeatedly simulated data from a “true” model and fitted either correctly specified or incorrectly specified SEM models to this data using WLSMV and recorded CFI, RMSEA, and SRMR. The resulting distributions were then used to obtain cut-off values that minimize Type-II-error (
Part 3: Unidimensionality and Psychometric Properties
Unidimensionality
Omega coefficients derived from bifactor models allow us to assess essential unidimensionality (Reise, Scheines, et al., 2013; Rodriguez et al., 2016). For reliability, we report coefficient
Psychometric Properties
Measurement properties of the ICG are analyzed in accordance with the COSMIN guidelines and the COSMIN study design checklist (Mokkink et al., 2019).
Construct Validity
To illustrate our results on unidimensionality, we compared correlations between comparator constructs (depression and anxiety as measured by the PHQ-9 and GAD-7) and ICG total scores extracted from a well-fitting multidimensional solution with correlations between comparator constructs and ICG total scores extracted from the unidimensional solution in both samples. We hypothesized ICG scores to be moderately correlated with PHQ-9 scores and GAD-7 scores (e.g., Lumbeck et al., 2012; Prigerson, Frank, et al., 1995), irrespective of the factor extraction method. To estimate these correlations, we extracted factor scores from unidimensional CFA models for ICG, PHQ-9 and GAD-7 and from the best fitting Bifactor ESEM model of the ICG (Litalien et al., 2017; Morin, Meyer, et al., 2016; Skrondal & Laake, 2001) in both samples.
Reliability and Measurement Error
To assess reliability and measurement error, we analyzed the subset of the web survey sample that completed the ICG twice (n = 60). We assumed that ICG scores were stable in the interim period. We computed the standard error of measurement (SEM) along with the smallest detectable change (SDC). For test–retest reliability we calculated the intraclass correlation coefficient (ICC; A,1; two-way mixed-effects model with absolute agreement).
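As an illustration of the test-retest coefficient named above, the following Python sketch computes ICC(A,1) from the mean squares of a two-way layout (respondents by administrations), using the standard single-measurement, absolute-agreement formula. The function name `icc_a1` and the example scores are hypothetical and are not taken from the study data; the software and settings actually used for this step are not stated in this section.

```python
def icc_a1(scores):
    """ICC(A,1): two-way model, absolute agreement, single measurement.

    scores: list of rows, one row per respondent, one column per administration.
    """
    n = len(scores)        # number of respondents
    k = len(scores[0])     # number of administrations (2 for test-retest)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares for respondents, administrations, and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical ICG sum scores at T1 and T2 for five respondents (illustration only).
example = [[30, 28], [12, 15], [45, 41], [22, 22], [38, 35]]
print(round(icc_a1(example), 3))
```

Because the administration (column) variance enters the denominator, a systematic shift between T1 and T2 lowers this coefficient, which is what distinguishes absolute agreement from consistency.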
Responsiveness
Following an exploratory approach, we analyzed associations between GAD-7, PHQ-9, and ICG change scores between T1 and T2 in the test–retest subset of Sample 2 (n = 60). Specifically, we tested whether depressive or anxiety symptoms at T1 were associated with ICG change scores between T1 and T2.
Results
Part 1: Psychometric Review
Figure 1 shows a flowchart of the review process. We identified full-text articles for 13 studies published between 1995 and 2020 that report measurement properties of the ICG. Of those, 11 studies were included in the review. Table 2 summarizes the existing evidence on measurement properties. For a detailed description of reported results see Table A3 in Online Appendix A. We found that, except for internal consistency (nine studies of very good quality), criterion validity (one study of very good quality), and construct validity (seven studies of very good quality), there is only weak evidence to support each of the remaining measurement properties of the ICG. There was no study reporting on responsiveness or measurement error. Only two studies provided evidence of moderate quality (downgraded for inconsistency) for measurement invariance with respect to sample type (clinical vs. non-clinical, Fisher et al., 2017) and for differential item functioning between individuals scoring above 25 on the ICG and those scoring below (Masferrer et al., 2017). Most strikingly, we found that evidence on structural validity of the ICG is highly inconsistent, stemming from studies that applied non-optimal methods (only one study reached the level of very good quality, but overall quality of evidence regarding structural validity had to be downgraded to low due to inconsistency). We identified eight different factor models proposed to capture the latent structure of the ICG, ranging from unidimensional models (e.g., Prigerson, Maciejewski, et al., 1995, 5/11 studies), three-factor models (3/11 studies) with factor intercorrelations (Thimm et al., 2019) and without factor intercorrelations (Li & Prigerson, 2016; Lifshitz et al., 2022), four-factor models (Masferrer et al., 2017), to six-factor models with factor intercorrelations (Fisher et al., 2017) and without factor intercorrelations (Simon et al., 2011). 
Most studies applied exploratory factor analysis (EFA)/principal components analysis (PCA) (e.g., Lifshitz et al., 2022; Lumbeck et al., 2012, 7/11 studies). Most EFA/PCAs were based on Pearson correlations (4/7 studies). Three studies (Carmassi et al., 2014; Masferrer et al., 2017; Simon et al., 2011) applied EFA on tetrachoric correlation matrices based on dichotomized items. None of the studies accounted for ordinal data. Five studies applied confirmatory analyses (either CFA or EFA/CFA combined).

Figure 1. Flowchart Displaying the Process of Study Selection.
Table 2. Evidence Synthesis.
Note. NA = not applicable.
Part 2: Multidimensionality
Of the eight identified models, one model was based on only 17 items (Lifshitz et al., 2022), and one model (Fisher et al., 2017) was based on 16 items, leaving five models eligible for comparison within both samples. In addition, we tested a two-factor solution based on the conceptual distinction between traumatic grief and separation distress discussed in the literature (e.g., Jacobs et al., 2000; O’Connor et al., 2010). Because the two-factor solution had only been tested with the ICG-r (O’Connor et al., 2010), we assigned items to each factor based on eyeball validity. For testing all models, we assumed factors to be correlated. Final models are displayed in Figure 2. All standardized parameter estimates for all models in both samples reported below are displayed in Online Appendix B. In a first step, we tested for convergence and whether model-implied covariance matrices were positive definite. In Sample 1, all CFA models converged, but Models 5 and 6 did not produce positive definite covariance matrices. Inspection showed that the two factors Discomfort and Non-Acceptance of Model 5 as well as the two factors Yearning and Behavioral Change of Model 6 were linearly dependent (rs > .95). To subject these models to CFA, we collapsed the respective factors (resulting in Model 5b and Model 6b). Table 3 shows CFA results in Sample 1. All models failed to reach adequate model fit. Although Model 5b did not produce a positive definite model-implied covariance matrix in Sample 2, the results were replicated overall (Table B1 in Online Appendix B). That is, none of the models showed adequate model fit. Turning to Bifactor-CFA, in Sample 1, only Bifactor-Models 3 and 4 converged and implied positive definite covariance matrices, but they showed inadequate model fit (Table 3). This was replicated in Sample 2, where only Model 4 converged, again with inadequate model fit (Table B1). 
In sum, neither CFA nor Bifactor-CFA models captured the latent structure of the ICG to a satisfactory extent. Next, we conducted ESEM analyses for all models to assess whether unspecified cross-loadings between factors explain the inadequate fit (Asparouhov & Muthén, 2009). For Sample 1, all models showed better fit (Table 3). However, this came at the cost of increased model complexity as indicated by higher values for BIC and AIC. Of the ESEM models, Model 6 showed close to adequate model fit. This result was replicated in Sample 2, where Model 6 even reached close to good model fit (Table B2). Note further that factor intercorrelations were reduced dramatically for all models (e.g., from r = .943 to r = .411 in the two-factor model; see Table B13 in Online Appendix B), which indicated that cross-loadings were present (Morin, Arens, & Marsh, 2016). However, the target model was reached in neither Sample 1 nor Sample 2, with three items in each study loading higher on a non-target factor (Tables B9 and B10 in Online Appendix B). Moreover, different items failed to meet the target in each study (Items 1, 5, and 12 in Sample 1 and Items 2, 11, and 17 in Sample 2). Next, we subjected all models to Bifactor-ESEM with orthogonal target rotation. We found that in Sample 1, while the covariance matrix implied by Model 6 was not positive definite, Model 6b showed the best model fit (Table 3). In Sample 2, this result was replicated (Table B2 in Online Appendix B). Again, the target was not met, with five items diverging from the target in Sample 1 and seven items in Sample 2 (Tables B9 and B10 in Online Appendix B). Again, different items did not meet the target (Items 1, 2, 7, 8, and 12 in Sample 1 and Items 1, 2, 12, 13, 16, 18, and 19 in Sample 2). In summary, none of the proposed latent models could be replicated in either study. 
To better interpret the factor loadings of the well-fitting Bifactor-ESEM Model 6b (five group factors plus one general factor), we fitted a five-factor ESEM model and a 5+1-factor Bifactor-ESEM model in Sample 1 using orthogonal geomin rotation. We specified significant loadings (see Table B11) as target loadings to be reached by the same models fitted with target rotation in Sample 2. For ESEM, we found that 11 items and for Bifactor-ESEM three items diverged from this target (Table B12). In sum, the 5- and 5+1-ESEM and Bifactor-ESEM models provided acceptable/good fit to the data but did not produce loading patterns that were replicable between studies. Nevertheless, the acceptable model fit made it possible to use these models to assess essential unidimensionality.
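The replicability criterion applied above (an item meets the target if its largest absolute loading falls on its specified factor) can be sketched in a few lines of Python. The function name `off_target_items` and the loading matrix are hypothetical and serve only to illustrate the rule; the actual loadings are reported in Online Appendix B. For bifactor solutions, this sketch assumes that only the group-factor columns are passed in, which is an assumption about how the rule would be applied rather than a detail stated in the text.

```python
def off_target_items(loadings, target):
    """Return 1-based item numbers whose largest absolute loading
    is not on the factor specified for that item.

    loadings: one row per item, one column per factor.
    target:   index of the specified factor for each item.
    """
    misses = []
    for item, (row, expected) in enumerate(zip(loadings, target), start=1):
        # Factor with the strongest loading, ignoring the sign.
        strongest = max(range(len(row)), key=lambda f: abs(row[f]))
        if strongest != expected:
            misses.append(item)
    return misses


# Hypothetical standardized loadings for four items on two factors.
loadings = [
    [0.62, 0.10],   # item 1, specified on factor 0
    [0.25, 0.48],   # item 2, specified on factor 0 but loads higher on factor 1
    [0.05, 0.71],   # item 3, specified on factor 1
    [-0.55, 0.30],  # item 4, specified on factor 0 (sign is ignored)
]
target = [0, 0, 1, 0]
print(off_target_items(loadings, target))  # [2]
```

Comparing the returned item lists across two samples shows whether the same items or different items miss the target, which is the comparison drawn in the paragraphs above.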

Figure 2. Latent Factor Models Identified in the Psychometric Review and Eligible for Replication.
Table 3. Results of CFA/Bifactor-CFA and ESEM/Bifactor-ESEM of Latent ICG Models (Sample 1, N = 259).
Note. CFA = confirmatory factor analysis; ESEM = exploratory structural equation modeling; ICG = Inventory of Complicated Grief; CFI = comparative fit index; RMSEA = root-mean-square error of approximation; SRMR = standardized root-mean-square residual; BIC = Bayesian information criterion; AIC = Akaike information criterion.
p < .001.
Part 3: Unidimensionality and Psychometric Properties
Unidimensionality
All
Bifactor-Model Derived Omega and Hierarchical Omega Coefficients in Both Samples.
Note. ECV = explained common variance.
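For readers who want to see how coefficients of this kind are obtained, the following Python sketch computes omega, hierarchical omega, and ECV from the standardized loadings of an orthogonal bifactor solution, using the standard formulas described by Rodriguez et al. (2016). The function name `bifactor_indices` and all loadings are hypothetical; the sketch does not reproduce the values in the table.

```python
def bifactor_indices(general, groups):
    """Omega, hierarchical omega, and ECV for an orthogonal bifactor model.

    general: standardized general-factor loadings, one per item.
    groups:  one list per group factor holding that factor's loading for
             every item (0.0 where an item does not load on the factor).
    """
    n_items = len(general)
    gen_sum_sq = sum(general) ** 2                   # (sum of general loadings)^2
    grp_sum_sq = sum(sum(g) ** 2 for g in groups)    # same, per group factor
    # Item uniqueness = 1 - communality in a standardized solution.
    uniqueness = sum(
        1.0 - general[i] ** 2 - sum(g[i] ** 2 for g in groups)
        for i in range(n_items)
    )
    total = gen_sum_sq + grp_sum_sq + uniqueness
    gen_var = sum(l ** 2 for l in general)
    grp_var = sum(l ** 2 for g in groups for l in g)
    return {
        "omega": (gen_sum_sq + grp_sum_sq) / total,
        "omega_h": gen_sum_sq / total,
        "ecv": gen_var / (gen_var + grp_var),
    }


# Hypothetical six-item scale: strong general factor, two weak group factors.
general = [0.7] * 6
groups = [
    [0.3, 0.3, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.3, 0.3, 0.3],
]
print(bifactor_indices(general, groups))
```

When hierarchical omega is close to omega and ECV is high, most reliable variance in the sum score is attributable to the general factor, which is the pattern that motivates treating a scale as essentially unidimensional.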
Construct Validity
Although in Sample 1 there was no missing data on the PHQ-9 or GAD-7, we excluded 12 participants with missing values on PHQ-9 and/or GAD-7 listwise, resulting in a total sample of n = 347 for Sample 2. We fitted unidimensional CFA models to measure PHQ-9, Sample 1:
Responsiveness
Higher levels of depressive and anxiety symptoms at T1 were not associated with ICG change scores between T1 and T2, r = .075 [−.182, .323], t(58) = .575, p = .567, and r = .123 [−.132, .368], t(58) = .965, p = .339, respectively.
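The relation between a correlation, its t statistic, and its confidence interval can be made explicit with a short Python sketch. It assumes a Pearson correlation tested with n − 2 degrees of freedom and a Fisher z interval; the section does not state which procedures were used, so these are assumptions, and the function names are hypothetical. Applied to the first reported correlation (r = .075, n = 60), the sketch yields values close to the reported t statistic and interval, with small differences expected because the input r is rounded.

```python
import math


def r_to_t(r, n):
    """t statistic for a Pearson correlation with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% confidence interval via the Fisher z transformation."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


t_value = r_to_t(0.075, 60)
lower, upper = fisher_ci(0.075, 60)
print(round(t_value, 3), round(lower, 3), round(upper, 3))
```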
Reliability and Measurement Error
The median time interval between test and retest was 12 weeks (interquartile range: 6 weeks), which we considered long enough to prevent recall bias. Expressed as a proportion of the weeks since bereavement, the median test–retest interval was 60% (interquartile range: 73%). Test–retest reliability was ICC(A,1) = .908, 95% confidence interval (CI) = [.85, .94]. The SDC in ICG scores derived from the SEM of 1.853 (n = 60) amounts to SDCind = 5.135 for individuals and SDCgroup = .663 for comparisons between groups. In our sample, 22 individuals (36.7%) showed change scores between T1 and T2 that exceeded SDCind. Bland-Altman analysis indicated no systematic bias between the two administrations, bias = −1.15, 95% CI = [−2.72, 0.42].
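The SDC values above appear consistent with the conventional formulas SDCind = 1.96 × √2 × SEM and SDCgroup = SDCind / √n. The following Python sketch applies these formulas to the reported SEM; that these exact formulas were used is an assumption, and the function name `smallest_detectable_change` is hypothetical.

```python
import math


def smallest_detectable_change(sem, n, z_crit=1.96):
    """Return (SDC for individuals, SDC for groups) from the standard error
    of measurement, assuming the conventional 95% formulas."""
    sdc_ind = z_crit * math.sqrt(2) * sem
    sdc_group = sdc_ind / math.sqrt(n)
    return sdc_ind, sdc_group


# Reported SEM of 1.853 in the test-retest subset (n = 60).
sdc_ind, sdc_group = smallest_detectable_change(1.853, 60)
print(round(sdc_ind, 3), round(sdc_group, 3))
```

The result agrees with the reported values up to rounding of the SEM. Under these formulas, an individual ICG change score smaller than SDCind cannot be distinguished from measurement error at the 95% level.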
Discussion
The present study is the first to (a) provide an in-depth and easily interpretable psychometric review of the ICG by synthesizing evidence on measurement properties using the COSMIN standards, (b) systematically test and compare proposed factor models for the ICG based on the psychometric review to assess the degree of multidimensionality in ICG items, and (c) estimate the degree to which multidimensionality affects the interpretation of the ICG as an essentially unidimensional construct. Regarding the psychometric soundness of the ICG, we found that despite the widespread use of the ICG in longitudinal studies (Milic et al., 2019; Wågø et al., 2017), there is no evidence on the responsiveness and the measurement error of the ICG, and only weak evidence on its reliability. Although existing validation studies as well as interventional studies using the ICG rely on highly heterogeneous samples (cf. Treml et al., 2020), there is only inconsistent evidence on its measurement invariance. Notably, we partially closed this gap by providing estimates of measurement error and ICC. However, the present study focused on the structural validity of the ICG. In this regard, in line with Treml et al. (2020), we find that results are highly inconsistent, with proposed latent ICG models ranging from unidimensional to six-dimensional solutions. In total, we identified six distinct and testable latent models reported in the literature. Based on both confirmatory and exploratory latent modeling approaches, we find that the assumptions of the independent clusters model CFA are too strict to be applied to factor models proposed for the ICG. This is likely because most of the studies included in this review rely on EFA approaches (Marsh et al., 2010). One may argue that our findings contradict results from earlier CFA studies on the ICG. 
For example, Fisher et al. (2017) reported good model fit for the six-factor model in a CFA analysis, as did Ludwikowska-Świeboda and Lachowska (2019) for a unidimensional CFA solution. However, Fisher et al. (2017) applied commonly used ML cutoffs to WLSMV estimated models, potentially leading to biased conclusions (Nye & Drasgow, 2011; Shi & Maydeu-Olivares, 2020). Furthermore, the fact that several residual correlations had to be freed indicates misspecification. Similarly, Ludwikowska-Świeboda and Lachowska (2019) specified residual correlations corresponding to the six factors reported in Simon et al. (2011) as well as additional residual correlations. In sum, these models might not reflect unidimensionality at all but rather several minor factors (cf. Brown, 2015) necessary to achieve acceptable model fit. In our data, systematic improvement in model fit was instead achieved by allowing for cross-loadings in the ESEM framework. But even when allowing for cross-loadings, only a five-factor solution (Sample 1) and a six-factor solution (Sample 2) reached acceptable model fit, without reaching the specified target in either study. Even worse, the well-fitting ESEM solution could not be replicated across datasets. Only one factor (Factor 2) showed moderate evidence for replicability between studies. Upon further inspection, this factor consisted of four items targeting yearning for the deceased (Item 4), inability to trust others (Item 9), estrangement from others (Item 10), and avoidance of remembering the deceased (Item 12). In our view, this observation should be interpreted with caution, as content similarity between these statements seems to be rather low and this factor was not present in any model identified in the psychometric review. 
In addition, we found that construct validity as measured by associations of ICG scores with depression and anxiety symptoms as measured with the PHQ-9 and GAD-7, respectively, did not depend on whether the ICG was modeled as a unidimensional construct or as a general factor with several group factors. This illustrates that the general factor not only captures most of the variance in ICG scores but also shows predictable associations with other constructs. Of course, the generalizability of this result depends on the choice of comparator instruments. Depression and anxiety were, however, most often used to investigate construct validity of the ICG and thus of high importance for the present study. Although the investigation of measurement properties of PHQ-9 and GAD-7 was beyond the scope of this study, we found that both instruments showed inadequate fit with a unidimensional CFA model. This is in line with recent research suggesting that both PHQ-9 and GAD-7 show inconsistent factorial validity (Boothroyd et al., 2019; Johnson et al., 2019; Maroufizadeh et al., 2019). Although further exploring the latent structure of these constructs poses a task for further research, PHQ-9 and GAD-7 sum scores (implicitly assuming unidimensional constructs) represent important clinical outcomes and are the basis for screening depression and anxiety disorders in primary care which is usually the point of entry to care for people with complicated grief (Boothroyd et al., 2019; Costantini et al., 2021; Sapra et al., 2020). Therefore, we do not see this as a threat to our main conclusions. In general, our results regarding the ICG are in line with emerging evidence, that a vast number of psychological self-report measures fall prey to overfitting in validation studies (Rodriguez et al., 2016), which in turn impedes psychometric replication. 
Although some authors suggested that their factor solutions represent the latent nature of CG (e.g., Simon et al., 2011), we took a different perspective. We argued that to assess the degree to which apparent multidimensionality threatens the application of sum scores, one well-fitting solution had to be found. We did not assume this solution to reflect latent psychological dimensions but rather unsystematic noise (cf. Cho et al., 2014). In fact, our results indicate that the ICG is essentially unidimensional, which is not to be equated with good model fit of a unidimensional CFA model (Reise, Bonifay, et al., 2013; Ziegler & Hagemann, 2015). With an increasing number of items, unidimensional CFA models will almost always lead to insufficient fit (Bentler, 2009). Taken together, (a) the degree of unsystematic variance inherent in the ICG items and (b) the large number of items to assess the unidimensional construct make it impossible to find a replicable measurement model for the ICG, while in practice sum scores offer a reliable estimate of CG. Consequently, and given that there are no clear theoretical assumptions against which the latent structure of the ICG could be tested (except the unidimensional model proposed by Prigerson, Maciejewski, et al., 1995), structural equation modeling is not a well-suited method to evaluate the structural validity of the ICG, as it will produce misleading and unreliable results. In fact, the inconsistencies regarding the factorial validity of the ICG might stem from validation studies overfitting unsystematic sources of variance using complex structural equation models that are then interpreted as reflecting the latent nature of the construct of interest. One solution to resolve this problem might thus lie in reducing the number of ICG items based on clearly articulated theoretical grounds. 
For example, the ICG-derived Prolonged Grief Disorder-13 (PG-13) (Prigerson, Boelen, et al., 2021) consists of only 13 items to measure prolonged grief. However, there is some indication that the PG-13 faces the same form of structural inconsistency as the ICG. For example, two recent validation studies of the PG-13 conclude that the PG-13 forms a unidimensional scale based on the eigenvalue criterion (Pohlkamp et al., 2018; Prigerson, Boelen, et al., 2021). Another study that applied CFA reports that the PG-13 forms a two- to three-dimensional construct (Sveen et al., 2020), mirroring the inconsistent results regarding the ICG. This is in line with some authors arguing that the eigenvalue criterion alone does not provide sufficient evidence for unidimensionality unless complemented with other methods of analysis and strong theoretical arguments (Ziegler & Hagemann, 2015). In conclusion, the findings of the present study might very well also apply to measures derived from the ICG, suggesting that unsystematic variance in item correlations is introduced not only by the mere number of ICG items but also by other factors such as the lack of theory-driven scale development (i.e., item pool reduction for the ICG as well as the PG-13 was mainly based on empirical findings).
Some limitations regarding the generalizability of these findings come to mind. As indicated earlier, validation studies of the ICG have relied on heterogeneous samples, which complicates evidence synthesis across studies. The present study relied on a specific population, namely bereaved caregivers of people with cancer in Germany, which poses a potential constraint on generalizing our results. The samples in the present study were, however, similar to samples of other validation studies with respect to age, sex, and relationship to the deceased, and mean ICG scores were neither extremely high nor extremely low in comparison to those reported in other studies (see Table A3 in Online Appendix A).
As we held the study population constant between Sample 1 and Sample 2 and were still not able to replicate a measurement model to a satisfactory extent, it is unlikely that population effects were the main drivers behind the structural inconsistencies reported in the literature. Still, sample effects (i.e., variation between samples from the same population affecting the latent structure of the ICG) could explain our findings. Combining the findings from the psychometric review, the SEM analyses, and the analyses of essential unidimensionality, we conclude that essential unidimensionality offers a more parsimonious and constructive explanation. That is, (a) if researchers needed to interpret the ICG differently in each sample (even if samples stem from the same population), then the generalizability of findings involving the ICG would be severely constrained, and (b) if inconsistent results regarding the factor structure of the ICG could be attributed to observable (or unobservable) sample characteristics, then the ICG would be immune to threats to its structural validity and the statement that the instrument is structurally valid would become tautological. Instead, essential unidimensionality makes it possible to explain inconsistent factorial validity without immunizing the instrument against threats to structural validity and allows researchers to interpret ICG scores sensibly. Further research should, however, investigate potential sampling effects by controlling for sample characteristics using, for instance, post-stratification procedures (e.g., Deffner et al., 2021) or measurement invariance approaches (e.g., Marsh et al., 2020).
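The notion of essential unidimensionality discussed above can be made concrete with two indices that are commonly derived from a bifactor solution: the explained common variance (ECV) of the general factor and omega hierarchical. The following Python sketch is only an illustration under stated assumptions: it assumes an orthogonal bifactor model with standardized loadings, the function names `ecv` and `omega_hierarchical` are hypothetical, and the loadings are invented for demonstration. They are not estimates from the present study.

```python
"""Illustrative sketch: ECV and omega hierarchical from bifactor loadings.

Assumptions (not taken from the study): orthogonal factors, standardized
loadings, and hypothetical loading values chosen only for demonstration.
"""


def ecv(general, groups):
    """Share of common variance attributable to the general factor."""
    general_var = sum(l * l for l in general)
    group_var = sum(l * l for grp in groups for l in grp.values())
    return general_var / (general_var + group_var)


def omega_hierarchical(general, groups):
    """Share of sum-score variance attributable to the general factor."""
    # Unique variance per item: 1 minus its communality (standardized items).
    uniqueness = [
        1.0 - general[i] ** 2 - sum(grp.get(i, 0.0) ** 2 for grp in groups)
        for i in range(len(general))
    ]
    var_general = sum(general) ** 2
    var_groups = sum(sum(grp.values()) ** 2 for grp in groups)
    return var_general / (var_general + var_groups + sum(uniqueness))


# Hypothetical example: six items, one general factor, two group factors.
general = [0.7, 0.7, 0.7, 0.7, 0.7, 0.7]
# Each group factor maps item index -> loading on that group factor.
groups = [
    {0: 0.3, 1: 0.3, 2: 0.3},
    {3: 0.3, 4: 0.3, 5: 0.3},
]

print("ECV:", round(ecv(general, groups), 3))
print("omega_H:", round(omega_hierarchical(general, groups), 3))
```

Higher values on both indices mean that a larger share of the common variance, and of the variance in sum scores, is attributable to the general factor, even when a strictly unidimensional CFA model does not fit well.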
Conclusion
Implications for Clinical Research and Practice
Our results suggest that the ICG is essentially unidimensional and that previous reports of several subfacets are misleading. For clinical researchers and clinicians alike, this finding has several important implications. First, evaluating associations of CG with other constructs of interest or investigating the extent of burden resulting from a loss is possible using ICG sum scores. Second, identifying associations of subfacets of CG with other constructs or investigating subfacet-specific CG profiles in a clinical population is neither necessary nor warranted. In fact, we would strongly advise against interpreting potentially emerging subfacets of the ICG as distinct “symptom clusters,” as other authors have suggested (Simon et al., 2011), because our results suggest that these clusters are neither reliable nor replicable. Finally, if time constraints are an issue, as is often the case in large survey studies or clinical contexts, then CG may just as well be measured by a reduced item set without a significant loss of information or complexity.
Implications for Psychometric Research
Our findings provide a parsimonious explanation for the seemingly paradoxical findings of the last 25 years of psychometric research on the ICG. On the one hand, total scores reflect most of the variance in latent ICG scores and will produce predictable and reliable associations with other constructs such as depressive or anxiety symptoms. After all, it seems that the factorial structure of the ICG is not as complicated as previous research on the ICG suggests. On the other hand, poor fit indices in CFA indicate a lack of unidimensionality and structural validity, which undermines the theoretical conceptualization of the ICG. Hence, structural equation modeling is not well suited to evaluate the structural validity of the ICG in its current form. Further research should address these concerns through more theoretically inspired and parsimonious scale construction.
Supplemental Material
Supplemental material, sj-docx-1-asm-10.1177_10731911221100980 for The Inventory of Complicated Grief—A Systematic Psychometric Review and Conceptual Replication Study of the Structural Validity by Alexander Schakowski, Justus Tönnies, Hans-Christoph Friederich, Mechthild Hartmann and Markus W. Haun in Assessment
Footnotes
Acknowledgements
We would like to thank Ariane Preibsch, MSc, for assisting in the data collection.
Author Contributions (CRediT)
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical Standards
The authors assert that all procedures contributing to this work comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the Helsinki Declaration of 1975, as revised in 2008.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Financial support for this study was provided entirely by a grant from the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) (Grant number HA 7536/1-1). The funding agreement ensured the authors’ independence in designing the study, interpreting the data, writing, and publishing the report.
Supplemental Material
Supplemental material for this article is available online.
Data Availability Statement
The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.
