Abstract
Data aggregation in mental health is complicated by the use of different questionnaires, and little is known about the impact of item harmonization strategies on measurement precision. Therefore, we aimed to assess the impact of various item harmonization strategies for a target and proxy questionnaire using correlated and bifactor models. Data were obtained from the Brazilian High-Risk Cohort Study for Mental Conditions (BHRCS) and the Healthy Brain Network (HBN; N = 6,140, ages 5–22 years, 39.6% females). We tested six item-wise harmonization strategies and compared them based on several indices. The one-by-one (1:1) expert-based semantic item harmonization was the best strategy, as it was the only one that resulted in scalar-invariant models for both samples and factor models. The between-questionnaires factor correlation, reliability, and factor score difference in using a proxy instead of a target measure improved little when all other harmonization strategies were compared with a completely at-random strategy. However, for bifactor models, between-questionnaire specific factor correlation increased from 0.05–0.19 (random item harmonization) to 0.43–0.60 (expert-based 1:1 semantic harmonization) in BHRCS and HBN samples, respectively. Therefore, item harmonization strategies are relevant for specific factors from bifactor models and had little impact on p-factors and first-order correlated factors when the child behavior checklist (CBCL) and strengths and difficulties questionnaire (SDQ) were harmonized.
Consistent advances in mental health sciences usually require the aggregation of samples that, most of the time, can only be achieved by integration and harmonization of phenotypic data. Harmonization refers to combining different data sets and matching compatible variables that belong to a construct of interest. Harmonization of phenotypic data has recently helped to understand the associations of brain development and genetics with dimensions of mental health problems on a much larger scale than before (Bethlehem et al., 2022; Mansolf et al., 2020; Thompson et al., 2022). As an example from neuroscience, mental health has been aggregated within specific constructs, such as depression, general anxiety, and attention-deficit and hyperactivity disorders, so that underlying brain morphology can be analyzed with larger samples (Bethlehem et al., 2022; Giedd, 2019; Schmaal et al., 2016; Zugman et al., 2022). However, data aggregation is frequently complicated by using different phenotypic assessment tools (e.g., questionnaires and interviews), which are nested in each sample. The challenge is that researchers have to make several decisions to guarantee harmonization across different studies without estimating how different the results would be due to using different questionnaires.
Furthermore, the harmonization of different questionnaires may also be impacted by the type of model used, such as a first-order correlated model and a bifactor model (Heinrich et al., 2020; Moore et al., 2020). The first-order correlated model represents a multidimensional structure in which the covariance between symptoms is given by specific latent constructs that are correlated. In a bifactor structure, the shared variance among all symptoms is explained by a general factor, and covariance not explained by this general factor can be explained by uncorrelated specific factors (Reise, 2012). Therefore, this study aims to estimate the impact of item-wise harmonization strategies for aggregating different phenotypic assessment tools using the correlated and bifactor models of psychopathology. This estimation would help to understand the best ways to harmonize phenotypic data using distinct questionnaires, which is crucial to support large scientific initiatives and improve measurement precision.
Item-Level Questionnaire Harmonization
Previous evidence on harmonizing data sets using item-level data frequently relies on the Integrative Data Analysis (IDA) framework (Curran & Hussong, 2009), that is, the analysis of a single data set that consists of two or more separate samples that have been pooled into one, frequently meaning different mental health questionnaires. One critical step for IDA is item selection. Best current practice relies on the expert-based selection of items to match them semantically (i.e., based on face validity), followed by testing factorial invariance between different assessment tools (Bauer & Hussong, 2009; Gondek et al., 2021; Luningham et al., 2020; Mansolf et al., 2020; McArdle et al., 2009; McElroy et al., 2021; B. Muthén & Asparouhov, 2014).
Despite being widely used, the advantages of harmonizing items semantically are largely unknown. First, advantages for harmonization might occur at the level of a higher order domain (e.g., internalizing vs. externalizing items) rather than at the level of the item. Quantifying the benefits of item versus construct matching reveals how each strategy adds to measurement precision. Second, harmonization might be better when other item properties are considered, despite semantic similarity being the norm. For example, harmonizing items by their response frequency might be a way to match items by how people are quantitatively responding to them instead of the content of the items. Third, instead of matching items by face validity (involving human judgment), one alternative new approach is using natural language processing (NLP). NLP is a promising approach for item-wise harmonization. It has been used, for example, to quantify the semantic similarity of open-ended answers to mental health questions (Kjell et al., 2019), to find synonyms from different theoretical models and frameworks in neuroscience (Beam et al., 2021), and to generate databases of medical terms to promote better integration for medical records data (Grossman Liu et al., 2021). Fourth, instead of doing one-by-one (1:1) item matching, one could use a combination of items (i.e., parceling) based on semantic similarity. Combining n items of a given questionnaire to match with one or a combination of items of another can enhance the accuracy of the item-wise harmonization between questionnaires.
Finally, two additional limitations still merit mentioning. First, IDA usually aims to harmonize unidimensional constructs (Curran & Hussong, 2009; Gondek et al., 2021; Luningham et al., 2020; McElroy et al., 2021), and the impact of harmonization procedures while using more complex and multidimensional models of psychopathology, such as the bifactor model, has never been examined. This model is structured to have a general factor (p-factor) that has low sensitivity for indicator replacement (as long as the questionnaire is the same) and residual specific factors with varying reliability (Hoffmann et al., 2022; Levin-Aspenson et al., 2021; Watts et al., 2019). Therefore, the gains for semantic matching might be minimal to nonexistent for the p-factor and greater for the specific factor if precise item matching is made. Moreover, comparisons between measurement models are performed by contrasting model fit, reliability, and factorial invariance using distinct samples. However, the real impact of using different questionnaires in distinct samples can only be measured when samples have both questionnaires. A metric to quantify the deviance between different measures could estimate the impact of having a target (i.e., the default assessment tool) and a proxy questionnaire (i.e., the assessment tool that is being integrated/harmonized) when harmonizing different questionnaires in different samples (Thissen et al., 2015; Tulsky et al., 2019).
Present Research
To address these gaps, our study tested six item-wise harmonization strategies. We sought to test the performance of these strategies by examining each factor model, the correlation of factor scores given each questionnaire, factor reliability, the questionnaire’s invariance, and a new metric, the Root Expected Mean Square Difference (REMSD), that estimates the factor score difference caused by using a proxy measure instead of a target measure. To do so, we used two data sets containing two questionnaires, a target and a proxy questionnaire, and computed the new indicator of deviance between the target and the proxy measure, called the REMSD, modified from Tulsky et al. (2019). We hypothesized that the between-questionnaire factor score correlation, factor reliability, and REMSD will improve when systematic item harmonization strategies are applied in relation to random item harmonization. Furthermore, we hypothesized that NLP would prove to be the best item-matching strategy for questionnaire harmonization, as it has for open-ended answers (Kjell et al., 2019), as well as semantic parceling harmonization due to unconstrained item matching in the same quantity, when compared with other strategies.
Method
Sample
The samples comprise two large-scale developmental cohorts within the Reproducible Brain Charts (RBC) initiative. A complete description of the RBC samples can be found in Hoffmann et al. (2022). The present study contains a child behavior checklist (CBCL) and strengths and difficulties questionnaire (SDQ) data from the Brazilian High-Risk Cohort Study for Mental Conditions (BHRCS), a community-based sample (Salum et al., 2015), and the Healthy Brain Network (HBN), a treatment-seeking sample (Alexander et al., 2017). For this study, we have included baseline data from all BHRCS participants (n = 2,511, aged 6–14, 45% females) and HBN participants with CBCL data (n = 3,629, aged 5–22, 36% females). The final sample comprised 6,140 subjects aged 5 to 22, with 39.6% females. The sample size is considered sufficient for factor analysis (Mundfrom et al., 2005).
Questionnaires
Child Behavior Checklist
The CBCL is a 120-item parent-report assessment of emotional and behavioral symptoms over the past 6 months in subjects aged 6 to 18, answered on a three-point scale (0 = not true, 1 = somewhat/sometimes true, and 2 = very true/often). We used 116 items that encompass eight syndromes: anxious-depressed, withdrawn-depressed, somatic complaints, rule-breaking behavior, aggressive behavior, social problems, thought problems, and attention problems (Achenbach & Rescorla, 2001). In the present study, we extended the recommended age range (6–18 years) to 5 to 22 years based on previous findings that tested this appropriateness (Hoffmann et al., 2022).
Strengths and Difficulties Questionnaire
The SDQ is a 25-item parent-reported screening tool to assess current emotional, hyperactivity, conduct, and social problems plus social skills questions, with five items in each construct (Goodman et al., 2000). It is scored similarly to the CBCL, on a three-point scale (0 = not true, 1 = somewhat true, and 2 = certainly true). Items 7, 11, 14, 21, and 25 were reverse-coded. We excluded the prosocial construct for the present study because it does not fit a psychopathology assessment. Therefore, we used 20 items from the SDQ.
Study Design
We benchmarked six strategies for harmonizing the CBCL and SDQ within a correlated and a bifactor model framework in two samples assessed with both questionnaires. It proceeded in the following steps, explained in detail below: (a) to harmonize items according to six strategies, (b) to estimate correlated and bifactor models with harmonized items for each sample separately, and (c) to test the performance of these item-wise harmonization strategies for these factor models. To accomplish the latter, we examined the factor score correlation between questionnaires in each sample, their factor reliability, tested the questionnaire’s factorial invariance according to each strategy, and calculated the REMSD to estimate the factor score difference of using a proxy measure instead of a target measure while integrating the two samples.
Item Harmonization Strategies
We used two null and four designed strategies to harmonize CBCL and SDQ items as depicted in Figure 1.

Figure 1. Panel Demonstrating the Item Harmonization Strategies for CBCL and SDQ.
The first two null strategies used as comparator strategies were a completely at-random item harmonization and random within constructs/subscales. The first matched items 1:1 regardless of the a priori construct that the item belongs to (this model serves the purpose of defining a baseline model to be compared with incremental strategies). The second matched items 1:1 within the a priori constructs based on the construct harmonization between CBCL and SDQ (described below). This would help to quantify the benefits of item versus construct harmonization, as mentioned earlier. Items were selected randomly for each questionnaire using the “sample” function in base R. Because it is likely that factor models using items harmonized at random would not converge, we used a sequential seed until finding an item set that resulted in converging correlated and bifactor models. The random number generator was seeded as 1 for the first estimation of correlated and bifactor models and added by 1 for each next model estimation.
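The sequential-seed procedure for the completely at-random strategy can be sketched as follows. This is a minimal Python illustration, not the authors' R code: the function names and the `models_converge` callback are hypothetical stand-ins for fitting the correlated and bifactor models, and Python's random number generator will not reproduce the item sets drawn with R's "sample" function.

```python
import random
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def random_item_pairs(cbcl_items: Sequence[str], sdq_items: Sequence[str],
                      n_pairs: int, seed: int) -> List[Pair]:
    """Draw n_pairs items from each questionnaire at random and match them 1:1."""
    rng = random.Random(seed)
    cbcl_draw = rng.sample(list(cbcl_items), n_pairs)
    sdq_draw = rng.sample(list(sdq_items), n_pairs)
    return list(zip(cbcl_draw, sdq_draw))


def first_converging_item_set(cbcl_items: Sequence[str], sdq_items: Sequence[str],
                              n_pairs: int,
                              models_converge: Callable[[List[Pair]], bool],
                              max_seed: int = 1000) -> Tuple[int, List[Pair]]:
    """Start at seed 1 and add 1 until the (hypothetical) model check passes."""
    for seed in range(1, max_seed + 1):
        pairs = random_item_pairs(cbcl_items, sdq_items, n_pairs, seed)
        if models_converge(pairs):
            return seed, pairs
    raise RuntimeError("no converging item set found within max_seed attempts")
```

In practice, `models_converge` would wrap the estimation of both factor models and return True only when both converge; the upper bound `max_seed` is an added safeguard that the source does not mention.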
After these null strategies, we designed four item matching strategies. The first was the response frequency ranking harmonization, which matched items according to how often they were not endorsed (response = 0), according to rank order, within constructs. The CBCL and SDQ have four compatible a priori constructs in which items can be harmonized: internalizing, externalizing, attention/hyperactivity, and social problems (Achenbach & Rescorla, 2001; Kóbor et al., 2013). The frequency ranking strategy considered response frequencies from each sample separately because we used and compared models in two samples (BHRCS and HBN).
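A minimal Python sketch of the response frequency ranking idea is given below. It assumes that items are ranked within each shared construct by the proportion of zero responses and paired rank by rank; the data layout and function names are hypothetical, and the handling of surplus CBCL items (here they are simply left unmatched) is an assumption rather than something the text specifies.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: item name -> (a priori construct, list of 0/1/2 responses)
ItemData = Dict[str, Tuple[str, List[int]]]


def zero_response_rate(responses: List[int]) -> float:
    """Proportion of respondents who did not endorse the item (response = 0)."""
    return sum(1 for r in responses if r == 0) / len(responses)


def ranked_items(items: ItemData, construct: str) -> List[str]:
    """Items of one construct, ordered from most to least often not endorsed."""
    names = [name for name, (c, _) in items.items() if c == construct]
    return sorted(names, key=lambda n: zero_response_rate(items[n][1]), reverse=True)


def match_by_frequency_rank(cbcl: ItemData, sdq: ItemData) -> Dict[str, List[Tuple[str, str]]]:
    """Pair CBCL and SDQ items rank by rank within each shared construct."""
    shared = {c for c, _ in cbcl.values()} & {c for c, _ in sdq.values()}
    pairs = {}
    for construct in sorted(shared):
        # zip stops at the shorter list, so surplus items stay unmatched (assumption)
        pairs[construct] = list(zip(ranked_items(cbcl, construct),
                                    ranked_items(sdq, construct)))
    return pairs
```

Because the ranking depends on observed responses, running the function separately on BHRCS and HBN data can yield different item pairs in each sample, which is why the strategy was applied per sample.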
The second was semantic harmonization using NLP. NLP was carried out using the Siamese Bidirectional Encoder Representation from Transformers (SBERT), a deep-learning framework that embeds sentences as numeric vectors (Reimers & Gurevych, 2019). The cosine of the angle between two such vectors represents how close the two sentences are in the vector space. Although the cosine is mathematically bounded between −1 and 1, the similarity scores were interpreted on a range from 0 (no similarity) to 1 (identical meaning). Technical details can be found in the data analysis section below. All 116 CBCL items and 20 SDQ items were included in the NLP model, and we ranked the similarity scores from the highest to the lowest. We selected the top nonrecurrent pairs of similar items according to the cosine score.
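The selection of top nonrecurrent pairs can be illustrated with a short Python sketch. It assumes a greedy reading of the procedure (walk down the ranked similarity scores and keep a pair only if neither item has been used yet); the function names and the `min_score` cut-off are hypothetical, and small toy vectors stand in for the sentence embeddings.

```python
import numpy as np


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of a and every row of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def top_nonrecurrent_pairs(sim: np.ndarray, min_score: float = 0.0):
    """Greedy 1:1 matching: highest score first, each item used at most once."""
    order = np.argsort(sim, axis=None)[::-1]  # flat indices, highest score first
    used_rows, used_cols, pairs = set(), set(), []
    for flat in order:
        i, j = (int(k) for k in np.unravel_index(flat, sim.shape))
        score = float(sim[i, j])
        if score < min_score:  # hypothetical cut-off; not stated in the text
            break
        if i in used_rows or j in used_cols:
            continue
        used_rows.add(i)
        used_cols.add(j)
        pairs.append((i, j, score))
    return pairs


# Toy embeddings: three "CBCL" items and two "SDQ" items in a 2-dimensional space
cbcl_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sdq_vectors = np.array([[1.0, 0.1], [0.1, 1.0]])
matches = top_nonrecurrent_pairs(cosine_similarity_matrix(cbcl_vectors, sdq_vectors))
```

With real data, the rows would be the embeddings of the 116 CBCL and 20 SDQ item texts. The third toy CBCL item stays unmatched because both SDQ items are already paired with closer items.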
The third strategy was expert-based semantic harmonization with 1:1 item matching. The content agreement was rated by two researchers (M.S.H. and L.K.A.). Any disagreement was decided by a third rater (G.A.S.). Raters aimed to maximize the number of items to be harmonized. Thus, if two or more items from one questionnaire matched a single item in the other questionnaire, one item was chosen by the rater, and the item left out was assessed for harmonization with the remaining items.
The fourth was the expert-based semantic harmonization with item parceling (i.e., combining one or multiple CBCL items with one or multiple SDQ). Again, the content agreement was rated by two researchers (M.S.H. and L.K.A.), and any disagreement was decided by a third rater (G.A.S.).
Factor Models
Correlated Models Using CBCL and SDQ
We used the four compatible constructs between the CBCL and the SDQ: internalizing, externalizing, attention/hyperactivity, and social problems (Achenbach & Rescorla, 2001; Kóbor et al., 2013). In the analysis aggregating CBCL and SDQ, questionnaires were considered a cluster (random effect) to model the nonindependence of standard errors.
Bifactor Models Using CBCL and SDQ
The bifactor model used the same factors as the correlated model, adding the general factor (p-factor) and constraining the specific factors not to correlate (orthogonal model).
Global Fit
Correlated and bifactor structures were modeled using confirmatory factor analysis (CFA). CFA was carried out using delta parameterization and weighted least squares with diagonal weight matrix with standard errors and mean- and variance-adjusted chi-square test statistics (WLSMV). To evaluate global model fit, we used root mean square error of approximation (RMSEA), comparative fit index (CFI), Tucker–Lewis index (TLI), and standardized root mean-square residual (SRMR). RMSEA lower than 0.060 and CFI or TLI values higher than 0.950 indicate a good-to-excellent model. SRMR ≤0.100 indicate adequate fit, and <0.060 in combination with previous indices indicates good fit (Hu & Bentler, 1999). However, some limitations of these cut-offs are warranted for our study. First, CFI/TLI tends to be higher (and RMSEA lower) with an increasing number of dimensions and response categories and a lower number of items, and second, the probability of correctly rejecting a mis-specified model decreases with increasing sample size (Clark & Bowles, 2018; DiStefano & Morgan, 2014; Marsh et al., 2004).
Testing the Performance of Item-Wise Harmonization Strategies
Factor Score Correlation Between Assessment Tools
After estimating each sample’s correlated and bifactor models, factor scores were extracted, and correlations among CBCL and SDQ scores were performed.
Factor Reliability
We used hierarchical omega (ωH) as a measure of factor reliability for factors extracted from the correlated and bifactor models (Rodriguez et al., 2016), ranging from 0 to 1. For the general factor, ωH is computed by dividing the squared sum of the factor loadings on the general factor by itself plus the squared sum of the factor loadings of all specific factors and residual error variance. For the specific factors, ωH is computed by dividing the squared sum of the factor loadings on the specific factor by itself plus the squared sum of the factor loadings of the general factor and residual error variance. Aside from ωH, we estimated eight further model-based reliability indices to evaluate the bifactor models, which are described in full in the Supplementary Material (page 2). Briefly, they were factor determinacy (FD, the correlation between the factor scores and the estimated factor), H index (a measure of construct replicability), explained common variance (ECV), ECV of a specific factor due to itself (ECV-SS), ECV of a specific factor relative to the general factor (ECV SG), ECV of the general factor relative to a specific factor (ECV GS), percent uncontaminated correlations (PUC), and item explained common variance (IECV) (Dueber, 2017; Rodriguez et al., 2016). The construct can be interpreted as unidimensional when ωH is >0.8 and ECV and PUC are >0.7 (Rodriguez et al., 2016).
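The verbal definition of ωH above translates into a few lines of arithmetic. The following Python sketch follows that description directly; the function names and the toy loadings are illustrative assumptions, and the study itself computed these indices with the BifactorIndicesCalculator package in R.

```python
from typing import Dict, List


def omega_h_general(general_loadings: List[float],
                    specific_loadings: Dict[str, List[float]],
                    residual_variances: List[float]) -> float:
    """Share of total score variance attributable to the general factor."""
    general = sum(general_loadings) ** 2
    specifics = sum(sum(loadings) ** 2 for loadings in specific_loadings.values())
    return general / (general + specifics + sum(residual_variances))


def omega_h_specific(specific_loadings: List[float],
                     general_loadings: List[float],
                     residual_variances: List[float]) -> float:
    """Share of subscale variance attributable to one specific factor.

    All three lists refer only to the items that load on that specific factor.
    """
    specific = sum(specific_loadings) ** 2
    general = sum(general_loadings) ** 2
    return specific / (specific + general + sum(residual_variances))


# Toy bifactor model: four items, two specific factors with two items each
g = [0.6, 0.6, 0.6, 0.6]
s = {"internalizing": [0.4, 0.4], "externalizing": [0.4, 0.4]}
resid = [1 - 0.6 ** 2 - 0.4 ** 2] * 4  # standardized items, so residual = 0.48

omega_general = omega_h_general(g, s, resid)
omega_internalizing = omega_h_specific(s["internalizing"], g[:2], resid[:2])
```

In this toy case the general factor accounts for far more reliable variance than either specific factor, which mirrors the typical pattern in bifactor models where specific factors have lower and more variable reliability.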
Questionnaire Factorial Invariance in Each Strategy
Once the fit of the CBCL-SDQ models had been tested for each strategy, we tested factorial invariance (Meredith, 1993). Invariance testing would inform if CBCL and SDQ presented configural and scalar invariance, meaning that differences in factor scores were not due to different assessment tools. We first tested if the models were structurally similar (configural invariance). After that, we tested if items were informing symptoms at equivalent levels and equally correlated with the latent factors (scalar invariance) depending on the questionnaire. Invariance was tested with multigroup CFA (MG-CFA), establishing group equality in model configuration and thresholds and loadings using the option “model = configural scalar” in Mplus. Invariance is established by comparing global model fit indices between constrained models (B. Muthén & Asparouhov, 2002; Ploubidis et al., 2019). Invariance is indicated by ΔCFI < 0.01 supplemented by ΔRMSEA < 0.015 or ΔSRMR < 0.010 between models with increasing levels of constraints, that is, configural versus scalar (Chen, 2007; Meredith, 1993).
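The decision rule for scalar invariance can be written as a small Python check. This sketch assumes that ΔCFI is taken as the drop in CFI and that ΔRMSEA and ΔSRMR are taken as the increase from the configural to the scalar model; the class and function names are hypothetical, and the fit values shown are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass
class FitIndices:
    cfi: float
    rmsea: float
    srmr: float


def scalar_invariance_supported(configural: FitIndices, scalar: FitIndices) -> bool:
    """Apply the cut-offs: dCFI < 0.01 plus dRMSEA < 0.015 or dSRMR < 0.010."""
    delta_cfi = configural.cfi - scalar.cfi        # loss of fit in CFI
    delta_rmsea = scalar.rmsea - configural.rmsea  # worsening in RMSEA
    delta_srmr = scalar.srmr - configural.srmr     # worsening in SRMR
    return delta_cfi < 0.01 and (delta_rmsea < 0.015 or delta_srmr < 0.010)


# Invented fit values, for illustration only
configural_fit = FitIndices(cfi=0.960, rmsea=0.040, srmr=0.050)
mild_loss = FitIndices(cfi=0.955, rmsea=0.045, srmr=0.055)
large_loss = FitIndices(cfi=0.940, rmsea=0.060, srmr=0.070)

print(scalar_invariance_supported(configural_fit, mild_loss))
print(scalar_invariance_supported(configural_fit, large_loss))
```

The first comparison passes because CFI drops by only 0.005, whereas the second fails because CFI drops by 0.020.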
Root Expected Mean Square Difference
We modified the REMSD concept and formula from previous literature for our purposes (Thissen et al., 2015; Tulsky et al., 2019). Specifically, we used CBCL as a target measure (i.e., the reference questionnaire) and SDQ as a proxy measure (i.e., the questionnaire to be harmonized with the target measure). To calculate the REMSD, we estimated (a) the CBCL-SDQ factor models (correlated and bifactor) using the proxy (SDQ) and the target measure (CBCL) in one out of the two samples and (b) the same factor models using the target measure in both samples. REMSD is calculated with the formula below,
where pAtB refers to a subject’s factor score estimated in a model in which a proxy measure p was used in sample A and a target measure t was used in sample B, and tAtB refers to a subject’s factor score estimated in a model in which a target measure t was used in the sample A and B. Therefore, the difference between pAtB and tAtB represents the difference between using a proxy instead of a target measure in the same subject from sample A while modeled with subjects from a different sample B that uses the target measure. In this analysis, samples (BHRCS or HBN) were considered a “cluster” to model the non-independence structure of the data. Figure 2 depicts the calculation of the REMSD.
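A minimal Python sketch of this computation is given below. It rests on an assumption drawn from the name of the index and the definitions above, not on a formula confirmed by this text: the REMSD is taken as the square root of the mean squared difference between pAtB and tAtB across the subjects of sample A, and the per-subject deviation is the absolute difference between the two scores. Function names and the toy scores are hypothetical.

```python
import math
import statistics
from typing import List, Sequence


def per_subject_deviation(proxy: Sequence[float], target: Sequence[float]) -> List[float]:
    """Root of the squared difference per subject, i.e., |pAtB - tAtB|."""
    if len(proxy) != len(target):
        raise ValueError("proxy and target scores must cover the same subjects")
    return [math.sqrt((p - t) ** 2) for p, t in zip(proxy, target)]


def remsd(proxy: Sequence[float], target: Sequence[float]) -> float:
    """Square root of the mean squared proxy-minus-target difference (assumed form)."""
    deviations = per_subject_deviation(proxy, target)
    return math.sqrt(sum(d ** 2 for d in deviations) / len(deviations))


def proportion_exceeding(deviations: Sequence[float], threshold: float) -> float:
    """Share of subjects whose deviation is larger than the threshold."""
    return sum(d > threshold for d in deviations) / len(deviations)


# Toy standardized factor scores for four subjects from sample A
p_a_t_b = [0.0, 1.0, -1.0, 0.5]   # proxy (SDQ) in sample A, target in sample B
t_a_t_b = [0.0, 0.0, -1.0, 1.5]   # target (CBCL) in both samples

deviations = per_subject_deviation(p_a_t_b, t_a_t_b)
summary = {
    "remsd": remsd(p_a_t_b, t_a_t_b),
    "median": statistics.median(deviations),
    "over_0.5": proportion_exceeding(deviations, 0.5),
    "over_1.0": proportion_exceeding(deviations, 1.0),
}
```

Summaries such as the median deviation and the proportion of subjects deviating by more than 0.5 or 1 standardized units correspond to the kind of quantities reported for the REMSD in the Results.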

Figure 2. The Diagram Demonstrates the (1) Data Structure for (2) Factor Modeling, (3) Factor Score Extraction, and (4) Calculation of the Root Expected Mean Square Difference (REMSD). To calculate the REMSD, we estimated (1) a factor model using the proxy (SDQ) measure for sample A (i.e., it could be either BHRCS or HBN) and the target measure (CBCL) for both samples A and B. Afterward, the subject’s factor scores from sample A, estimated using a proxy measure, were subtracted from their factor score while using the target measure. Therefore, the individual difference between these two scores represents the difference of using a proxy instead of a target measure in the same subject while modeled with subjects from a different sample that uses the target measure.
Statistical Analysis
NLP was carried out in Visual Studio Code (version 1.59.0) using the “all-mpnet-base-v2” pretrained model within the “SentenceTransformer” package (Reimers & Gurevych, 2019) in Python version 3.9.6. This pretrained model was designed as a general purpose model and was trained using more than 1 billion pairs; it mapped sentences to a 768-dense vector space to enable the semantic search.
All CFAs were carried out using Mplus version 8.6 (L. K. Muthén & Muthén, 2017) and implemented in R version 4.0.3 using the MplusAutomation package (Hallquist & Wiley, 2018), which was also used to extract factor scores generated in Mplus. Factor scores were estimated using expected a posteriori estimation. All bifactor reliability indices were calculated using the BifactorIndicesCalculator package in R (Dueber, 2017). The CBCL-SDQ correlated and bifactor models used the assessment tool as a cluster to adjust the standard errors for the nonindependence of the data.
Pearson correlation was estimated and plotted using the rcorr function in the Hmisc package (Harrell, 2021). Then, factor attenuation-corrected correlations were calculated to understand if low correlations were due to measurement error. For that, we used Spearman’s disattenuation formula rx′y′ = rxy / √(rxx × ryy), where the correlation between x and y (rxy) is divided by the square root of the product of the omega hierarchical reliability index of the model using CBCL (rxx) and the omega hierarchical reliability index of the model using SDQ (ryy). The corrected correlation shows to what extent low correlations were due to measurement error.
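Spearman's disattenuation formula is simple enough to show as a worked Python example. The function name is hypothetical, and the reliability values used below are invented for illustration; only the observed correlation of 0.43 echoes a value reported in this article.

```python
import math


def disattenuated_correlation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Correct an observed correlation for unreliability in both measures.

    r_xy: observed correlation between CBCL-based and SDQ-based factor scores
    r_xx: reliability (omega hierarchical) of the factor from the CBCL model
    r_yy: reliability (omega hierarchical) of the factor from the SDQ model
    """
    if r_xx <= 0 or r_yy <= 0:
        raise ValueError("reliabilities must be positive")
    return r_xy / math.sqrt(r_xx * r_yy)


# Observed r = 0.43 with invented reliabilities of 0.64 and 0.49
corrected = disattenuated_correlation(0.43, 0.64, 0.49)
print(round(corrected, 3))
```

A corrected value well above the observed one suggests that much of the low correlation reflects measurement error. Note that with very low reliabilities the corrected value can exceed 1, which signals that the correction should be interpreted with caution.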
The REMSD analysis used the sample (BHRCS and HBN) as a cluster to adjust for the nonindependence of the standard errors. Invariance was tested with multigroup CFA (MG-CFA) using the option “model = configural scalar” in Mplus. Code and Supplemental Tables can be found in Supplementary Material, page 3, and in https://osf.io/wnrp4 (Hoffmann et al., 2022).
Results
Harmonization Strategies
Overall, the null and the designed harmonization strategies resulted in similarly good global fit indices for the correlated and bifactor models (Table 1). The “completely at-random” and NLP-based strategies were applied to tested factor models in which CBCL and SDQ items were harmonized regardless of their theoretical construct belonging to the original questionnaire. The correlated and bifactor models successfully converged at the fourth “completely at-random” harmonized item set generation. Therefore, the item set presented in Figure 1 was used for correlated and bifactor models. All other models converged in their first estimation. Item harmonization by response frequency ranking in BHRCS and HBN is described in Supplemental Table S1. The NLP-based strategy did not present enough items to be included in the social problems factor, according to the cosine score (Supplemental Tables S2 and S3). Except for “completely at-random” and NLP, other strategies included items within the constructs of internalizing, externalizing, attention/hyperactivity, and social problems.
Table 1. Global Fit Indices of Harmonized SDQ and CBCL Correlated and Bifactor Model.
Note. The harmonized CBCL-SDQ bifactor model included one general factor. The completely at-random model included four specific factors regardless of the items’ a priori construct. Other models included items within the internalizing, externalizing, attention/hyperactivity, and social problems constructs. Social problems were not included in the NLP-based semantic harmonization due to no match between items of this construct among instruments. SDQ = strengths and difficulties questionnaire; CBCL = child behavior checklist; NLP = natural language processing; BHRCS = Brazilian high-risk cohort study for mental conditions; HBN = healthy brain network; RMSEA = root mean square error of approximation; CI = confidence interval; CFI = comparative fit index; TLI = Tucker–Lewis index; SRMR = standardized root mean square residual.
The 1:1 expert-based semantic item harmonization between CBCL and SDQ presented 95% agreement between raters (19/20 items, Supplemental Table S4), and the expert-based semantic harmonization with item parceling presented 74% agreement (37/50 items, Supplemental Table S5).
The Supplementary Materials describe each model’s factor structures for correlated (Supplemental Tables S6 to S17) and bifactor models (Supplemental Tables S18 to S29) in each sample.
Correlation, Reliability, and Invariance From the Harmonized CBCL-SDQ Models
Tables 2 and 3 present the impact of harmonization strategies on between-questionnaire factor score correlation, factor reliability, and questionnaires’ invariance testing obtained from the correlated and bifactor models, respectively.
Table 2. Factor Correlation, Factor Reliability, and Invariance Testing Between CBCL and SDQ in BHRCS and HBN Samples Using Correlated Factor Model.
Note. Internalizing, attention, externalizing and social problems are not applied as meaningful constructs for the Complete at-random model. SDQ = strengths and difficulties questionnaire; CBCL = child behavior checklist; NLP = natural language processing; BHRCS = Brazilian high-risk cohort study for mental conditions; HBN = healthy brain network; NE = not estimated due to non-convergence.
Table 3. Factor Correlation, Factor Reliability, and Invariance Testing Between CBCL and SDQ in BHRCS and HBN Samples Using Bifactor Model.
Note. Internalizing, attention, externalizing and social problems are not applied as meaningful constructs for the Complete at-random model. SDQ = strengths and difficulties questionnaire; CBCL = child behavior checklist; NLP = natural language processing; BHRCS = Brazilian high-risk cohort study for mental conditions; HBN = healthy brain network; NE = not estimated due to non-convergence.
In both samples, the correlated model yielded similar factor reliability and similar factor score correlations between the same factors measured with different questionnaires (Table 2). For the BHRCS, the mean correlation between CBCL and SDQ harmonized factors increased from 0.59 (completely at-random harmonization strategy) to 0.66 (expert-based semantic parcel harmonization strategy). A similar pattern occurred for HBN, where between-questionnaires factor correlation increased from 0.71 (completely at-random harmonization strategy) to 0.78 (expert-based semantic 1:1 harmonization strategy). Factor reliability presented a similar pattern. Correlated models using harmonized items in the BHRCS presented the lowest mean factor reliability using the NLP-based harmonization strategy (mean ωH = 0.63) and the highest when using expert-based semantic 1:1 harmonization (mean ωH = 0.79). For HBN, the lowest mean factor reliability was achieved using the completely at-random item harmonization (mean ωH = 0.56) and the highest for the expert-based semantic harmonization strategy (mean ωH = 0.77). It is worth observing that the improvement in factor score correlation between CBCL and SDQ and in factor reliability is neither substantial for the correlated models nor substantially different between samples (Table 2).
The bifactor model presented slightly different results (Table 3). For the BHRCS, CBCL and SDQ harmonized p-factor correlation increased from 0.60 (completely at-random harmonization) to 0.70 (expert-based semantic 1:1 harmonization), and specific factors increased from a mean correlation of 0.05 (random within construct harmonization) to 0.43 (expert-based semantic 1:1 harmonization). A similar pattern occurred for HBN, where the harmonized p-factor correlation increased from 0.69 (completely at-random harmonization) to 0.81 (expert-based semantic 1:1 harmonization). Specific factors increased from a mean correlation level of 0.19 (completely at-random harmonization) to 0.60 (expert-based semantic 1:1 harmonization). It is worth noting that questionnaire correlation was higher for the HBN sample and varied more widely depending on the item harmonization strategy in the bifactor models compared with the correlated factor models. Specific factor reliabilities were higher for the HBN sample than BHRCS in the bifactor models, which increased from the completely at-random model to the expert-based semantic 1:1 harmonization strategy (Table 3). Specific factors also presented improvement in other model-based bifactor reliability indices, such as FD, H, and ECV SG (Supplemental Table S30). These results mean that specific factors derived from non-null strategies (a) yield factor scores that are more suitable for use in further analysis, (b) are more likely to be reproducible, and (c) explain more variance over and above the general factor.
Factor attenuation-corrected correlations for the correlated (Supplemental Table S31) and bifactor models (Supplemental Table S32) are described in Supplementary Material, which shows to which extent low correlations were due to measurement error.
For the correlated models, scalar invariance was reached when the completely at-random, response frequency ranking and expert-based semantic item harmonization strategies were used in both samples (Table 2 and Supplemental Table S33). As for the bifactor models, scalar invariance was reached in both samples only when expert-based semantic harmonization was used (Table 3 and Supplemental Table S34). Hence, mean differences in factor scores while using other harmonization approaches between CBCL and SDQ items are due to other characteristics, such as a test–retest phenomenon, rather than a property of the questionnaire that impacts mean scores.
Comparing the Impact of Harmonization With REMSD
REMSD demonstrated that item harmonization strategies have little impact (small effect size) on the standardized factor score difference between using a proxy (SDQ) instead of a target measure (CBCL) in the correlated (see Table 4) and bifactor models (Table 5). Despite a small overall improvement over the null models (i.e., items matched at random) when any other strategy was used, the median REMSD is still higher than 0.375, meaning that factor scores using a proxy measure tend to deviate from the target measure in both correlated and bifactor models (upper results in Tables 4 and 5, respectively). Furthermore, the proportion of subjects who deviate by more than 0.5 (or more than 1) standardized factor score units from the target measure when using a proxy decreased when any other strategy was used compared with null strategies (bottom results in Tables 4 and 5). A proxy versus a target measure presents a lower difference in HBN compared with BHRCS.
Table 4. Root Expected Mean Square Difference of Using a Proxy (SDQ) and Target (CBCL) Measure in BHRCS and HBN Samples Using Correlated Factor Model.
Note. Internalizing, attention, externalizing, and social problems are not applicable as meaningful constructs for the completely at-random model. Root expected mean square difference (REMSD) includes SDQ as proxy instrument and CBCL as target instrument. REMSD is the standardized factor score difference between using the proxy and the target measure. REMSD = root expected mean square difference; SDQ = strengths and difficulties questionnaire; CBCL = child behavior checklist; NLP = natural language processing; BHRCS = Brazilian high-risk cohort study for mental conditions; HBN = healthy brain network; IQR = interquartile range; SD = standard deviation.
Table 5. Root Expected Mean Square Difference of Using a Proxy (SDQ) and Target (CBCL) Measure in BHRCS and HBN Samples Using Bifactor Model.
Note. Internalizing, attention, externalizing, and social problems are not applicable as meaningful constructs for the completely at-random model. Root expected mean square difference (REMSD) includes SDQ as proxy instrument and CBCL as target instrument. REMSD is the standardized factor score difference between using the proxy and the target measure. REMSD = root expected mean square difference; SDQ = strengths and difficulties questionnaire; CBCL = child behavior checklist; NLP = natural language processing; BHRCS = Brazilian high-risk cohort study for mental conditions; HBN = healthy brain network; IQR = interquartile range; SD = standard deviation.
Discussion
Phenotypic data harmonization is crucial for the reproducibility and aggregation of different data sets. In the present study, we tested different ways to perform item-wise harmonization between two widely used questionnaires in child and adolescent psychiatry within correlated and bifactor model frameworks. We demonstrated that the between-questionnaire correlation for first-order factors (correlated model) and the p-factor (bifactor model) increased from moderate, under a completely at-random item selection process, to its highest level when an item matching strategy was used, especially for the expert-based harmonization approaches. The between-questionnaire factor score correlation derived from bifactor models’ specific factors changed from nonexistent, using random-selection strategies, to moderate, when using the 1:1 expert-based strategy. The p-factors were reliable regardless of the harmonization strategy adopted. Using the REMSD, we found that the factor score difference of using a proxy measure (SDQ) instead of a target measure (CBCL) was reduced when using harmonization strategies, but effect sizes were small. Thus, considerable differences in factor scores remained even after harmonization. Considering that the expert-based 1:1 semantic item harmonization strategy was the only one to provide invariant models across two different samples, the results indicate that this is the best strategy for item harmonization using CBCL and SDQ.
This study was the first to compare different item-wise harmonization strategies with two null strategies (i.e., selecting a random item pool) in factor models used to integrate mental health data (Gondek et al., 2021; Hoffmann et al., 2022). The factor correlation between CBCL and SDQ increased mildly from the null to semantic harmonization strategies for the first-order factors (correlated factor models) and p-factors. Moreover, these correlations were little affected by factor reliability, as demonstrated by disattenuated correlations. Part of the moderate correlation between CBCL and SDQ might be due to the reverse-coded scoring used in some SDQ questions, especially for the attention and social problems dimensions, which presented the lowest level of between-questionnaire correlation across harmonization strategies and samples. However, in the bifactor models, the between-questionnaire correlation for randomly harmonized specific factors was null and increased using harmonization strategies, especially for attention-specific factors. Nonetheless, disattenuated correlations demonstrated that low correlations are most likely due to the low reliability of the specific factors from the bifactor models. This might indicate that, when using a bifactor model, item harmonization strategies affect factor reliability when integrating data sets that use different questionnaires. As for the difference between using a proxy instead of a target measure (i.e., how much a study would lose from using a different measure in its sample), item harmonization strategies reduce this difference from a null strategy to a semantic harmonization strategy. Still, effect sizes are small for factors derived from correlated and bifactor models.
Although studies in developmental psychiatry and psychology have assumed the compatibility between constructs assessed by different questionnaires, other studies have examined statistical approaches to test whether different questionnaires are measuring the same phenotype (Bauer & Hussong, 2009; Curran & Hussong, 2009; Gondek et al., 2021; McElroy et al., 2021). For example, the Closer-UK has aimed to harmonize unidimensional constructs, such as psychological distress, by using different questionnaires within and between British cohorts. The main strategy has been to harmonize items from different assessment tools by semantic content, evaluated by independent raters (Gondek et al., 2021; McElroy et al., 2021). This expert-based semantic item harmonization strategy has also been used in several IDA studies.
Curran et al. (Curran & Hussong, 2009; Curran et al., 2008) have proposed using item response theory (IRT) to harmonize different assessment tools, encompassing similar and unique items in the same harmonized model. Other methods are also possible, such as generalized linear and moderated nonlinear factor analysis, which is appropriate for including different scaling types (Bauer & Hussong, 2009), and the alignment method, which is appropriate for estimating group-specific means and variances without requiring complete measurement invariance (B. Muthén & Asparouhov, 2014). A bifactor integration model has also been proposed to generate a single phenotypic score while taking the specificities of different questionnaires into account (Luningham et al., 2019, 2020). These studies provide evidence for harmonizing assessment tools when the resulting harmonized item pools are equal in the number of items for each questionnaire. Other methods do not require an equal number of items, such as the IRT method proposed by Curran et al. (2008) and the bifactor strategy proposed by Luningham et al. (2019). Examining how these strategies would apply and behave in such analytical frameworks is beyond this study’s scope. Nevertheless, we provide evidence that, at the moment, the item selection strategies used by these studies are probably better than some alternative strategies, such as NLP and item parceling.
The NLP algorithms used in the present study are based on vectors trained with search databases, such as Google (Reimers & Gurevych, 2019). NLP has been used in many applications to date, as mentioned earlier (Beam et al., 2021; Grossman Liu et al., 2021; Kjell et al., 2019). However, rather than finding similarities in responses from participants, this study applied NLP to find similarities between questions, an application that has not been much examined or compared with other strategies to date. Here we found that the selected NLP approach for item-wise harmonization did not perform better than the expert-based process. Future NLP algorithms could be fine-tuned to harmonize mental health questionnaires by training the sentence embeddings vector space with similar and dissimilar sentences (i.e., items). The performance of such algorithms may not be better than the expert-based process. Still, they could perhaps increase reproducibility, as they do not depend on the researchers’ ratings (which, being human, are prone to many biases), and they could save time. Moreover, tailored NLP algorithms could have advantages in terms of harmonization scalability, as they could match items of multiple questionnaires simultaneously.
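The idea of matching questions by embedding similarity can be illustrated with a small sketch. It is not the procedure used in this study: the item labels and three-dimensional vectors are invented for illustration (real sentence embeddings would come from a trained model and have hundreds of dimensions), and the greedy pairing rule is one simple assumption about how 1:1 matches could be selected.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_one_to_one(target_items, proxy_items):
    """Pair items 1:1, taking the most similar remaining pair first."""
    candidates = sorted(
        (
            (cosine(t_vec, p_vec), t_name, p_name)
            for t_name, t_vec in target_items.items()
            for p_name, p_vec in proxy_items.items()
        ),
        reverse=True,
    )
    used_target, used_proxy, matches = set(), set(), []
    for similarity, t_name, p_name in candidates:
        if t_name not in used_target and p_name not in used_proxy:
            matches.append((t_name, p_name, round(similarity, 3)))
            used_target.add(t_name)
            used_proxy.add(p_name)
    return matches


# Hypothetical item labels and toy embeddings (not real CBCL/SDQ items).
cbcl_items = {
    "cbcl_worries": [0.9, 0.1, 0.0],
    "cbcl_fights": [0.1, 0.9, 0.1],
    "cbcl_inattentive": [0.0, 0.2, 0.9],
}
sdq_items = {
    "sdq_many_worries": [0.8, 0.2, 0.1],
    "sdq_restless": [0.1, 0.3, 0.8],
}
matches = greedy_one_to_one(cbcl_items, sdq_items)
```

Because the proxy questionnaire has fewer items in this toy example, one target item remains unmatched, which mirrors the fact that 1:1 harmonization yields an item pool no larger than the shorter instrument.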
Some differences in factor correlation, factor reliability, and REMSD can also be attributed to differences between samples. For example, HBN, a help-seeking sample, presented higher factor correlation and reliability and lower REMSD in all item harmonization strategies for CBCL-SDQ factor models. Previous literature demonstrated that specific factors from bifactor models tend to be more reliable in clinical samples (Fernández de la Cruz et al., 2018; Watts et al., 2019), and we demonstrated that clinical samples might also show smaller differences from using a proxy instead of a target questionnaire. Thus, samples and questionnaires should be considered a relevant source of discrepancies between harmonized scores (Fernández de la Cruz et al., 2018; Levin-Aspenson et al., 2021; Watts et al., 2019).
Several limitations of our study should be acknowledged. First, we used a help-seeking sample and a community sample enhanced for high family risk for psychopathology. These results might change when using low-risk community samples. However, it is not likely that these differences would be substantial. Second, we used cross-sectional data throughout; the homotypic stability of constructs might not hold when comparing different assessment tools at different time points. Third, we tested one high-performance all-purpose NLP algorithm; tailored NLP algorithms could provide different results. Fourth, it is not possible to determine whether the results attributed to expert-based item matching were due to expertise, as we did not test lay-person harmonization.
These results demonstrate that, within the same sample, the between-questionnaire correlation of first-order factors and p-factors is probably driven by a positive manifold, as item harmonization strategies had little impact on factor correlation and REMSD compared with null strategies (items harmonized at random). However, the 1:1 expert-based item-wise strategy to harmonize CBCL and SDQ outperformed other item harmonization strategies, as it resulted in invariant models in both samples, and it is recommended as the most parsimonious strategy to further aggregate different data sets using CBCL and SDQ. Our study was the first to compare different item-wise harmonization strategies against a random process in factor models. Furthermore, we found that the factor score differences of using different measures are substantial but sample-dependent. Future studies could develop NLP algorithms fine-tuned for mental health assessment tools, which could help reveal the upper bound of what is possible in terms of different questionnaires providing the same score (i.e., close-to-zero REMSD) and how much of any difference (non-zero REMSD) is due to test–retest measurement error.
Supplemental Material
Supplemental material, sj-docx-1-asm-10.1177_10731911231163136 for An Evaluation of Item Harmonization Strategies Between Assessment Tools of Psychopathology in Children and Adolescents by Maurício Scopel Hoffmann, Tyler Maxwell Moore, Luiza Kvitko Axelrud, Nim Tottenham, Pedro Mario Pan, Eurípedes Constantino Miguel, Luis Augusto Rohde, Michael Peter Milham, Theodore Daniel Satterthwaite and Giovanni Abrahão Salum in Assessment
Footnotes
Methodological Disclosure
We report that our sample size provides enough power to detect differences in factor analysis, and we report all data inclusions, all manipulations, and all measures in the study.
Declaration of Conflicting Interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Luis Augusto Rohde has received grant or research support from, served as a consultant to, and served on the speakers’ bureau of Aché, Bial, Medice, Novartis/Sandoz, Pfizer/Upjohn, and Shire/Takeda in the last 3 years. The ADHD and Juvenile Bipolar Disorder Outpatient Programs chaired by Dr Rohde have received unrestricted educational and research support from the following pharmaceutical companies in the last 3 years: Novartis/Sandoz and Shire/Takeda. Dr Rohde has received authorship royalties from Oxford Press and ArtMed. Other author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was, and the RBC initiative is, supported by the United States National Institutes of Health grant R01MH120482-01. BHRCS was supported with grants from the National Institute of Developmental Psychiatry for Children and Adolescents (INPD) (Grants: CNPq 465550/2014-2, FAPESP 2014/50917-0).
