Abstract
Construct, method, and item bias are three levels of measurement bias (i.e., internal bias) that must be addressed for valid group comparisons. Although many studies focus on only one level of bias, an integrated perspective on bias is still missing, especially in longitudinal designs. The aim of this study is to address bias in an integrated manner, using four waves of data in the U.K. Longitudinal Household Panel Survey. Responses to the General Health Questionnaire (GHQ-12) from natives and two generations of immigrants were used to analyze the three levels of bias. While the basic structure of the GHQ-12 was stable across groups and time, item and method bias decreased with repeated administrations. Results were confirmed with a sensitivity test. The integrated results allowed for a distinction between temporal sources of bias that became smaller over time and sources affecting valid comparisons persistently. We discuss the implications for mental health assessment.
In cross-group comparative studies, assessment of measurement bias has become imperative to assure equivalence of responses across groups (Van de Vijver & Matsumoto, 2011). When bias is present, group scores are affected by nuisance factors that jeopardize the comparability of the intended construct (Van de Vijver & Tanzer, 2004). In the past decades, improvements in theories and statistical procedures have stimulated research to better understand bias and its causes. The aim of the present study is to understand the nature of bias and to distinguish between temporal and systematic elements behind bias by investigating three levels of measurement bias (construct, method, and item bias) in a longitudinal study. The methodological goal of the study is to provide researchers with a strategy to combine different bias analyses, to interpret outputs systematically so as to clarify how well an instrument captures the intended trait level in the target constructs among respondents from different groups, and to identify what exactly causes measurement bias across groups. A second, substantive goal is to reveal and explain bias in a frequently used mental health measure, namely, the General Health Questionnaire (GHQ), in a cross-group context. In the following, we first review research on bias, and then we illustrate the necessity and incremental value of elucidating bias in the GHQ in the target population.
Defining Measurement Bias and Levels of Measurement Bias
Prior to the use of a measure for screening or evaluation, both internal and external validity of the measure should be ensured. This study focuses on the assessment of internal validity with a thorough check of measurement bias. Measurement bias (hereafter simply bias) refers to the presence of systematic differences in the measurement instruments that do not have the same meaning within and across groups (Poortinga, 1989). Three levels of bias are considered important for understanding how measures may be evaluated across different groups (Van de Vijver, 2011): (a) construct bias, which occurs when the construct measured is not identical across groups; (b) method bias, which refers to systematic differences in sample characteristics, administration conditions, or instrument properties; and (c) item bias, which occurs when an item has a different psychological meaning across groups. These levels of bias are vital in establishing a comprehensive picture of bias.
Quantitative procedures are available to assess these three levels of bias (Van de Vijver & Leung, 2011). Construct bias has been assessed with confirmatory factor analysis (CFA), where the operational model defining the construct measured by the instrument is tested across groups (Koch, Schultze, Eid, & Geiser, 2014). Among method bias factors, the analysis of response styles is one of the most important indicators, especially in the use of Likert scales. Response styles are defined as the systematic tendency to overuse certain answer options on some basis other than the target construct (He & Van de Vijver, 2013). Item bias is analyzed by differential item functioning (DIF). DIF analyses identify situations in which individuals with the same level on the target construct have different probabilities of giving a particular response depending on their group membership (Millsap & Everson, 1993). The presence of DIF has been related to problematic terms or expressions causing a different understanding of the item (Allalouf, 2003; Elosúa & López-Jaúregui, 2007), as well as to items’ content and contextual factors mediating the participants’ interpretations (Benítez, Van de Vijver, & Padilla, 2014).
Only a few studies have examined bias at more than one of these levels. A study by Meiring, Van de Vijver, Rothman, and Barrick (2005) assessed bias at all three levels in cognitive and personality tests in South Africa. They identified the presence of idiomatic expressions, sentences with double meaning, and cultural specificity of certain constructs as sources of bias (especially at the construct level). More recently, Ashley et al. (2013) applied CFA and DIF analyses to investigate bias in responses of patients with different types of diseases to a questionnaire about their illness perception. These authors found an improvement on the equivalence at the construct level after removing biased items. However, another study by Roomaney and Koch (2013) reported that removing items with DIF did not result in better structural equivalence when assessing the English and Xhosa versions of a visual analogies scale. Bias at construct and item levels can also be evaluated simultaneously with structural equation modeling (SEM) approaches. However, as Byrne and Van de Vijver (2010) indicated, this strategy is limited because the nature of the nonequivalence is unclear; it is not known whether nonequivalence is due to individuals or accumulated differences in parameters. In addition, the SEM approach is designed to provide comparability results for all the groups involved at once and details about specific pair-wise comparisons are missing. Thus, a coherent picture of bias at construct and item levels is lacking.
DIF analyses have also been combined with results from analyses of response styles. Wetzel, Böhnke, Carstensen, Ziegler, and Ostendorf (2013) concluded that correcting for response styles reduces the number of items with DIF. In another study by Benítez, Van de Vijver, and Padilla (2017), results from DIF, response styles, and CFA were integrated for understanding differences between mainstream and immigrant groups in a health survey in Spain. This study identified diverse sources of bias, such as specific aspects of items (e.g., ambiguous terms) or group characteristics (e.g., the mother tongue of participants), as well as the relative contribution of each level of bias.
The above-mentioned examples speak for the benefit of assessing different sources of bias simultaneously. However, these cited studies were all single wave studies, providing only a snapshot of how assessment of bias may be integrated. To our knowledge, very little is known about how a combination of these levels of bias changes across time, which is relevant since temporal and persistent sources of bias cannot be extracted in a cross-sectional design. Extending the integration of bias assessment to a longitudinal design could further uncover the stability and changes of bias, which facilitates distinguishing temporal from persistent bias, and contributes to the avoidance of bias as much as possible.
Measurement Bias in Longitudinal Studies
Longitudinal studies provide the opportunity to learn about the changes and the stability of attributes across time (Koch et al., 2014). From a methodological perspective, longitudinal studies also represent an attractive scenario to interpret statistical phenomena across time points. In terms of construct bias, longitudinal investigations have demonstrated the importance of analyzing measurement invariance before formulating conclusions about changes in attributes across time (Koch et al., 2014). Miche, Elsässer, Schilling, and Wahl (2014) evaluated measurement equivalence in longitudinal data and confirmed the consistency of psychometric properties of the scale across groups over time.
Longitudinal designs also provide information about the stability of response styles across time. Weijters, Geuens, and Schillewaert (2010) concluded that response styles are stable individual characteristics, as they found convergence in response styles extracted from different measures across two assessment occasions. However, until now the longitudinal evaluation of response styles has been rare, probably due to lack of suitable data for such analysis.
In contrast, item bias has been more frequently addressed in longitudinal studies. Makransky, Warrer, Torsheim, and Currie (2014) analyzed DIF in a scale about family affluence in Norway and Scotland, in which they found that the score increase in family affluence across time was partly due to the presence of item bias. These authors defined item parameter drift as the DIF incurred when individuals at different time points have different scores on specific items, despite having equal levels of family affluence. They proposed to model DIF by assigning group-specific item parameters. Park, Pearson, and Reckase (2005) evaluated DIF across diverse cohorts in a longitudinal survey and found that short sentences were systematically more sensitive to the presence of DIF. Relevant findings were also provided by Naumann, Hochweber, and Hartig (2014), who combined DIF analyses with evaluations of items’ difficulty to identify the impact of teaching on students’ performance. They reported how the combined strategy can separate effects due to the presence of bias from the real effect of the instruction. However, research on DIF has been in general more successful in developing and improving statistical procedures for its detection than on explaining its origin. Although some efforts have been made in the past years to understand causes of DIF (Allalouf, 2003; Benítez & Padilla, 2014; Elosúa & López-Jaúregui, 2007), more studies are needed to confirm previous results and inform the development of DIF across time. Therefore, the present study intends to overcome this limitation by integrating construct, method (response style in this case), and item bias in the analyses of longitudinal data.
The Present Study: The GHQ Across Groups in a U.K. Panel
Studies on bias are necessary before an instrument is used for comparing different groups. Previous studies have demonstrated the impact of bias at different levels with various procedures, mainly with cross-sectional data. However, it has not yet been explained how levels of bias are related and how stable they are across time. Therefore, the aim of the present study is to showcase an integrated approach to exploring temporal and stable sources of bias across groups and across time. We intended to uncover aspects that provoke unintended measurement differences between groups and to connect them with temporary and/or stable elements across administrations. Specifically, we evaluated to what extent procedures for analyzing bias are complementary for understanding temporal and stable sources of bias. To this end, we applied procedures habitually used for each source of bias. Criteria for selecting the specific procedures are explained in the “Method” section.
We made use of the first four waves of Understanding Society: the United Kingdom household longitudinal study (https://www.understandingsociety.ac.uk/) coordinated by the Institute for Social and Economic Research at the University of Essex. This survey includes diverse measures for assessing health and quality of life (QoL) issues, administered through paper-based administrations during personal interviews in the first two waves, and through computerized administration in the latter two waves.
We used a QoL measure as it is a relevant topic, the evaluation of which is frequently included in international projects searching for differences between groups (e.g., European Values Study, European Quality of Life Survey, or World Values Survey). Among the QoL measures, the GHQ is one of the most frequently used tools applied in clinical contexts as a well-being indicator (Abubakar, Alonso-Arbiol, Van de Vijver, Murugami, Mazrui, & Arasa, 2013), for measuring positive mental health (Y. Hu, Stewart-Brown, Twigg, & Weich, 2007), and for detecting psychological disorders such as depression and anxiety (Baksheev, Robinson, Cosgrave, Baker, & Yung, 2011) or psychological distress (Davis, Galyer, Halliday, Fitzgerald, & Ryan, 2008). In addition, the GHQ is administered as part of diverse surveys around the world, such as the Australian National Survey of Mental Health and Well-Being (Australian Bureau of Statistics, 1997), the second Dutch National Survey of General Practice (Netherlands Institute for Health Services Research, 2001), the Spanish Health Survey (Spanish National Statistics Institute, 2006), or the Scottish Health Survey (Scottish Government, 2010). Therefore, the extensive use of the GHQ in general and clinical research fields prompts researchers to check bias in this tool first. Previous studies pointed to ambiguous response categories on negative items or the multiple scoring procedures as main bias sources (Rey, Abad, Barrada, Garrido, & Ponsoda, 2014).
We aim to investigate stable and temporal sources of bias of the GHQ in group comparisons. In terms of participants, Understanding Society provides detailed information about demographic characteristics of participants, which is relevant for placing the assessment of bias in a cross-group context. The degree to which minority group members fit in the larger society, described as the mental health of immigrants in a society, has been considered an important indicator of integration. A number of studies reported worse mental health among immigrants compared with mainstreamers in the United Kingdom (e.g., Huang & Spurgeon, 2006; Krause, Rosser, Khiani, & Lotay, 1990). Within immigrant groups, second-generation immigrants seem to report lower mental health compared with first-generation immigrants (Nandi, Luthra, & Benzeval, 2016); whereas other studies found underestimation of mental distress among minority group members attributed to cultural influences on conceptualizations and expressions of distress (Williams, Eley, Hunt, & Bhatt, 1997). The mixed findings could be due to bias in the mental health measure across groups, as immigrants and natives may have different styles in presenting themselves in survey responding (He & Van de Vijver, 2013). Therefore, studying bias longitudinally in different groups of immigrants will provide insight into the types of elements provoking differences between them.
Specifically, three levels of bias are evaluated following an integrated approach. Measurement invariance as a check for construct bias, response styles as an indicator of method bias, and DIF as item bias were analyzed across four waves of the survey in comparing participants who were natives and who were first- or second-generation immigrants.
The sequence of bias evaluation (i.e., construct, method, and item bias) follows the logic of moving from more general to more specific aspects of bias, as the more specific aspects of bias (i.e., item bias) hinge on the congruence of the items on the construct, and all items are uniformly affected by response styles (i.e., method bias). Results from the three separate analyses were integrated to learn about both stable and temporary sources of bias. A sensitivity study was later conducted, using a different categorization of group memberships, to test the robustness of the previous results and to complement interpretations. Finally, with the convergence of findings from the main study and the sensitivity check, we draw robust conclusions on the development of bias across time and groups. Remedies to avoid or eliminate bias are provided.
Method
The study was designed to showcase the integration of different bias assessments to uncover/avoid bias and enhance comparability of data across groups. Two sets of analyses were performed using the same methodologies with the different categorizations of group memberships. First, in the main study participants were categorized by their status as native inhabitants, first-, or second-generation immigrants. In the sensitivity study, the comparison was made according to participants’ ethnocultural group memberships. As both sets of analyses resulted in very similar patterns of bias, we report the main study and provide the details of the second set of analyses in Appendix 1.
Participants and Procedure
A total of 102,679 participants were involved in the first four waves of the Understanding Society project. However, due to the design of the project, not all the participants responded to all the scales in every wave. The GHQ-12 questionnaire was administered to 16,050 respondents across all four waves, and these participants were hence selected to be part of the present study. The logic behind that decision was to ensure that four waves of responses were present for all the participants and that all the participants included had exactly the same experience in terms of exposure to the GHQ-12, so that different administrations of the GHQ-12 were capturing comparable changes.
Of the 16,050 participants, 12,730 were classified as natives, whereas the remaining respondents were classified either as first-generation immigrants (n = 1,736) or as second-generation immigrants (n = 1,584). According to the panel, natives were those originally from the United Kingdom, or descended from migrants who arrived three or more generations ago; first-generation immigrants were born in a different country and immigrated to the United Kingdom; and second-generation immigrants were born in the United Kingdom with at least one parent born outside the United Kingdom. To balance the sample size in each group, a random subsample of 1,800 participants was selected from the native sample. Sample sizes were balanced across the three groups to reach optimal conditions and increase the power of the bias detection method, as shown in the simulation study by Carvajal and Skorupski (2010). Participants with missing responses were removed from the data set after confirming the absence of associations between demographic characteristics of participants and the fact that they had not responded to items in the scale. This brought the sample for the current study down to 5,120 participants.
Table 1 presents the demographic characteristics of participants in the study, which were comparable across groups.
Table 1. Demographics of Participants.
Note. Educational level was divided into low (primary school or lower), intermediate (up to secondary school), and high (university or beyond).
Instruments
We used the GHQ-12, originally designed by Goldberg (1972), to assess mental health problems at the community level and in nonpsychiatric settings. The GHQ-12 is a well-established measure that comprises 12 items evaluating mild psychiatric disturbance through changes in affective and somatic symptoms. Respondents are asked to answer statements about how they have been feeling recently on a 4-point Likert-type scale. The GHQ-12 consists of six negatively worded items rated from 1 (Not at all) to 4 (Much more than usual), and six positively worded items rated from 1 (Better/More so than usual) to 4 (Much less than usual). Poor mental health is indicated by higher scale scores.
Analyses
Analyses were conducted in different phases. First, dimensionality and reliability were tested to ensure that the measure had adequate general psychometric properties. Then, bias at the construct, method, and item levels was evaluated with different procedures: Each analysis was independent, and findings were interpreted in an integrated manner later. Choosing different procedures to evaluate different levels of bias instead of using the SEM approach ensures that each analysis informs one single and pure source of bias, and helps identify how nonequivalence appears at each level. Moreover, results from the four waves were examined independently for method and item bias (by contrast, the construct was checked across all four waves, as this informs whether method and item bias are indeed affecting the same construct across waves). In this way, we can obtain information on how specific bias changed across time and what kind of elements affected the presence of bias within each source (construct, item, and method). Finally, commonalities between the three levels of bias analyses were checked in two different ways: looking at the three sources for each wave, and looking at changes that occurred across time. This systematic review allowed for the extraction of integrated conclusions about elements provoking bias and their stability across time.
Preliminary Analysis
Before conducting bias analyses, the dimensionality and reliability of the scale were assessed, as unidimensionality is a prerequisite for conducting construct and item bias analysis. Unidimensionality was accepted when the percentage of explained variance of a one-factor solution was above 40% (Carmines & Zeller, 1979). We also checked the scree plot, and results confirmed that the GHQ-12 is unidimensional. In addition, analyses of construct bias (described below) provided evidence on unidimensionality. Cronbach’s alpha values equal to or greater than .70 were considered appropriate in terms of reliability of the scale (Nunnally, 1987). According to discussions about the GHQ-12 dimensionality in the literature, other assumptions were tested in the following sections.
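The reliability criterion above can be illustrated with a minimal sketch of Cronbach's alpha, computed as k/(k − 1) × (1 − sum of item variances / variance of total scores). The function name and the toy data are hypothetical and serve only as an illustration; this is not the software used in the study.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    Illustrative helper (hypothetical name), using sample variances.
    """
    k = len(rows[0])
    # Variance of each item across respondents
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Variance of the summed scale score
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Toy data (made up): four respondents answering three items
toy = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 3, 4]]
alpha = cronbach_alpha(toy)
acceptable = alpha >= 0.70  # threshold stated in the text
```

With real data, each row would hold one respondent's 12 GHQ-12 item scores for a given group and wave.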
Construct Bias
We assessed measurement invariance as an indication of lack of construct bias in multigroup confirmatory factor analysis (MCFA), using robust maximum likelihood estimation in Mplus (Muthén & Muthén, 2007) in a single model across all four waves. Essentially, the type of modeling used in this analysis is referred to as a latent state model, which is the simplest model for assessing changes in the state across time, and in the case of this study is moderated by groups (Koch et al., 2014). This single model was considered more appropriate than individual evaluations within waves, as it provides information about the construct measured across groups and across time. Therefore, the model informs about whether or not the same construct is being captured across evaluations.
The unidimensional model originally proposed for the GHQ-12 was tested. As we considered nested models, we assessed the significance of the chi-square (χ2), which indicates how well the models fit the data. The first criterion to determine the presence of bias across groups was the difference in the χ2 between successive models in the hierarchical set; the change in χ2 values should not be significant from a less restrictive to a more restrictive model. Accepting more restrictive models allows establishing higher levels of equivalence across groups (for further information, see Van de Vijver, 2011). In this study, the decrease in the normed χ2 was used to avoid the impact on χ2 values habitually found for large sample sizes. The normed χ2 divides the χ2 by the degrees of freedom (df). Values of normed χ2 between 2 and 5 indicate reasonable model fit, with a value below 2 indicating good fit (Bollen, 1989).
In addition, the comparative fit index (CFI) was used as it is not sensitive to sample sizes in nested models and therefore provides the best indication of the relative improvement of a proposed model against previously estimated models. Generally, values >.90 are considered acceptable, with values >.95 considered good for the baseline and subsequent models (L. Hu & Bentler, 1999). The more restricted model is considered good if the change of CFI is .01 or less (Milfont & Fischer, 2010). Finally, we used predictive fit indices, that is, the Akaike information criterion (AIC; Akaike, 1987) and the Browne–Cudeck criterion (BCC; Browne & Cudeck, 1989), to compare the nested models. That is, when comparing models, the model with the lowest AIC and BCC is considered more parsimonious (Kline, 1998).
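The decision rules for the normed χ2 and the change in CFI can be sketched as follows. The function names, the label for values above 5, and the numeric inputs are hypothetical illustrations and are not values from Table 3.

```python
def normed_chi_square(chi2, df):
    """Normed chi-square: chi-square divided by its degrees of freedom."""
    return chi2 / df

def fit_label(normed):
    """Label fit using the cutoffs in the text (below 2 good, 2 to 5 reasonable)."""
    if normed < 2:
        return "good"
    if normed <= 5:
        return "reasonable"
    return "outside the reasonable range"  # label chosen for this sketch

def restricted_model_acceptable(cfi_less_restricted, cfi_more_restricted, max_drop=0.01):
    """True if CFI drops by .01 or less when constraints are added."""
    drop = cfi_less_restricted - cfi_more_restricted
    return drop <= max_drop + 1e-9  # small tolerance for floating-point error

# Hypothetical fit statistics, for illustration only
label = fit_label(normed_chi_square(chi2=900.0, df=300))               # 3.0
ok = restricted_model_acceptable(0.952, 0.947)                         # drop of .005
```

In the sequence configural, metric, scalar, the second function would be applied to each pair of successive models.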
Figure 1 illustrates the model estimates across groups and measurement occasions. In the configural model assessing the similar structure across groups and time, we tested a model with a single factor where all items loaded on this factor. We allowed error variances to reflect the item dependencies across groups and waves (e.g., e1 was correlated with e13, e13 with e25, and e25 with e37). We correlated all negatively worded items within each wave, taking into consideration the possible impact of negatively worded items across time (indicated by the staggered lines in Figure 1). This is referred to as correlated traits and correlated methods (Motl & DiStefano, 2009). In the metric invariance model, we set the regression weights for each item to be equal across groups and time. In the scalar invariance model, we set item intercepts to be equal across conditions (Milfont & Fischer, 2010).

Figure 1. Estimates in models across groups and times.
Response Styles (Method Bias)
One source of method bias was evaluated through extreme response style in the GHQ-12 measure. Even though response styles are only one indicator of method bias, we believe that in this Likert-type scale measure, they can provide insight into instrument-related bias. An extreme response style index was extracted from the 12 items in the GHQ-12 in each wave. The original responses of 1 and 4 on these GHQ-12 items were recoded as 1 (an indication of extreme response style), and original responses of 2 and 3 were recoded as 0 (absence of extreme response style). The mean score of the 12 recoded items was taken as the extreme response style index. Although previous studies on extreme response style revealed that it varies with response formats (e.g., more response options elicit lower levels of extreme response style), extreme response style has also been shown to be relatively stable across item content domains and across time (Weijters et al., 2010). We therefore believe that the extreme response style index extracted from the 12 items can represent the general tendency to use the end points of response options. Because it averages the 12 recoded item scores, the index has sufficient variation and approximates an interval variable.
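The recoding described above can be sketched in a few lines. The function name and the example response vector are hypothetical; only the recoding rule (1 and 4 become 1, 2 and 3 become 0, then averaged over the 12 items) comes from the text.

```python
def extreme_response_index(item_scores, low=1, high=4):
    """Proportion of responses at the scale end points (1 or 4 on the GHQ-12)."""
    recoded = [1 if score in (low, high) else 0 for score in item_scores]
    return sum(recoded) / len(recoded)

# Hypothetical respondent: 12 answers on the 4-point scale, four of them extreme
answers = [1, 4, 2, 3, 2, 2, 3, 3, 1, 2, 3, 4]
ers = extreme_response_index(answers)  # 4 / 12
```

The index ranges from 0 (no end-point responses) to 1 (only end-point responses), in steps of 1/12.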
Differential Item Functioning (Item Bias)
DIF procedures can be roughly split up into three types, depending on whether the procedures use contingency tables (simplest models), SEM (linear models), or item response theory (logistic models) (Ferne & Rupp, 2007; Lai, Teresi, & Gershon, 2005; Van de Vijver & Leung, 2011). Current developments in DIF detection seem to suggest that item response theory-based approaches produce the best estimates; therefore, in this study we applied the Liu–Agresti estimator, which has proved to be appropriate when analyzing polytomous items (Penfield & Algina, 2003). To extract information about DIF across response alternatives, we followed a differential step functioning framework, as implemented in the DIFAS program (Penfield, 2005).
The presence (significance) of DIF was determined with values of the standardized Liu–Agresti cumulative common log-odds ratio (LOR Z) estimators of the common odds ratio across all k strata, which were deemed significant if the value was smaller than −2 or larger than 2 (Penfield, 2005). The total scores on the GHQ-12 were used as the matching variable when making pair-wise comparisons between a reference and a focal group. All the possible pairs were compared across waves. Analyses were carried out with native participants as the reference group when they were part of the pair-wise comparison; in the other comparisons, the second-generation immigrant group was the reference group.
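To make the estimator more concrete, the sketch below computes the unstandardized Liu–Agresti cumulative common log-odds ratio from raw responses, stratifying on the matching variable. It is a simplified illustration rather than a reproduction of DIFAS: the function name and the group labels "ref" and "focal" are hypothetical, the standard error needed for LOR Z is omitted, and the sign of the result depends on which group is placed first and on how categories are cumulated, which is an assumption here.

```python
import math
from collections import defaultdict

def liu_agresti_log_odds(responses, groups, matching, n_categories):
    """Liu-Agresti cumulative common log-odds ratio (unstandardized).

    responses: item scores coded 1..n_categories
    groups:    "ref" or "focal" for each respondent (hypothetical labels)
    matching:  stratum for each respondent (e.g., the total scale score)
    """
    # counts[stratum][group][category index]
    counts = defaultdict(lambda: {"ref": [0] * n_categories,
                                  "focal": [0] * n_categories})
    for y, g, s in zip(responses, groups, matching):
        counts[s][g][y - 1] += 1

    num = den = 0.0
    for table in counts.values():
        ref, foc = table["ref"], table["focal"]
        n_ref, n_foc = sum(ref), sum(foc)
        if n_ref == 0 or n_foc == 0:
            continue  # a stratum with only one group carries no information
        n_k = n_ref + n_foc
        cum_ref = cum_foc = 0
        # Collapse the ordinal item at every cut point j = 1..J-1
        for j in range(n_categories - 1):
            cum_ref += ref[j]
            cum_foc += foc[j]
            num += cum_ref * (n_foc - cum_foc) / n_k
            den += (n_ref - cum_ref) * cum_foc / n_k
    if num == 0 or den == 0:
        raise ValueError("log-odds ratio is undefined for these data")
    return math.log(num / den)
```

For a two-category item and a single stratum, the estimate reduces to the ordinary log-odds ratio of the 2 × 2 table, which is a convenient sanity check.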
The Liu–Agresti common log-odds ratio DIF effect estimator was used to identify effect sizes at the item level (Miller, Chahine, & Childs, 2010). Effect sizes were interpreted following the ETS classification: An absolute value lower than .43 was taken as small DIF, an absolute value between .43 and .64 as moderate DIF, and an absolute value larger than .64 as large DIF (Zieky, 1993). In addition, the estimator provides information about the direction of DIF, with positive values indicating DIF in favor of the reference group and negative values indicating DIF in favor of the focal group.
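The flagging and classification rules can be summarized in a small helper. The function name, the returned dictionary, and the example values are hypothetical; the cutoffs (|LOR Z| > 2, and .43 and .64 for the effect size) are those stated in the text.

```python
def classify_dif(lor, lor_z):
    """Apply the flagging and effect-size rules to one item.

    lor:   Liu-Agresti common log-odds ratio (effect size)
    lor_z: standardized estimate used for the significance decision
    """
    flagged = abs(lor_z) > 2
    magnitude = abs(lor)
    if magnitude < 0.43:
        size = "small"
    elif magnitude <= 0.64:
        size = "moderate"
    else:
        size = "large"
    if lor > 0:
        favors = "reference"
    elif lor < 0:
        favors = "focal"
    else:
        favors = "neither"
    return {"flagged": flagged, "size": size, "favors": favors}

# Hypothetical values for one item in one pair-wise comparison
result = classify_dif(lor=0.50, lor_z=2.5)
```

Keeping the significance decision and the effect-size label separate mirrors the text, where an item can be flagged even though its effect size is small.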
Results
Evidence on reliability and dimensionality of GHQ-12 was obtained across groups and waves. Table 2 shows Cronbach’s alpha values and percentages of explained variance on the first factor across the 12 conditions.
Table 2. Reliability and Dimensionality Indexes.
As Table 2 shows, both reliability and dimensionality indicated the adequate psychometric properties of the scale. In addition, in the scree plot one dominant factor emerged: The first three eigenvalues in the principal component analysis in Wave 1 were 5.448, 1.212, and 0.811; in Wave 2 were 5.838, 1.229, and 0.752; in Wave 3 were 5.694, 1.201 and 0.814; and in Wave 4 were 5.820, 1.199, and 0.783, respectively. Therefore, requirements for meaningfully analyzing bias were satisfied.
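As a rough check on the 40% criterion, the first eigenvalues quoted above can be converted into proportions of explained variance. The sketch assumes that the principal component analysis was run on the correlation matrix of the 12 items, so that the eigenvalues sum to 12; this assumption is not stated in the text, and the per-group values in Table 2 may differ from these wave-level figures.

```python
N_ITEMS = 12  # assumption: PCA on the correlation matrix, so eigenvalues sum to 12

# First eigenvalue per wave, as quoted in the text
first_eigenvalues = {
    "Wave 1": 5.448,
    "Wave 2": 5.838,
    "Wave 3": 5.694,
    "Wave 4": 5.820,
}

# Proportion of total variance explained by the first component
explained = {wave: ev / N_ITEMS for wave, ev in first_eigenvalues.items()}

# True when every wave exceeds the 40% criterion
all_above_criterion = all(p > 0.40 for p in explained.values())
```

Under this assumption the first component accounts for roughly 45% to 49% of the variance in each wave.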
Construct Bias
As Table 3 shows, the scalar invariant model was the most parsimonious model. This indicated that the basic structure of the model, the factor loadings, and intercepts were comparable across the different groups. The acceptance of the scalar invariance model confirmed that the GHQ-12 was a unidimensional construct (taking into consideration the intercorrelations between negatively worded items), and the structure and metrics were comparable across time and groups.
Table 3. Measurement Equivalence Across Groups Across Four Waves.
Note. TLI = Tucker–Lewis index; CFI = comparative fit index; RMSEA = root mean square error of approximation; SRMR = standardized root mean square residual; AIC = Akaike information criterion; BIC = Bayesian information criterion.
p < .001.
Method Bias: Response Styles
Before analyzing response styles, scales were recoded as explained in the Analyses section. The recoded scale had fairly high reliability, with Cronbach’s alpha values of .74, .76, .76, and .76 in Waves 1 to 4, respectively. Then four multivariate analyses were carried out with the extreme response style indexes as dependent variables and group membership as the independent variable. The overall test was significant; Wilks’s Lambda (8, 10224) = .99, p < .01, although the differences across groups were extremely small, with η2 of .004. Table 4 presents the group means of extreme response style in each wave.
Table 4. Group Mean Comparison of Extreme Response Style in Each Wave.
Note. Means with different subscripts are significantly different in Bonferroni post hoc comparisons.
As Table 4 shows, significant differences between groups occurred only in the first two waves but these group differences became smaller across time. Specifically, in the first two waves, there were significant differences between first-generation immigrants and the other two groups, whereas the difference was no longer evident in the following waves. Therefore, extreme response style did not seem to have a severe impact on cross-group comparisons in the GHQ-12 measure in the U.K. household panel data.
Item Bias: Differential Item Functioning
Items flagged as having DIF were determined with the LOR Z values. Table 5 presents effect size values obtained for each item when comparing natives with first- and second-generation immigrants across the four waves. Values in bold indicate items that were flagged as having DIF. The group presented first in each column acted as the reference group.
Table 5. DIF Results Across Waves and Migration Generational Groups.
Note. Values in bold indicate items having DIF. DIF = differential item functioning; N = natives; F = First generation of immigrants; S = Second generation of immigrants.
As Table 5 shows, the number of DIF items decreased across waves, although some items were repeatedly flagged. In all the comparisons, the number of DIF items decreased with repeated administrations, particularly when comparing natives and first-generation immigrants. The number of DIF items differed across comparisons, with the largest number of items affected by bias when comparing natives and second-generation immigrants. In this comparison, the number of items flagged in the last wave was still large (four items). Effect sizes were always small, and DIF either remained at the same level or decreased across waves; it never increased.
In addition, some consistent patterns could be identified in Table 5. First, Item 3 (“playing useful part”) was never flagged as having DIF across groups; thus, it seemed to be the best item when comparing natives and immigrants of different generations. In contrast, Item 6 (“overcome difficulties”) was flagged in all the comparisons in all the waves, giving it the largest number of detections. Regarding the direction of DIF, Item 6 always favored the reference group (i.e., it was more strongly endorsed by the reference group). There were also items that appeared to be biased for specific groups. For instance, Item 4 (“making decisions”) mainly disfavored second-generation immigrants, whereas Items 5 (“feel under strain”) and 9 (“feeling unhappy”) were flagged only when comparing first- and second-generation immigrants.
To summarize, Item 6 showed the highest number of differences when comparing groups across waves, followed by Items 10 and 4. These three items were flagged as having DIF across waves, and the DIF persisted through the last two waves. Items 5 and 9 appeared problematic mainly when comparing first- and second-generation immigrants, but the differences disappeared after the second wave.
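The logic behind a log odds ratio Z statistic can be illustrated with a short sketch. The example below computes a Mantel-Haenszel common log odds ratio for a dichotomized item across matching-score strata, divides it by its standard error (using the Robins-Breslow-Greenland variance estimator), and returns the resulting Z value. This is a simplified stand-in under stated assumptions: the GHQ-12 items are polytomous, this section does not specify the estimator or the flagging cutoff behind the reported LOR Z values, and the function names, the input layout, and the 1.96 cutoff are hypothetical choices made here for illustration.

```python
import math


def lor_z(strata):
    """Mantel-Haenszel log odds ratio and its Z value.

    strata: list of (a, b, c, d) counts, one tuple per matching-score level, where
      a = reference group endorsing the item, b = reference group not endorsing,
      c = focal group endorsing the item,     d = focal group not endorsing.
    """
    sum_r = sum_s = 0.0
    sum_pr = sum_ps_qr = sum_qs = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        r, s = a * d / n, b * c / n
        p, q = (a + d) / n, (b + c) / n
        sum_r += r
        sum_s += s
        sum_pr += p * r
        sum_ps_qr += p * s + q * r
        sum_qs += q * s
    if sum_r == 0 or sum_s == 0:
        raise ValueError("log odds ratio is undefined for these counts")
    lor = math.log(sum_r / sum_s)
    # Robins-Breslow-Greenland variance of the Mantel-Haenszel log odds ratio.
    var = (sum_pr / (2 * sum_r ** 2)
           + sum_ps_qr / (2 * sum_r * sum_s)
           + sum_qs / (2 * sum_s ** 2))
    return lor, lor / math.sqrt(var)


def is_flagged(z, cutoff=1.96):
    """Flag an item when |Z| exceeds the cutoff (assumed value, not from the study)."""
    return abs(z) > cutoff
```

With this layout, a positive log odds ratio means that the reference group endorses the item more strongly than the focal group at the same matching-score level, which corresponds to the direction of DIF described for Item 6.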
Discussion
The study had two aims. Methodologically, we showcased an integrated approach to bias assessment that provides a framework for comprehensively assessing bias at the construct, method, and item levels, as well as the development of bias across time. Substantively, the analysis of bias in the GHQ across various groups in four waves of the U.K. panel data allows us to formulate general conclusions about the nature and characteristics of bias in this measure. We first discuss the methodological merits, and then we zoom in on the findings for the GHQ, which can contribute to better assessment of mental health using this measure in future research.
The Methodological Merits of the Integrated Approach
The importance of assessing bias and ensuring comparability of data before drawing any comparative inferences can hardly be overstated. The extensive experience accumulated in cross-cultural research in the past decades has provided a host of psychometric tools to assess bias in cross-sectional studies (e.g., Van de Vijver & Leung, 1997), and such tools can also be applied in longitudinal studies (e.g., Bowers et al., 2010). This study proposed to assess bias at the construct level (using MCFA), the method level (checking response styles), and the item level (studying DIF in an item response theory–based approach) across groups with longitudinal data, and to integrate the analyses to shed light on (a) elements related to the presence of bias in each source, (b) elements that remained stable across time, and (c) elements with different developments. Such analyses can reveal the nature and the stability of elements causing bias. Our approach provides a new framework for systematizing results from different sources, which facilitates understanding where bias is located in the measurement instrument and how stable it is. Researchers are encouraged to make full use of longitudinal data and systematically study measurement bias before comparing groups.
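One simple way to systematize item-level results across waves is to reduce each item's flagging history to a category. The sketch below is only an illustration of that idea: the three labels, the rule that a flag in the final wave counts as persistent, the function names, and the example flag patterns are all assumptions introduced here, and the example values are made up rather than taken from Table 5.

```python
def classify_flag_pattern(flags):
    """Classify one item's DIF flags, given in chronological wave order.

    Assumed rule: no flag in any wave -> "never flagged";
    flagged in the final wave -> "persistent";
    flagged earlier but not in the final wave -> "transient".
    """
    if not flags:
        raise ValueError("at least one wave is required")
    if not any(flags):
        return "never flagged"
    if flags[-1]:
        return "persistent"
    return "transient"


def summarize(items):
    """Group item labels by category; items maps a label to its list of wave flags."""
    summary = {"never flagged": [], "transient": [], "persistent": []}
    for label, flags in items.items():
        summary[classify_flag_pattern(flags)].append(label)
    return summary


# Made-up flag patterns for three hypothetical items across four waves.
example = {
    "item_a": [False, False, False, False],
    "item_b": [True, True, False, False],
    "item_c": [True, True, True, True],
}
print(summarize(example))
```

The same reduction could be applied to wave-by-wave results at the method level, so that temporary and persistent sources of bias are read off in a common format.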
Bias Detection in the GHQ
Based on our analysis, there is evidence that the GHQ-12 measures the same construct across groups and time. As the MCFA results indicate, the construct is essentially invariant; however, bias at the method and item levels may still need to be investigated further. At the method and item levels, a general trend of reduction of bias with repeated measurements was observed. In the case of method bias, the cross-group differences in extreme response style disappeared after the second wave when comparing natives and first- and second-generation immigrants; when comparing ethnocultural groups, these differences were reduced across waves but were still present in the last wave (see details in Appendix 1). At the item level, the number of items with DIF became smaller across all pairwise comparisons in later waves. This suggests that more accurate and valid comparisons can be expected in later administrations than in initial ones and, therefore, that potential cross-group differences on the target constructs may be more clearly established in later waves.
Our central finding is the stability of the construct and the decrease of response style differences and item bias across time. We speculate on several reasons for this decrease. First, the decrease of bias may be attributable to learning effects over time (Ackerman, 1987). Addressing the task repeatedly might affect the understanding of the content and improve performance over time. For instance, Items 5, 9, and 12 (see Table 5), flagged as having DIF in the first waves, were never flagged in the last one, which might be because respondents had reached a similar understanding of the items by the last wave. Second, it is possible that acculturation processes promote equivalent interpretations of items across groups. The reduction of bias may indicate that minority groups became more adapted to the mainstream culture across time, so that fewer differences in the understanding of items are detected in later waves. Third, language difficulties may lie behind differences in the interpretation of items. Even though all the participants have U.K. citizenship, differences in language comprehension could still produce differences in interpretation. In this case, the decrease of bias could be explained by improved language proficiency after some years of residing in the country. Finally, the change in administration mode could have contributed to the reduction of bias from the second to the third wave. Because the GHQ-12 was self-completed on paper in an interview during the first two waves and as an online survey during the last two waves, the drop in bias at the method and item levels could mean that the electronic mode produces fewer differences across groups.
Despite the general diminishing of bias over time, we also found items in which bias was never present and items with stable bias that resisted the passage of time. On the one hand, Item 3 was never flagged when comparing participants according to their migration group, and there was only one detection when comparing ethnocultural groups. This item asks participants if they perceive themselves as playing a useful part in life. On the other hand, Items 4 and 6 were consistently found to be affected by DIF (Item 10 was consistently detected in the main study but not in the sensitivity study, where DIF was present only when comparing the White and Asian groups). Item 6, which asks participants whether they had felt that they “couldn’t overcome [their] difficulties,” is a negatively worded item with a negated auxiliary verb (couldn’t), which makes the task more difficult for participants. Numerous studies have advised against using negations, as negations impose additional cognitive load and make the judgment problematic (Weijters & Baumgartner, 2012). Indeed, Item 6 was previously identified as problematic in comparisons between mainstreamers and immigrant groups who responded to the GHQ-12 as part of a health survey in Spain (Benítez et al., 2017). Therefore, the negation in this item stem, a characteristic not shared by any other item, could be responsible for the systematic DIF detection across time and groups. Regarding Item 4, no distinctive characteristics were observed, as it is a positively worded item asking about the ability to make decisions. It is possible that the types of decisions considered differed across groups, causing differences in the interpretation process.
In addition, Item 9 (“feeling unhappy”) was systematically detected when comparing first- and second-generation immigrants, which echoes the wide debate about the definition of “happiness,” a concept that seems to be understood differently by respondents of different cultural backgrounds (Uchida, Norasakkunkit, & Kitayama, 2004). It is possible that immigrants of different generations have rather different standards for happiness. Item 12 (“feeling happy”) was also connected to happiness, and it exhibited item bias between natives and both groups of immigrants, but only in the first wave. These results indicate that abstract concepts such as happiness are fluid and subject to idiosyncratic interpretations, probably also influenced by the acculturation processes and experiences of immigrants. In summary, the content area assessed seems to play a role in the presence of bias, as items asking about feelings and cognitive tasks were more often flagged as having DIF than those asking about behaviors.
Structural characteristics of items were also relevant, but they did not influence DIF in the expected direction. Item wording was considered a potential threat in previous studies, based on the discussion about positively and negatively worded items in the GHQ-12 (Weijters & Baumgartner, 2012). However, the effect of the structural characteristics was mainly related to the presence of explicit negations in the item stem. The impact of item length proposed by Park, Pearson, and Reckase (2005) was not confirmed, possibly because the GHQ-12 items did not differ much in length. Nevertheless, there is a clear indication that negation is problematic, and items containing negations should be revised to minimize bias.
Implications on Mental Health Assessments
This study has important implications for mental health assessments across groups. It illustrates how bias across groups can initially appear as a consequence of different degrees of adaptation to specific contexts. This initial bias is alleviated with time, probably because of learning or acculturation processes in minority groups, which lead such groups to become more similar to the natives. Nevertheless, there are sources of bias that are not neutralized by the passage of time. This study shows how the “surviving bias” is related to “subjective” estimations and to inadequate understanding of negative constructions in items. Therefore, the study indicates that not all items of the GHQ-12 are equally useful when comparing groups. To resolve this issue, changes could be applied to adapt content to more “observable” issues and to avoid combinations of negations, negative wording, and reversed response options. Removing biased items is also a common approach; however, it would not be recommended in the case of the GHQ-12 for several reasons: (a) the intended construct of the GHQ-12 is adequately captured in the structural sense and is measured equally across groups, as the construct bias analyses have shown; (b) each item contributes to the measurement of the construct, so removing items would lead to an incomplete representation of this construct; and (c) removing items with DIF does not always result in better structural equivalence, as Roomaney and Koch (2013) explained. In any case, the GHQ-12 seems to be an appropriate tool for assessing the intended construct, so modifications should focus on making items more equivalent. From an applied perspective, this study helps researchers and professionals to understand the advantages and limitations of the GHQ-12. For instance, new formulations for problematic items can be proposed and tested, and conclusions can be based on equivalent items when comparing groups.
Conclusions and Limitations
The aim of the paper was to follow an integrative approach to assess bias across groups and across time. We conducted separate analyses at the construct, method, and item levels with the four waves of the Understanding Society Survey. Three sets of analyses were presented: CFA, response style construction and comparison, and an item response theory–based DIF analysis. Researchers can make use of these psychometric tools and integrate results to better understand bias in their target measures. Our results indicate that the construct of the GHQ appears to be stable and that bias in terms of response style and DIF is reduced across time. The decrease in response style differences and DIF may be explained by learning effects, acculturation processes, language proficiency, and administration mode. The persistent bias seems to result from complexity in item wording, especially negation in the item stem. Therefore, adaptation of the problematic GHQ items is needed.
Limitations of the study include methodological decisions made while preparing and analyzing data. Only participants responding to all four waves were selected, and analyses were based on the immigration and ethnocultural group memberships. The deletion of cases with missing data might have introduced some selection bias; however, this decision was necessary to study the longitudinal development of bias. Moreover, measurement bias at the three levels is believed to be more prominent than potential selection bias. Another limitation is that the three levels of bias were assessed only on the GHQ-12; therefore, the generalization of the findings awaits future research. Future research can extend the comparisons to other constructs, other instruments, and more diverse groups. Once internal validity is ensured, links with external measures for predictive validity should be checked. In addition, qualitative procedures can be applied to uncover the communicative and cognitive processes behind the presence of bias. In terms of the GHQ-12, future studies can focus on proposing a new version in which elements causing bias are modified to obtain more comparable measures across groups while maintaining sufficient coverage of the construct.
Footnotes
Acknowledgements
The research leading to these results has received support under the European Commission’s Seventh Framework Programme (FP7/2013-2017) under Grant Agreement No. 312691, InGRID—Inclusive Growth Research Infrastructure Diffusion.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by a visiting grant from the Integrating Expertise in Inclusive Growth (InGRID). Grant references: c17-13, c17-16, and c17-17.
