Abstract
One major threat to revealing cultural influences on psychological states or processes is the presence of bias (i.e., systematic measurement error). When quantitative measures are not targeting the same construct or they differ in metric across cultures, the validity of inferences about cultural variability (and universality) is in doubt. The objectives of this article are to review what can be done about it and what is being done about it. To date, a multitude of useful techniques and methods to reduce or assess bias in cross-cultural research have been developed. We explore the limits of invariance/equivalence testing and suggest more flexible means of dealing with bias. First, we review currently available established and novel methods that reveal bias in cross-cultural research. Second, we analyze current practices in a systematic content analysis. The content analysis of more than 500 culture-comparative quantitative studies (published from 2008 to 2015 in three outlets in cross-cultural, social, and developmental psychology) aims to gauge current practices and approaches in the assessment of measurement equivalence/invariance. Surprisingly, the analysis revealed a rather low penetration of invariance testing in cross-cultural research. Although a multitude of classical and novel approaches for invariance testing is available, these are employed infrequently rather than habitually. We discuss reasons for this hesitation, and we derive suggestions for creatively assessing and handling biases across different research paradigms and designs.
The truth on this side of the Pyrenees, error on the other side.
In all areas of psychological inquiry and other social sciences, culture-comparative research is growing in popularity. This popularity may be steered by globalization, internationalization in teaching, or intercultural awareness. As can be seen in Figure 1, there has been a steep increase in “cross-cultural” research over the last decades. Cultural influences on our behavior, feeling, and thinking are considered profound and cross-cultural differences are thought to be of sizable magnitude. Research is often driven by assumptions of differences with designs aimed at detecting differences in responses across samples from different cultures. However, score differences across cultures can be severely misinterpreted if comparability is lacking. Reported cross-cultural similarities and differences may be questioned without a solid methodology to demonstrate that systematic measurement error or measurement biases are not confounded with targeted cross-cultural variations.

Figure 1. Google Scholar and PsycINFO hits since 1990.
Comparisons require assurance of measurement comparability. Otherwise, apples are compared with oranges and comparative conclusions are on shaky ground. A multitude of methodological techniques has been suggested in the past decades providing guidelines for conclusive research within the quantitative domain. These techniques help establish measurement comparability to draw valid inferences about cross-cultural differences and rule out alternative explanations due to bias. As the prerequisite for any cross-cultural comparisons, the demonstration of measurement comparability has been promoted and advocated for many years (e.g., Steenkamp & Baumgartner, 1998; Vandenberg & Lance, 2000; van de Vijver & Leung, 1997). However, Figure 1 shows that the search terms that capture cross-cultural comparability (i.e., “measurement invariance,” “equivalence,” and “bias” in addition to the generic term “comparability”) receive little attention despite the increased interest in cross-cultural research over time.
This article reviews the available psychometric tools to test measurement comparability, assesses how well these tools are being used, and presents ways forward to ensure systematic measurement errors (i.e., bias) are not confounded with targeted cross-cultural similarities and differences. We first introduce the framework of bias and equivalence/invariance and an overview of available methodological techniques to test measurement comparability. We then illustrate the current utilization of these tools in published cross-cultural psychological research from 2008 to 2015. We close with suggestions and “new-generation” developments in bias detection and research designs that can help future research improve cultural comparability.
The State-of-the-Art in Assessing Cross-Cultural Comparability
What Are Bias and Equivalence and Why Are These Important for Cross-Cultural Contributions?
Cultural phenomena can be studied in various ways: from an indigenous (e.g., Yang, 2000), cultural (e.g., Shweder & Sullivan, 1993), and cross-cultural perspective. The cross-cultural perspective refers to research devoted to explaining similarities and differences across cultures in affect, cognition, and behavior. However, this field is riddled with uncritical acceptance of observed differences and overgeneralization as consequences of insufficient attention paid to measurement properties. The framework of bias and equivalence serves as a backbone for a rigorous methodology in cross-cultural research.
The taxonomy of bias
Bias refers to systematic errors in measurement that threaten the validity of cross-cultural research (van de Vijver & Leung, 1997, 2000; van de Vijver & Poortinga, 1997; van de Vijver & Tanzer, 2004). A comparison across cultures is biased when it is not possible to accurately interpret the observed differences. In other words, differences found and conclusions drawn may be meaningless. Three types of bias can be distinguished: construct bias (theoretical concept), method bias (sampling, instrument, and administration bias), and item bias (e.g., differential item functioning [DIF]). In the following, we briefly describe each bias and relate them to comparability (i.e., invariance and equivalence 1 ).
Construct bias
Construct bias is present if the underlying construct measured is not the same across cultures (see He & van de Vijver, 2012, for an overview). It can occur if a construct is differently defined or only has a partial overlap across cultural groups. For example, creativity has different connotations cross-culturally. It is understood as a process of sudden insight in the West, but as a long process that requires more mental effort in East Asian cultures (Dahlin & Watkins, 2000). Another example would be the varying definitions of happiness in Western and East Asian cultures (Uchida, Norasakkunkit, & Kitayama, 2004): In Western cultures, happiness tends to be defined in terms of individual achievement, whereas in East Asian cultures happiness is defined in terms of interpersonal connectedness. In such cases, assessing creativity or happiness requires researchers to take culture-specific aspects into consideration and acknowledge that the construct overlaps only partially across cultures.
Method bias
Three types of method bias can be identified depending on the source of incomparability: the sample, the instrument, or the administration. Sample bias is the incomparability of samples due to cross-cultural variations in characteristics, such as different educational levels, students versus the general population, and urban versus rural residents. Instrument bias involves systematic errors derived from instrument characteristics such as self-report bias in Likert-type scale measures. The systematic tendency of respondents to endorse certain response options on some basis other than the target construct (i.e., response styles) may affect the validity of cross-cultural comparisons (van Herk, Poortinga, & Verhallen, 2004). Administration bias stems from administration conditions (e.g., data collection modes, group versus individual assessment), ambiguous instructions, interaction between administrators and respondents (e.g., halo effects), and communication problems (e.g., language differences, taboo topic). In summary, method biases are prone to have a global impact on cross-cultural differences, and if not appropriately taken into consideration, score differences introduced from method bias can be misinterpreted as targeted genuine cross-cultural differences.
Item bias
Item bias, also labeled as DIF, occurs when an item has a different meaning across cultures. An item of a scale is biased if persons with the same target trait level, but coming from different cultures, are not equally likely to endorse the item (van de Vijver & Leung, 1997; van de Vijver, 2013). Item bias can arise from poor translation, inapplicability of item contents in different cultures, or from items that trigger additional traits or have words with ambiguous connotations. For example, Pan, Wong, Chan, and Joubert (2008) had to exclude items regarding religion as a protective factor when studying resilience among mainland Chinese students, because these items were much less meaningful for this secular group compared with respondents from cultures emphasizing religiosity. Item bias is the most extensively studied form of bias and its analysis is conducted to establish scalar invariance (see below; van de Vijver & Leung, 2011).
The taxonomy of equivalence/invariance
Partially corresponding to types of bias, equivalence reflects the level of comparability across cultures. Hence, different levels of equivalence can be identified (van de Vijver & Leung, 1997). Measurement invariance is an alternative term for equivalence. The meaning of invariance is “whether or not, under different conditions of observing and studying phenomena, measurement operations yield measures of the same attribute” (Horn & McArdle, 1992, p. 117). van de Schoot and colleagues (2013) stated that the measurement structures of the latent factors and the items should be invariant to make meaningful comparisons in cross-cultural research. Following Fontaine (2005), we distinguish four levels of equivalence. Table 1 summarizes levels of equivalence/invariance and associated biases and analytical strategies.
Table 1. Overview of the Different Types of Equivalence/Invariance, Their Sources of Bias, and Analytical Procedures.
Note. MDS = multidimensional scaling; EFA = exploratory factor analysis; CFA = confirmatory factor analysis; ESEM = exploratory structural equation modeling; IRT = item response theory; DIF = differential item functioning; SEM = structural equation modeling.
Functional equivalence indicates that the construct of interest has the same psychological meaning across cultures. The related bias is construct bias. For example, take a rain coat and an umbrella. Both have the same function, which is to shelter from rain, but they take on completely different forms. There is no direct statistical assessment for psychometric evidence on functional equivalence. Instead, indirect assessments may be utilized including theoretical analysis, document analysis, qualitative research, or nomological networks. These procedures should be used for a culturally appropriate development of psychological measurements.
Structural equivalence/configural invariance is reached if the theoretical construct is associated with the same observed variables, allowing the assessment of a construct using the same items across cultures. Related biases are construct bias (e.g., problems of domain underrepresentation if a construct is not fully covered by the content of a measurement) and method bias. Associated analyses to establish this level of equivalence are exploratory factor analysis (EFA)/principal components analysis (PCA) with procrustean target rotation, multidimensional scaling (MDS) with Generalized Procrustes Analyses (GPA), and multigroup confirmatory factor analytical procedures as well as alignment and exploratory structural equation modeling (ESEM) as newer analytical approaches (for details, see below).
Measurement unit equivalence/metric invariance means the measurement units are the same across cultures. In technical terms, it indicates factor loadings are comparable across cultures. Method bias (e.g., different levels of stimulus familiarity) and item bias can jeopardize this level of equivalence. This level of equivalence is also called weak invariance. Associated analyses are EFA/PCA in combination with DIF analysis, and multigroup confirmatory factor analytical procedures (including Bayesian structural equation modeling [SEM]), as well as alignment and ESEM.
Full score equivalence/scalar invariance indicates the measurement scale has the same point of zero (intercept) in different cultures. Lack of full score equivalence can be caused by method bias (e.g., response styles) and item bias. It can be established using different approaches, including classical techniques like multigroup confirmatory factor analysis (MGCFA) or novel approaches such as Bayesian SEM, alignment, and ESEM. This level of equivalence is also called strong invariance.
Relevance for subsequent analyses
Established equivalence has important implications for what can or cannot be compared across cultures. Structural equivalence constitutes the core element within the larger equivalence framework (Fischer & Fontaine, 2011): if the structure is different across cultures, the likelihood is very high that constructs are qualitatively different and cannot (and should not) be compared. With metric invariance, the association of variables can be compared across cultures (e.g., unstandardized regression weights between creativity and conformity can be compared across cultures, if both scales reach metric invariance). Only scalar invariance allows valid comparisons of scale mean scores across cultures, whether by t tests, multivariate ANOVAs, SEM with mean structures, or multilevel analyses (van de Vijver & Leung, 1997). Empirically, the higher the level of equivalence/invariance, the more difficult it is to establish.
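The hierarchy described above can be written down as a small lookup. The following is only an illustrative sketch: the function names and labels are hypothetical, and the mapping simply restates the rules given in the text (structure at the configural level, associations from the metric level, means only at the scalar level).

```python
# Hypothetical helper restating the comparison rules from the text.
# Levels are ordered from weakest to strongest.
LEVELS = ("none", "configural", "metric", "scalar")

# What each level adds to the set of defensible comparisons.
ADDED_BY_LEVEL = {
    "none": [],
    "configural": ["factor structure"],
    "metric": ["associations (e.g., unstandardized regression weights)"],
    "scalar": ["scale mean scores"],
}


def permitted_comparisons(level):
    """Return all comparisons supported by the highest established level."""
    if level not in LEVELS:
        raise ValueError(f"unknown invariance level: {level!r}")
    reached = LEVELS[: LEVELS.index(level) + 1]
    return [item for lvl in reached for item in ADDED_BY_LEVEL[lvl]]


def can_compare_means(level):
    """Mean comparisons require scalar invariance."""
    return "scale mean scores" in permitted_comparisons(level)
```

For instance, permitted_comparisons("metric") lists structure and associations but not scale means, which mirrors the restriction on t tests and ANOVAs stated above.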
Review of Classical Procedures
We now briefly introduce the most common “classical” procedures to test equivalence in cross-cultural studies 2 (e.g., Braun & Johnson, 2010). The choice of psychometric tools depends on the type of data (i.e., ordinal, interval level data), the aim of the research, and the analysis that follows. In general, these reviewed procedures are mostly used for multiple-item latent variables measured in surveys, but they should not be limited to such variables because they have the potential to benefit other types of data (e.g., multiple single items, designs measuring behaviors, and stimuli in experiments).
EFA with orthogonal procrustean target rotation
EFA (or PCA) is best used for culture-sensitive scale development and item selection for exploratory purposes, especially when the underlying structure of a construct is not clear. This procedure requires continuous, interval scaled data. The use of EFA or PCA (and various other dimensionality-reducing techniques) to study equivalence is based on a simple reasoning: identical constructs are measured in all groups if the structure of an instrument, as examined by these techniques, is the same across cultures (van de Vijver & Leung, 1997). Identity of factors (or dimensions) is taken as sufficient evidence for structural equivalence. Comparisons of multiple cultures can be conducted either in a pairwise or in a one-to-all manner. In the latter case, each culture is compared with the overall combined sample, which is particularly useful if many cultures are included in the analysis, making pairwise comparisons cumbersome. Procrustean target rotations are necessary to compare the structure across cultures, and the similarity of the target-rotated structures can then be compared via indices of factor congruence. One indicator is Tucker’s phi coefficient, which provides an estimate of the extent to which factor structures are identical across cultures. Values of the coefficient above .90 are usually considered to be adequate and above .95 to be excellent (van de Vijver & Leung, 1997). EFA/PCA can provide a quick, straightforward check on structural/configural invariance, and it has been applied to various constructs across cultures, including large-scale comparisons of personality structure (Schmitt, Allik, McCrae, & Benet-Martínez, 2007).
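To make the congruence check concrete, the following minimal sketch computes Tucker's phi for one factor and applies the .90 and .95 benchmarks mentioned above. It assumes the loadings have already been target-rotated toward the reference structure (the rotation step itself is not shown), and the function names are hypothetical.

```python
from math import sqrt


def tucker_phi(loadings_a, loadings_b):
    """Tucker's congruence coefficient for one factor in two groups.

    Both arguments are loadings of the same items on the same factor,
    assumed to be target-rotated already.
    """
    if len(loadings_a) != len(loadings_b):
        raise ValueError("both groups must have loadings for the same items")
    cross = sum(a * b for a, b in zip(loadings_a, loadings_b))
    norm = sqrt(sum(a * a for a in loadings_a) * sum(b * b for b in loadings_b))
    if norm == 0:
        raise ValueError("congruence is undefined for all-zero loadings")
    return cross / norm


def congruence_label(phi):
    """Apply the benchmarks cited in the text (> .95 and > .90)."""
    if phi > 0.95:
        return "excellent"
    if phi > 0.90:
        return "adequate"
    return "below the usual cutoff"


# Made-up loadings for illustration only.
culture_a = [0.70, 0.65, 0.80, 0.60]
culture_b = [0.68, 0.70, 0.75, 0.20]
print(congruence_label(tucker_phi(culture_a, culture_b)))
```

Because phi is insensitive to a uniform rescaling of the loadings, a high value supports structural equivalence only; it does not show that loadings are equal in size.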
MDS with GPA
MDS (Borg & Groenen, 2005; Fischer & Fontaine, 2011) places items into an x-dimensional space, which generates a graphical display of the interrelations between items; items that are close together in space have a similar meaning and items that are placed far away from each other have dissimilar meanings. This procedure, as a flexible and less restrictive analytical tool, can be applied to binary, ordinal, and continuous data. Depending on the input data, the MDS can be run assuming either nonmetric (ordinal level) or metric (interval level) data. Once the number of dimensions is established, the coordinates within the same dimensionality for each cultural group are calculated. Structural equivalence can then be tested after the coordinates have been rotated to be maximally comparable using programs such as GPA (Commandeur, 1991; Fontaine, 2003). GPA rotates the MDS configurations from all cultural groups in such a way that they maximally resemble one another, which then allows statistical comparison of the dimensions (e.g., using correlation coefficients or Tucker’s phi). The fit index (centroid configuration) threshold is similar to the usual cutoff of above .90 applied in EFA with orthogonal procrustean target rotation (i.e., Tucker’s phi). An advantage of this method is that it has very few restrictions. However, it does not go further than evaluating the overall structural similarity and does not provide information on individual items, which makes it rather descriptive. It has been applied in a large-scale data set (Liu et al., 2012) to determine the underlying dimensions of the meaning of historical events across 30 societies.
MGCFA
The conventional MGCFA is a theory-driven, confirmatory approach (Bollen, 1989; Byrne, 1994; Muthén, 1989). It is by far the most frequently used approach and provides a rigorous test of configural, metric, and scalar invariance. The MGCFA approach is a multistep approach in which different models (with different constraints) are tested and compared with each other. A baseline model first tests for configural invariance (no equality constraints; items exhibit the same configuration of loadings in each cultural group). Then, the metric invariance model with factor loadings constrained to be equal across cultural groups is tested against the baseline model. Finally, the scalar invariance model with item intercepts constrained to be equal across groups is tested against the metric invariance model. Further constraints such as equal error variance can be imposed on the data, but may not be essential for evaluating cross-cultural invariance.
Vandenberg and Lance (2000) provided an overview of the fit indices of MGCFA developed for two to three group comparisons. They recommended using the ratio of chi-square to degrees of freedom (<3.00), root mean square error of approximation (RMSEA; ≤.080), standardized root mean square residual (SRMR; ≤.080), Tucker–Lewis index (TLI, also called NNFI; >.90), and comparative fit index (CFI; >.90). Importantly, when comparing increasingly restricted models, the benchmark for ΔCFI is –.01; a more restricted model whose CFI drops by more than this should be rejected (cf. Vandenberg & Lance, 2000). Recently, to accommodate large-scale international assessments, more liberal cutoff criteria were introduced for comparisons of more than 10 cultures (see Rutkowski & Svetina, 2014; cutoff for RMSEA = .10, ΔCFI = .02, and ΔRMSEA = .03 from configural to metric invariance model, and both ΔCFI and ΔRMSEA within .01 from metric to scalar invariance model).
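The decision rules above can be sketched as two small checks: one for the absolute fit of a single model, and one for the change in CFI between nested models. This is an illustrative sketch, not a replacement for an SEM package; the function names are hypothetical, the fit values would come from whatever software estimated the models, and the numbers in the usage lines are made up.

```python
def check_fit(chi2, df, rmsea, srmr, tli, cfi, rmsea_cutoff=0.080):
    """Check one model against the cutoffs quoted in the text.

    rmsea_cutoff can be raised to 0.10 for the more liberal criteria
    proposed for comparisons of many cultures.
    """
    return {
        "chi2/df < 3.00": chi2 / df < 3.00,
        "RMSEA <= cutoff": rmsea <= rmsea_cutoff,
        "SRMR <= .080": srmr <= 0.080,
        "TLI > .90": tli > 0.90,
        "CFI > .90": cfi > 0.90,
    }


def constraint_tenable(cfi_less_restricted, cfi_more_restricted, max_drop=0.01):
    """True if the CFI drop caused by the added constraints is acceptable.

    A tiny tolerance guards against floating-point noise at the boundary.
    """
    drop = cfi_less_restricted - cfi_more_restricted
    return drop <= max_drop + 1e-9


# Hypothetical CFI values for the configural, metric, and scalar models.
cfi = {"configural": 0.962, "metric": 0.957, "scalar": 0.931}
metric_ok = constraint_tenable(cfi["configural"], cfi["metric"])
scalar_ok = metric_ok and constraint_tenable(cfi["metric"], cfi["scalar"])
print(metric_ok, scalar_ok)
```

In this made-up case the metric model would be retained and the scalar model rejected, so only associations, not means, would be compared.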
Despite its utility and rigor in detecting different biases and establishing equivalence, the MGCFA has been criticized for being too strict in the model restrictions for cross-cultural research (see van de Schoot et al., 2013). Particularly, the constraints for scalar invariance may be overly strict and unrealistic, especially in comparisons involving dozens of cultures (Lubke & Muthén, 2014). Byrne, Shavelson, and Muthén (1989) introduced partial measurement invariance as an alternative. Partial invariance means that only a subset of parameters (factor loadings and/or item intercepts) is constrained to be invariant, and the other subset of parameters is allowed to vary across countries. Consequently, the invariant subset can be compared across cultures (Byrne et al., 1989). Unfortunately, there is limited evidence on the cutoff proportion of noninvariant items that can be released in the partial invariance approach.
When a large number of cultures are involved, multilevel confirmatory factor analysis (CFA) can also be used to detect bias and test measurement invariance across cultures (see Fischer, 2009; Fontaine, 2008). The model constraints are straightforward compared with the MGCFA. 3 Jak (2017) illustrated that scalar invariance across cultures can be confirmed with equal factor loadings across levels and zero residual variance at the culture level in a two-level factor model. Moreover, in such models, country-level variables can be added to explain differences in the common or residual factors (for instance, to explain the noninvariance).
DIF
DIF analysis is a family of tools that aim to detect item bias. DIF focuses on item-level analysis and is used for developing new measures, adapting existing measures to other contexts, or for validating test scores (Zumbo, 1999). DIF analysis can be used as a follow-up test after structural equivalence has been assessed in EFA or MDS. Quantitatively, DIF can be modeled with item responses via contingency tables (i.e., Mantel–Haenszel statistics) and regression models (i.e., logistic regression with binary responses). The common ground for these methods is to match and compare groups by conditioning items on the total score. In contrast to the other introduced approaches, which are based on classical test theory, DIF analysis can also be based on item response theory (IRT). IRT is a latent variable modeling approach that evaluates differences in response probability between cultures, conditional on some measure of the latent dimension. DIF detection is most often carried out with unidimensional constructs and typically compares a reference group with a focal group. Various extensions have been developed and refined to accommodate multidimensional constructs and more cultural groups (e.g., Hartig & Höhler, 2008; McDonald, 2000).
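As an illustration of the contingency-table route, the sketch below computes the Mantel–Haenszel common odds ratio for a single binary item, matching respondents on their total score. The function name and the record layout are assumptions made for this example, and the log-odds rescaling by –2.35 follows the widely used ETS delta convention rather than anything specified in this article.

```python
from collections import defaultdict
from math import log


def mantel_haenszel_dif(records):
    """Return (alpha_MH, delta_MH) for one binary item.

    records: iterable of (group, matching_score, endorsed), where group is
    "reference" or "focal" and endorsed is 0 or 1. Respondents with the
    same matching_score form one stratum.
    """
    # Cells per stratum: [ref endorsed, ref not, focal endorsed, focal not].
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, endorsed in records:
        if group not in ("reference", "focal"):
            raise ValueError(f"unknown group: {group!r}")
        row = 0 if group == "reference" else 2
        column = 0 if endorsed else 1
        strata[score][row + column] += 1

    numerator = denominator = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if numerator == 0 or denominator == 0:
        raise ValueError("odds ratio undefined for these data")

    alpha = numerator / denominator   # 1.0 means no DIF
    delta = -2.35 * log(alpha)        # negative: item favors the reference group
    return alpha, delta
```

An odds ratio near 1 indicates that, among respondents with the same total score, both groups are about equally likely to endorse the item. A full analysis would add a significance test, which this sketch omits.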
Critical Reflection on Classical Equivalence Tests and New Approaches
Conventional methods for measurement invariance tests have several drawbacks that may impede an adequate utilization of cross-cultural data. EFA and MDS are mainly useful to check construct bias and establish the very first level of equivalence; yet, there is no information on individual item performance. DIF analyses can detect misfit in the performance of items, usually between a reference and a focal group; yet, they cannot sufficiently deal with construct and method bias. Moreover, applying different DIF analysis methods to the same data set can lead to conflicting conclusions about some of the items, adding confusion about results (Morales, van de Vijver, & Poortinga, 2014). Once an item is flagged as having DIF, it is difficult to know the reason for misfit without further qualitative or mixed-methods investigation (Benítez & Padilla, 2014). In a nutshell, these approaches (i.e., EFA, MDS, and DIF) do not address all sources of bias and may be overly lenient in enabling scale score comparisons.
At the other end of the spectrum, MGCFA, as the most rigorous approach that can detect all three sources of bias, may be too strict. The constraints on equal factor loadings and intercepts in the scalar invariance model are stringent and sometimes unattainable in comparisons involving multiple groups (Lubke & Muthén, 2014). Even a trivial, slight deviation from one or more parameters in any group would signal a lack of scalar invariance. Moreover, fit indices in MGCFAs do not tell whether the lack of invariance is caused by major model misspecification that can lead to erroneous conclusions or by minor misspecifications that do not have severe consequences for comparability (Byrne & van de Vijver, 2010; Oberski, 2014). Another problem with CFA is the imposition of a causal model that does not fit the psychological reality of psychological constructs, as most psychological phenomena are complex and hard to fit in a straightforward factor model.
All in all, these approaches have some merits and disadvantages. When lack of invariance is encountered, these psychometric methods do not advance our understanding of the source of the incomparability or facilitate further use of the data, without further research on the sources of the bias. We briefly introduce three recent approaches that do not assume the exact same loadings and intercepts across cultures and, hence, can be coined as more flexible and open for minor variations across cultures in various measurement aspects.
ESEM
ESEM (Asparouhov & Muthén, 2009; Marsh, Morin, Parker, & Kaur, 2014) is especially helpful when dealing with many groups where strong levels of invariance are often unattainable, and goodness-of-fit indices are weak. ESEM is an extension of the CFA approach. According to Marsh and colleagues (2010), ESEM integrates the best aspects of the CFA/SEM and the EFA approaches. The use of SEM in the exploratory approach means that ESEM can draw on confirmatory tests of a priori factor structures, associations with latent factors, and multigroup tests of full measurement invariance (Marsh et al., 2014). The advantage of ESEM is that it is less restrictive: in a conventional CFA, each item loads on only one factor, whereas in ESEM items are allowed to load on all factors. ESEM can be used for interrelated or independent factors by modeling oblique or orthogonal factor structures (Bowden, Saklofske, van de Vijver, Sudarshan, & Eysenck, 2016). ESEM is a fairly novel procedure that has not yet been often applied in content-related research; yet, it appears promising, and we would welcome further testing of its applicability in future research.
Bayesian approximate invariance
Bayesian approximate invariance testing is another promising approach to remedy some of the problems of conventional MGCFA. Instead of constraining the parameters of loadings and/or intercepts to be exactly the same across groups, this approach allows small differences in these parameters across groups (Muthén & Asparouhov, 2012; van de Schoot et al., 2013). The underlying rationale is that absolute invariance is unattainable and slight variations may not severely hinder comparability. Hence, a valid comparison can still be achieved. In two simulation studies (Muthén & Asparouhov, 2012; van de Schoot et al., 2013) involving a two-group and a 10-group comparison, respectively, appropriate model specifications (also called priors) that admit a certain degree of flexibility were proposed. In operational terms, pairwise differences in each parameter (loadings and/or intercepts) across groups can be modeled to follow a normal distribution with a mean of zero and a very small variance (.01 or .05).
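To give a sense of how much wiggle room such priors allow, the short sketch below converts a prior variance into the range that covers roughly 95% of the prior mass for a cross-group parameter difference. It is plain arithmetic on the normal distribution; the function name is hypothetical, and the calculation describes the prior only, not the posterior obtained after fitting a model.

```python
from math import sqrt


def prior_95_interval(variance, z=1.96):
    """Central 95% range of a zero-mean normal prior with the given variance."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    half_width = z * sqrt(variance)
    return (-half_width, half_width)


# The two prior variances mentioned in the text.
for variance in (0.01, 0.05):
    low, high = prior_95_interval(variance)
    print(f"variance {variance}: differences mostly within {low:.2f} to {high:.2f}")
```

A variance of .01 corresponds to a standard deviation of .10, so a priori most differences in a loading or intercept are expected to stay within about ±.20; a variance of .05 widens this to about ±.44.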
Several applications with such prior specifications have been reported in multiple group comparisons (see, for example, Bujacz, Vittersø, Huta, & Kaczmarek, 2014; Cieciuch, Davidov, Schmidt, Algesheimer, & Schwartz, 2014; Davidov et al., 2015; He & Kubacka, 2015; Zercher, Schmidt, Cieciuch, & Davidov, 2015). We know that ignoring the lack of invariance may lead to biased comparative research results (Guenole & Brown, 2014). We do not know yet in the case of approximate invariance, whether the relaxed constraints would bias the parameters and distort the findings. In a few studies, it was shown that factor scores derived from a conventional CFA and the approximate approach were very similar (Davidov et al., 2015; Zercher et al., 2015). Simulation studies that examine the impact of different priors and the consequences of demonstrating approximate invariance are in much need (van de Schoot et al., 2013).
Alignment
Alignment is a third promising approach to estimate group-specific factor means and variances without requiring full measurement invariance (Asparouhov & Muthén, 2014). In a sense, alignment can be viewed as exploratory, where it incorporates a simplicity function similar to the rotation criteria used in EFA to discover the most optimal measurement invariance pattern (i.e., the simplest model with the fewest noninvariant parameters) and to estimate the factor mean and variance parameters in each group. This approach has been tested in multigroup CFA models with maximum likelihood as well as in Bayesian estimation, and it has been extended to IRT modeling (Muthén & Asparouhov, 2014).
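The role of the simplicity function can be illustrated with a deliberately reduced toy case. The sketch below rests on several assumptions that go beyond the text: two groups, one factor, all loadings fixed at 1, factor variances fixed at 1, and the first group's factor mean fixed at 0. The component loss (x² + ε)^(1/4) follows the form commonly described for alignment, with ε set to .01 purely for illustration. All names are hypothetical, and a plain grid search stands in for the actual optimizer.

```python
def component_loss(x, eps=0.01):
    """Loss that rewards differences near zero and tolerates a few large ones."""
    return (x * x + eps) ** 0.25


def aligned_mean_difference(intercept_gaps, step=0.01, eps=0.01):
    """Grid-search the second group's factor mean in the toy setup.

    intercept_gaps: per-item intercept differences (group 2 minus group 1)
    from a model in which both factor means were fixed at zero. With all
    loadings equal to 1, choosing a factor mean `alpha` leaves a remaining
    intercept difference of (gap - alpha) for each item.
    """
    low, high = min(intercept_gaps), max(intercept_gaps)
    n_steps = int(round((high - low) / step))
    candidates = [low + i * step for i in range(n_steps + 1)]

    def total_loss(alpha):
        return sum(component_loss(gap - alpha, eps) for gap in intercept_gaps)

    return min(candidates, key=total_loss)


# Made-up case: three items shift by .5, one item does not shift at all.
gaps = [0.5, 0.5, 0.5, 0.0]
print(aligned_mean_difference(gaps), sum(gaps) / len(gaps))
```

In this toy case the search settles close to .5 rather than at the average gap of .375. In other words, it attributes the common shift to a factor mean difference and leaves a single item noninvariant, which is the "fewest noninvariant parameters" logic described above.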
Given its very recent development, only a few applications are available to date. De Bondt and Van Petegem (2015) used the Bayesian approximate invariance test with alignment optimization in evaluating the psychometric quality of an overexcitability scale, in which they showed the superiority of this combined approach compared with a conventional CFA with modification indices when comparing male and female students. Desa and Carstens (2015) proposed to apply this approach in the future in large-scale assessment contexts such as the Programme for International Student Assessment (PISA) and the Teaching and Learning International Survey (TALIS), as it legitimates mean comparisons in dozens of groups without requiring full measurement invariance. Weziak-Bialowolska (2014) tested gender ideology in the World Values Survey with CFA with and without alignment. She reported different patterns of country factor means from these two methods, and suggested that comparisons of the country rankings were valid provided that a correction for noninvariance of certain factor loadings and/or intercepts is applied in the alignment framework. Similar to the Bayesian approximate invariance tests, the implications of using alignment on the validity of cross-cultural comparisons await further investigation.
In sum, various older and newer techniques are available to assess measurement invariance and to distinguish error from cultural variance. Given the importance of establishing grounds for comparability when drawing conclusions about how and in which form culture influences psychological states and processes, we further investigate to what extent invariance tests have been implemented in cross-cultural research.
Content Analysis: Implementation of Available Measurement Invariance Methods in Cross-Cultural Psychological Research
Method
Given the availability of methodological and analytical guidelines, cross-cultural psychological research has options for making adequate inferences about comparability and distinguishing between genuine cultural influence and bias. In this section, we present a content analysis of 454 articles reporting 519 cross-cultural studies. The analysis enables us to evaluate how well the available guidelines and techniques have been implemented from 2008 to 2015. We analyzed papers reporting quantitative designs that investigated two or more cultures, and we coded how many of these tested measurement invariance.
We focus our analysis on research published between 2008 and 2015 in the Journal of Cross-Cultural Psychology (JCCP) and two other journals spanning social and developmental psychology, namely, Child Development (CD) and Personality and Social Psychology Bulletin (PSPB). The inclusion of the two other outlets in addition to JCCP offers a comparison and possible generalization of trends in cross-cultural psychological research. We selected the fields of developmental and social–psychological research due to the increasing prominence of culture in their research topics. Furthermore, these fields are the most frequently researched topics in JCCP (cf. Table 2).
Table 2. Research Methods in Cross-Cultural Psychological Research.
Note. Numbers refer to included studies. Numbers in brackets refer to survey studies. JCCP = Journal of Cross-Cultural Psychology; CD = Child Development; PSPB = Personality and Social Psychology Bulletin; EFA = exploratory factor analysis; PCA = principal components analysis; MGCFA = multigroup confirmatory factor analysis; MDS = multidimensional scaling; CFA = confirmatory factor analysis; IRT = item response theory.
Articles were included if they mentioned “culture” or “cultural” in their abstract.
Multiple options were possible.
The following categories were coded: (a) number of sampled cultures, (b) research topic, (c) research design, (d) inclusion of student samples, (e) whether equivalence/invariance tests were conducted, (f) which invariance test was conducted, (g) which level of invariance was found, and (h) whether partial invariance was reported. Our method of analysis is a counting procedure, and we report frequencies.
Results
How well are the suggested methods implemented in cross-cultural psychological research? Table 2 summarizes the results of the content analysis. In JCCP, 382 quantitative studies were coded of which just over half (53%) included two cultures in their research design; 20% analyzed three to four cultures; 5% and 4%, respectively, compared five to nine and 10 to 19 cultures. Large-scale data sets including over 20 cultures were analyzed in 18% of the studies. A large majority of studies (78%) utilized a cross-sectional survey design, only 2% used longitudinal designs, and 14% conducted experiments or observation. University student samples were analyzed in 47% of the studies. Sixty-four out of the 382 studies (17%) assessed measurement equivalence; most of these were equivalence tests applied in survey studies (61 out of 299 survey studies, 20%; see numbers in brackets in Table 2).
The most frequent type of assessment for equivalence was MGCFA (in 41 studies, 11%); multilevel CFA was conducted in three studies (0.8%). EFA or PCA (mostly with Procrustean target rotation) was used in 12 studies (3%). MDS was employed in two studies (0.5%) and IRT modeling was performed in three studies (0.8%). The findings of invariance tests showed weak invariance (e.g., metric invariance) in 19 studies (5%) and strong invariance (e.g., full score invariance) in 16 studies (4.2%). Three papers did not state which level of invariance the results revealed and one study could not establish invariance. Four studies (1%) were able to establish invariance after deleting noninvariant items. Other forms of invariance (e.g., DIF and partial scalar invariance; Eigenhuis, Kamphuis, & Noordhof, 2015) were reported in 25 studies (6.5%). Of the 25 studies that reported other forms of invariance, 24 studies revealed partial invariance. Interestingly, most of the studies that assessed equivalence/invariance reported some form of partial invariance (46 studies, 12%). These findings indicate that full invariance is difficult to achieve. Possible reasons are that the measurements are indeed problematic or that the applied invariance tests are too stringent, in which case more flexible forms that allow some freedom of variation are required.
In addition to studies published in JCCP, we also analyzed CD and PSPB. In both journals, abstracts of papers published between 2008 and 2015 were searched for the inclusion of the term “cultur*.” In PSPB, 55 articles entailed cultural aspects, and culture comparisons were reported in 41 of these articles. These articles reported 106 cross-cultural studies that we analyzed according to their analytical approaches. The large majority of studies (82 studies, 77%) compared two cultures in their research design, 13% analyzed three to four cultures, and 3% analyzed 10 to 19 cultures. Large-scale data sets including over 20 cultures were analyzed in 8% of the studies. About half of the studies (49%) utilized a cross-sectional survey design, only 3% used longitudinal designs, and 39% conducted experiments or observation. University students were sampled in 79% of the studies. Only four out of the 106 studies (4%) assessed measurement equivalence; all of them were survey studies. In three studies, EFA or PCA was conducted for invariance testing; one study utilized MGCFA. The form of assessment covaried with the level of invariance found: Three studies found weak invariance, and one study established strong invariance. Partial invariance was not mentioned in these studies.
Finally, 25 articles reporting 31 culture-comparative studies were coded for the journal CD. Here, a smaller share of studies than in JCCP and PSPB included only two cultures (39%); 52% of the studies compared three to four cultures, and 10% included five to nine cultures. There was an equal distribution of survey studies (45%) and experiments/observations (45%). Ten percent of the studies used longitudinal designs. Unexpectedly, none of the culture-comparative studies tested for measurement invariance.
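The proportions reported above can be recomputed directly from the study counts given in the text. The following Python sketch does so; the dictionary layout and the helper name `percent` are illustrative choices, and the pooled share across the three journals is simple arithmetic on the reported counts rather than a figure reported in the content analysis itself.

```python
# Study counts as reported in the Results text (publications from 2008 to 2015).
counts = {
    "JCCP": {"studies": 382, "tested": 64},
    "PSPB": {"studies": 106, "tested": 4},
    "CD": {"studies": 31, "tested": 0},
}


def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


for journal, c in counts.items():
    share = percent(c["tested"], c["studies"])
    print(f"{journal}: {c['tested']}/{c['studies']} = {share:.1f}%")

# Survey studies in JCCP only (the numbers in brackets in Table 2).
jccp_survey_share = percent(61, 299)

# Pooled share across all coded studies (derived here, not reported in the text).
total_studies = sum(c["studies"] for c in counts.values())
total_tested = sum(c["tested"] for c in counts.values())
pooled_share = percent(total_tested, total_studies)
print(f"Pooled: {total_tested}/{total_studies} = {pooled_share:.1f}%")
```

The study totals add up to the 519 studies mentioned in the Method section, which serves as a basic consistency check on the coded counts.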
Discussion and Ways Forward
Summary of Results
Content analysis revealed that only 17% (64 out of 382 studies) of the cross-cultural comparative quantitative studies published in JCCP assessed measurement invariance; the figures were much lower in one social–psychological (4%) and one developmental (0%) journal. Although we limited our analysis to three journals only, it seems unlikely that other fields in psychology will reveal higher numbers. Hult and colleagues (2008) conducted a content analysis of cross-cultural survey studies published in five management journals (Journal of International Business Studies, Management International Review, Journal of World Business, Strategic Management Journal, and the Academy of Management Journal) from 1995 to 2005. Among the analyzed 167 studies, 24.6% reported metric invariance and 18.5% reported scalar invariance. These numbers are seemingly a bit higher than our findings. However, they are not directly comparable because Hult and colleagues only included survey studies and explicitly excluded experimental designs. Nevertheless, even these numbers reflect insufficient application of invariance testing.
The low penetration of invariance testing, even in the flagship JCCP, was surprising; however, several plausible reasons may have led researchers not to conduct invariance tests. First, a lack of awareness of, and training in, demonstrating measurement invariance before making any comparative inferences may be one reason.
Second, current measurement invariance tests are designed for multi-item measurements, which prevail in (mostly cross-sectional) survey designs. Studies implementing experiments with single-item outcomes (e.g., behavior), or surveys utilizing single items or indicator measures may regard the available invariance assessments as inadequate or not usable. Nevertheless, comparability is pivotal in cross-cultural research, regardless of the research design.
Third, a fair number of studies analyzed large-scale data sets including more than 20 countries. For these studies, country-level analyses using aggregated data are quite common. For such country-level data, invariance is difficult to assess; comparability of scores is most of the time assumed. Preferably, invariance should be assessed before data aggregation. Alternatively, multilevel CFA can be conducted. For instance, cultural values such as Welzel’s emancipation values are typically used as aggregated country scores, and their measurability and meaning at the individual level are assumed to be equivalent across cultures. This assumption has been tested recently and findings show that secular and emancipative values are only equivalent among high-income countries (Alemán & Woods, 2016). In many cases, however, data are only available in aggregated form, rendering invariance testing at the individual level unfeasible. For application of both multigroup and multilevel analyses for the assessment of cultural constructs within the same data set, see Fischer and colleagues (2009).
In the next section, we put forward some recommendations on overcoming analytical problems for various research designs and we discuss possibilities for more flexible and creative solutions. Some of these recommendations are novel and suggestive and await empirical evidence on their usefulness.
Recommendations for Analyses
What to do if measurement invariance cannot be established?
Researchers may feel discouraged from their cross-cultural research if invariance cannot be established. However, findings of invariance tests should never discourage researchers from further exploring the true nature of a cross-cultural phenomenon. Instead, these tests should inform interpretations and conclusions regarding the impact of culture on the human mind. Poortinga (1989) stated that findings of noninvariance can reveal meaningful cross-cultural differences (see also Davidov, Dülmer, Schlüter, Schmidt, & Meuleman, 2012). Demonstrating measurement invariance is vital to draw valid comparative conclusions, yet noninvariance seems to be a persistent issue in multiple culture comparisons. When noninvariance of scales is found, one possibility would be to go back to single-item measures to bypass the invariance issue. Even though these single items may promise to be highly comparable (based on face value), this is a step backward rather than a convincing solution, as scales can capture various facets and have higher reliability compared with single-item measures. The recent developments (e.g., Bayesian approximate invariance, ESEM, and alignment) provide more flexible methods of assessing degrees of measurement invariance for scales. The required level of invariance also depends on the research question: For comparing associations of variables across cultural contexts, metric invariance is desirable, whereas scalar invariance is not a must. More consistent application of these tools can advance our understanding of cross-cultural comparability and comparisons.
Noninvariance may imply important cross-cultural findings that need greater scrutiny, indicating greater relativism of the psychological mind. More importantly, some psychological constructs have a large culture-specific component, and hence, large cross-cultural differences are anticipated. Here, we would expect differences in the measurement properties as well as in the meaning attributed to the construct. Even if lack of measurement invariance does not permit cross-cultural comparisons of mean scores or associations between constructs, cross-cultural data may still provide insightful details. For instance, researchers can examine which contents load highest in which cultures, which external variables predict differences in item biases, and which cultural variables predict differences in loadings or intercepts (Davidov et al., 2012).
So, what can we do if our cross-cultural data are noninvariant? Davidov and colleagues (2012) summarized the following strategies: (a) Partial invariance is sufficient for meaningful comparisons, that is, at least two items of a construct should be invariant; (b) further analyses are conducted only for those countries that showed measurement invariance; (c) noninvariant items are deleted, and further analyses are conducted on those items that showed invariance; (d) sources of item bias can be explored by modeling predictors such as age and gender on item biases; and (e) sources of noninvariance can be explored by modeling country-level predictors on error terms (see below). Each of these suggestions comes with advantages and limitations. The decision on which of these options to choose will depend on the research question and/or the data properties.
Single-item measures
How can we assess invariance for single-item measures? Currently, there is no psychometric tool specifically designed to test measurement invariance of single-item measures. New methods need to be developed; meanwhile, alternative routes may be pursued to get some sense of the comparability of single-item measures. We propose three possible ways forward: (a) examining correlations with external variables, (b) combining multiple single-item outcomes into a latent construct, and (c) multilevel modeling using random-effects models.
First, in single-level analysis, validation of a single item across cultures may involve correlations with external variables to establish the nomological network and validity of the construct. The external variables are selected based on theoretical considerations and previous empirical support for validity. This is also referred to as external linkages validation (see, for example, Welzel & Inglehart, 2016). External linkages can paint a holistic picture of the construct of interest because these external linkages contribute to convergent and discriminant validity. This analysis could be done using correlations or MDS in nomological network–type analysis. This can also be applied to multi-item constructs, in which the items across cultures vary and are culture-specific (see, for example, Boehnke et al., 2014).
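As an illustration of the external linkage idea, the following Python sketch compares the correlation between a single item and an external variable across two cultural samples using Fisher's z transformation. It is a minimal sketch under stated assumptions: the function names `pearson_r` and `compare_correlations` and the example values are hypothetical, the two samples are assumed to be independent, and a nonsignificant difference is at best indirect support for comparability, not a test of measurement invariance.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def compare_correlations(r1, n1, r2, n2):
    """Fisher z-test for the difference between two independent correlations.

    Returns the z statistic and the two-sided p value.
    """
    if n1 <= 3 or n2 <= 3:
        raise ValueError("each sample needs more than three observations")
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p value
    return z, p


# Hypothetical item-criterion correlations in two cultural samples.
z, p = compare_correlations(r1=0.45, n1=200, r2=0.40, n2=180)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With more than two cultures, the same comparison could be repeated pairwise or replaced by a meta-analytic heterogeneity test; which external variables to use remains a theoretical decision, as noted above.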
Second, if no external linkage variables are available, multiple outcomes could be included in the research design to estimate a latent variable. In contrast to multi-item measures that are developed to capture narrow psychological constructs, multiple single-item indicators of a broader psychological phenomenon could be subsumed in a latent variable. This suggestion is akin to multiple indicators multiple causes (MIMIC) models that include multiple constructs, each measured using a single item (e.g., prosociality entailing helping behavior and empathetic feelings).
The third suggestion to deal with bias in single-item measures is to estimate random effects of single-item constructs in a multilevel analysis (Raudenbush & Bryk, 2002). Possible noninvariance of single items could be taken into consideration using random-effects estimation in multilevel modeling. This procedure has the advantage over fixed effects models that measurement noninvariance is modeled explicitly. In addition, sources of noninvariance can be assessed. Davidov and colleagues (2012) addressed noninvariance by modeling differences in intercepts of a latent factor (i.e., lacking scalar invariance) and entering a between-level latent variable and a random term. The between-level latent variable was predicted by the external variable Human Development Index (HDI), which contributed significantly to explain why the highest level of invariance could not be established. This means that the variance of the random term was reduced after entering the predictor. Even though this example used a three-item measure, we argue that this could be applied to single-item measures as well.
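A rough first step toward such a multilevel analysis is to quantify how much of the variance in a single item lies between cultures. The sketch below, assuming a hypothetical helper `variance_components` and made-up scores, uses the one-way random-effects ANOVA estimator to obtain the between-culture variance, the within-culture variance, and the intraclass correlation. This is a variance decomposition only: it does not by itself separate measurement noninvariance from substantive cultural differences, and it is far simpler than the multilevel model with a country-level predictor used by Davidov and colleagues (2012).

```python
def variance_components(groups):
    """One-way random-effects ANOVA estimates for a single item.

    groups: dict mapping a culture label to a list of item scores.
    Returns (between_variance, within_variance, icc).
    """
    k = len(groups)
    sizes = [len(scores) for scores in groups.values()]
    n_total = sum(sizes)
    grand_mean = sum(sum(scores) for scores in groups.values()) / n_total

    ss_between = 0.0
    ss_within = 0.0
    for scores in groups.values():
        mean = sum(scores) / len(scores)
        ss_between += len(scores) * (mean - grand_mean) ** 2
        ss_within += sum((x - mean) ** 2 for x in scores)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    # Effective group size; equals the common n when groups are balanced.
    n0 = (n_total - sum(n * n for n in sizes) / n_total) / (k - 1)
    # Negative estimates are truncated at zero, as is conventional.
    between_var = max((ms_between - ms_within) / n0, 0.0)
    icc = between_var / (between_var + ms_within)
    return between_var, ms_within, icc


# Hypothetical single-item scores from three cultural samples.
example = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}
between, within, icc = variance_components(example)
print(f"between = {between:.2f}, within = {within:.2f}, ICC = {icc:.2f}")
```

In a full multilevel model, the reduction in the between-culture variance after entering a country-level predictor such as the HDI would correspond to the reduction of the random term described above.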
Experimental research
Another area that requires attention is invariance of experimental manipulations across cultures. In experiments, it is common to manipulate one or more variables to different degrees (reflected in different conditions) to scrutinize causal impact on an outcome. However, we cannot be sure whether the stimulus materials are invariant in their meaning. Pretests may reveal that the effects on outcomes are similar across cultures and this may be taken as an assurance of comparability of manipulations (in line with our discussion on correlations with external variables). However, invariance of experimental manipulations should be assessed explicitly and independently of its effects on the outcome. A possible solution would be to employ a manipulation check with multiple items. This allows two sets of analyses: (a) testing measurement invariance of manipulation check items and (b) assessing the similarity of associations between experimental manipulation and manipulation check. If measurement invariance of manipulation checks is achieved, an indirect indication of invariance of the manipulated construct is provided. Moreover, the correlations between stimuli and manipulation check can also provide insightful information on the comparability of experimental material and procedures. Once cross-cultural comparability of stimulus material is established, effects of an independent variable on a dependent variable across cultures can be compared and true cross-cultural differences or similarities can be assessed.
Early experimental research on perception and cognition provides excellent examples of targeting invariance issues in experimental work (Deregowski, 1980, 1989; Deregowski & Bentley, 1986; Segall, Campbell, & Herskovits, 1966). Segall et al. (1966; for a summary, see Berry, Poortinga, Breugelmans, Chasiotis, & Sam, 2011) assessed whether respondents understood the visual illusion tasks and checked the equivalence of the independent variable across cultural groups.
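Analysis (b) can be sketched by computing, within each culture, the standardized effect of the manipulation on the manipulation-check score and then comparing these effects across cultures. The Python example below is a minimal sketch: the helper `cohens_d`, the culture labels, and all scores are hypothetical, and the composite manipulation-check scores are assumed to have passed the invariance test in analysis (a).

```python
import math


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    m1, m2 = sum(treatment) / n1, sum(control) / n2
    v1 = sum((x - m1) ** 2 for x in treatment) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in control) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd


# Hypothetical manipulation-check composites by culture and condition.
data = {
    "Culture A": {"treatment": [5, 6, 7, 6], "control": [3, 4, 3, 4]},
    "Culture B": {"treatment": [4, 5, 4, 5], "control": [3, 4, 4, 3]},
}

effects = {
    culture: cohens_d(d["treatment"], d["control"])
    for culture, d in data.items()
}
for culture, d_value in effects.items():
    print(f"{culture}: d = {d_value:.2f}")
```

Markedly different effect sizes across cultures, as in this made-up example, would suggest that the manipulation was not perceived equally strongly and that outcome differences should be interpreted with caution.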
Utilization of metadata
With computer-based assessments that record event logs during survey responding or experiments, additional metadata (i.e., log files) can be used to validate constructs, enhance measurement, and detect aberrant responses (e.g., Bassili & Fletcher, 1991). For example, Goldhammer et al. (2015) linked time on task in an online assessment to task difficulty and individual skills, and they proposed including a random response time effect in the measurement model. If used cross-culturally, measurement comparability can be enhanced. In the Programme for the International Assessment of Adult Competencies (PIAAC), response time in the cognitive assessment has been used to indicate missing values or nonresponse (Organisation for Economic Co-Operation and Development [OECD], 2013).
Recommendations for Research Designs and Assessment (Preventive Measures)
Extensive pretests
Enhancing data comparability across cultures starts from the design and administration of a study. Cognitive pretests of items and/or scales may enhance functional equivalence, which can lead to stronger equivalence in the remaining equivalence levels. Cognitive interviewing is a technique that uncovers the cognitive processes respondents use when answering questions (see, for example, Behr, Braun, Kaczmirek, & Bandilla, 2014). Cognitive interviewing can serve as a preventive measure to ensure appropriate item selection and to ensure that the item content is understood in a similar manner across cultures (Willis, 2015). Recent approaches combine a so-called web probing procedure (asking participants directly how they understand items in online surveys) with quantitative measurement invariance tests. This novel approach seems to be a fruitful development because it combines qualitative insight into functional equivalence with quantitative insight from measurement invariance testing, which ensures comparability (Meitinger, 2017).
Innovative design features
Innovative design features such as anchoring vignettes, the measurement of overclaiming (i.e., a technique to capture the self-enhancement tendency independent of one’s ability, which is then used to correct content scores), and alternative item formats (e.g., forced-choice format) have been proposed to improve comparability (Kyllonen & Bertling, 2014; see also Tourangeau, Rips, & Rasinski, 2000). Some of these approaches have proven effective. For instance, using anchoring vignettes to rescale Likert-type scale reports can improve levels of measurement invariance (He, Buchholz, & Klieme, 2017). The Big Five personality traits measured with a forced-choice format instrument showed scalar invariance in multiple countries (Bartram, 2013).
Context assessment
Cross-cultural research is concerned with the impact of cultural aspects on psychological outcomes and processes. At the same time, individuals create, maintain, and change culture and the elements that construe culture. Disregarding meso-level processes (e.g., family or community effects), cross-cultural data possess at least two levels: individual and cultural. Implementing both levels (e.g., capturing individual psychological constructs and cultural dimensions) into a cross-cultural research design enables testing measurement invariance as well as assessing sources of cultural variations in measurement aspects (sources of noninvariance) and in psychological outcomes (associations and/or means if appropriate levels of invariance are established). Researchers should not anticipate invariance by default. Instead, studies could be designed that enable further analyses independent of the outcome of invariance tests. In contrast, researchers whose research questions concern societal or cultural processes (e.g., relationships between country-level variables for elucidating societal functioning) may question the importance of individual-level comparability. Welzel and Inglehart (2016) replied to the noninvariance criticism (see Alemán & Woods, 2016) by arguing that external links are sometimes more important than individual-level consistencies. This debate may partly be steered by different disciplinary foci. Yet, we maintain that the main issue remains: Cross-cultural noninvariance at the level of assessments requires scrutinizing what exactly is being captured. External linkages may account for cultural influences as much as they could explain differences in measurement properties. Hence, external linkages are a welcome supplement, but not a substitute for invariance testing.
Sampling of cultures
Our content analysis also showed that two-country comparisons still prevail in the field. Culture can be defined in numerous ways: most prominently, individualism–collectivism (Hofstede, 2001), egalitarian values (Schwartz, 2006), secular and emancipative values (Welzel & Inglehart, 2016), or the concept of tightness (e.g., Gelfand et al., 2011), just to mention a few cultural dimensions. Countries are selected on the basis of these cultural dimensions, by, for instance, selecting countries at far ends of a cultural continuum. Such theory-based sample selection is similar to an experimental manipulation aimed at testing the impact of that cultural dimension on psychological states or processes. The common set of comparisons involves two cultures or countries; however, using only two points (samples) is a poor representation of a cultural dimension and, as a consequence, misinterpretations of differences are likely. The interpretation paradox means that differences between samples and/or countries that vary regarding social, economic, and cultural factors are easily found, but the interpretation of these differences and the factors that explain them are difficult to determine (van de Vijver & Leung, 2000). Consequently, if only two countries are compared and differences are found, the “cultural differences” may also be due to any other characteristics on which the samples vary, and we cannot clearly determine what exactly explains them. Therefore, we recommend sampling more than two cultures and ruling out alternative explanations.
New forms of data collection
Recently, novel forms of data collection and data retrieval have been embraced by psychologists, such as (a) publicly accessible online data (also referred to as big data) including public data from social media (Facebook, Twitter, Tumblr, Snapchat) or data mining, (b) response time measures, (c) geo data (environmental data, Global Positioning System [GPS] allocation), and (d) mobile data (mobile psychophysiological data collection, mobile phone interactions including rapid response systems, Internet traffic); categories (c) and (d) also involve the Internet of Things (IoT). The advantage of these novel methods is that information beyond individuals’ self-reports is captured as a proxy for behaviors and motivations. How can we promote these new methods for cross-cultural research? These data sources can be collected and aggregated at the cultural level. They provide fruitful cross-cultural insights into human behaviors and psychological responses. As data assessments are commonly not at the individual level (or data sources cannot be traced to individuals), common invariance tests may not always be suitable. One creative solution may be to assess the relationships between multiple indicators (as validity checks rather than structural paths) and to plot their interrelations in MDS. The dimensional solutions can then be compared using GPA between cultures as an indication of functional and structural similarity.
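The comparison of dimensional solutions can be illustrated for the simplest case of two cultures and two-dimensional MDS configurations. The Python sketch below assumes that the MDS coordinates of the same indicators have already been obtained in each culture (the MDS step itself is not shown); it then removes differences in location, scale, rotation, and reflection and returns a similarity value between 0 and 1. The function names and coordinates are hypothetical, and this two-configuration Procrustes alignment is a simpler relative of GPA (generalized Procrustes analysis), not a full implementation of it.

```python
import math


def _center_and_scale(config):
    """Center a 2D configuration at the origin and scale it to unit norm."""
    n = len(config)
    cx = sum(p[0] for p in config) / n
    cy = sum(p[1] for p in config) / n
    centered = [(x - cx, y - cy) for x, y in config]
    norm = math.sqrt(sum(x * x + y * y for x, y in centered))
    if norm == 0:
        raise ValueError("configuration has no spread")
    return [(x / norm, y / norm) for x, y in centered]


def procrustes_similarity(target, other):
    """Similarity of two 2D configurations after optimal alignment.

    Both arguments are lists of (x, y) coordinates for the same indicators
    in the same order. A value of 1.0 means identical shape.
    """
    if len(target) != len(other):
        raise ValueError("configurations must contain the same indicators")
    a = _center_and_scale(target)
    b = _center_and_scale(other)
    best = 0.0
    for flip in (1.0, -1.0):  # without and with reflection of the y axis
        c = sum(ax * bx + ay * flip * by for (ax, ay), (bx, by) in zip(a, b))
        s = sum(ay * bx - ax * flip * by for (ax, ay), (bx, by) in zip(a, b))
        # hypot(c, s) is the fit at the optimal rotation angle atan2(s, c).
        best = max(best, math.hypot(c, s))
    return best


# Hypothetical MDS coordinates of four indicators in two cultures.
culture_a = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
culture_b = [(2.0, 2.0), (2.0, 4.1), (0.1, 3.9), (0.0, 2.0)]
print(f"similarity = {procrustes_similarity(culture_a, culture_b):.3f}")
```

Which similarity value counts as sufficient is a matter of convention and should be justified for the indicators at hand; with more than two cultures or more than two dimensions, a matrix-based GPA routine would be needed.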
Conclusion
Our aim was to review methods of invariance testing in cross-cultural studies and examine their current prevalence in research. The most surprising finding was that measurement invariance tests are still treated as a stepchild, even in the flagship JCCP in which only 17% of quantitative culture-comparative studies assessed measurement invariance. In two journals of social psychology and developmental psychology, the numbers are even lower (4% in PSPB, 0% in CD). The most pressing issue revealed in our review is the lack of measurement invariance testing. Our main conclusion here is that invariance testing is not an add-on analysis, but a necessity for making meaningful cross-cultural comparisons. This is comparable with the taken-for-granted inclusion of psychometric properties analysis (e.g., Cronbach’s α or ω) of a given scale to communicate that the scale is reliable. We urge authors to include measurement invariance testing as a common part of the methods section in which the reliability and comparability of cross-culturally assessed measures are established and reported in detail to ensure that the conclusions drawn are not prone to bias. Comparability concerns are not limited to survey designs; novel, flexible, and creative assessments may need to be developed for a broader implementation of invariance tests across diverse research designs. Hence, we advocate always including measurement invariance testing in any classical, modern, or creative form, because findings of (partial) invariance and noninvariance alike enhance the interpretability, meaning, and impact of cross-cultural research.
Acknowledgements
We would like to thank Catalina D. Dumitru, Tanja Baumeister, Vanessa Grebe, and Jan Stockhausen for their help in coding for this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
