Abstract
In this systematic review, we assessed available evidence for cross-cultural measurement invariance of assessment scales for child and adolescent psychopathology as an indicator of cross-cultural validity. A literature search was conducted using the Medline, PsycINFO, Scopus, Web of Science, and Google Scholar databases. Cross-cultural measurement invariance data were available for 26 scales. Based on the aggregation of the evidence from the studies under review, none of the evaluated scales have strong evidence for cross-cultural validity and suitability for cross-cultural comparison. A few of the studies showed a moderate level of measurement invariance for some scales (such as the Fear Survey Schedule for Children-Revised, Multidimensional Anxiety Scale for Children, Revised Child Anxiety and Depression Scale, Revised Children's Manifest Anxiety Scale, Mood and Feelings Questionnaire, and Disruptive Behavior Rating Scale), which may make them suitable for cross-cultural comparative studies. The remainder of the scales showed either weak measurement invariance or none at all. This review showed only limited testing for measurement invariance across cultural groups of scales for pediatric psychopathology, with evidence of cross-cultural validity for only a few scales. This study also revealed a need to improve the reporting of statistical analyses in measurement invariance testing. Implications for future research are discussed.
Introduction
Childhood psychopathology is increasingly recognized as an important issue in global childhood morbidity (Palfrey, Tonniges, Green, & Richmond, 2005) due to the high and increasing contribution of psychopathology to disease burden among children (Simpson, Bloom, Cohen, Blumberg, & Bourdon, 2005; Smit et al., 2009). This observation has spurred a deluge of epidemiological research establishing the prevalence and characteristics of childhood psychopathology across the globe (e.g., Merikangas, Nakamura, & Kessler, 2009) and has increased global attention to child and adolescent mental health (CAMH) initiatives (World Health Organization, 2003). Attention to childhood psychopathology has led to the development of different assessment scales. The most recent purposive review located 103 published scales for assessing childhood psychopathology (Verhulst & van der Ende, 2006). The reliability and validity of these scales has been examined and many are currently used in research and clinical settings across the world.
Cross-cultural differences in childhood psychopathology continue to pose a challenge to the use of these scales in cross-cultural studies given that the prevalence rates and characteristics of childhood psychopathology differ across cultural/ethnic groups (e.g., Achenbach, Rescorla, & Ivanova, 2012; Canino & Alegria, 2008). For example, in a recent study using the Strengths and Difficulties Questionnaire (SDQ), a 2.8-fold difference in the rates of general psychopathology was observed among adolescents across several countries (Atilola, Balhara, Stevanovic, Avicenna, & Kandemir, 2013). There are many potential sources for differences across cultures when a quantitative scale is used among children. These include: inherent cross-cultural/ethnic differences due to economic, social, and cultural factors (e.g., Camras & Fatani, 2006; Hackett & Hackett, 1999; Lehman, Chiu, & Schaller, 2004; Mabe & Josephson, 2004; Nikapota & Rutter, 2008); variations in evaluation methods used; variations in the level of child development; and differences in the expression of specific psychopathology (e.g., Achenbach et al., 2012; Goodman et al., 2012; Heiervang, Goodman, & Goodman, 2008).
More distinctly, differences in prevalence rates and psychopathological expressions might be imposed by the theoretical construct of the assessment method used (i.e., construct validity). The assessment method may not necessarily operate in the same way and its underlying construct might not have the same theoretical structure for different cultural/ethnic groups (i.e., lack of measurement invariance), leading to biased estimations (Borsboom, 2006; Dimitrov, 2010). The prevailing assumption among researchers using health assessment scales is that if the theoretical construct (i.e., underlying factorial structure) of a scale developed in one language is replicated across different language groups, this will guarantee that the scale will operate equivalently across these groups and as such is suitable for cross-cultural/ethnic comparisons (e.g., Byrne & Watkins, 2003). However, a prerequisite for cross-cultural/ethnic comparisons is that the theoretical construct is measured in each culture in the same way—that is, that construct equivalence is achieved for the scale representing the theoretical construct when tested simultaneously across cultural/ethnic groups (He & van de Vijver, 2012).
Therefore, in order to compare estimates by one scale across various cultures/ethnic groups, it needs to be demonstrated that its factorial structure is invariant across different ethnic/cultural groups (i.e., cross-cultural factorial invariance; Borsboom, 2006; Byrne & Watkins, 2003; Dimitrov, 2010; Gregorich, 2006; Milfont & Fisher, 2010). Establishing the cross-cultural validity of scales used in CAMH research will improve the accuracy of comparative estimation of regional burden of childhood psychopathology and the tracking of progress of interventions in multinational contexts. In addition, valid cross-cultural research has been identified as one of the gaps in global CAMH research and has been suggested as a key agenda for advancing the utility of CAMH research in multinational contexts (Atilola, 2015).
One important question that is yet to be empirically answered is: How many of the over 100 scales currently used in quantitative CAMH research have cross-cultural validity? A first step in answering this question is to determine how many of the scales have been tested for cross-cultural validity. The second step is to determine how many, of those that have been tested, have been found to have cross-cultural validity and at what level of evidence. Answering these questions can not only guide cross-cultural CAMH researchers around the world on the suitability of available scales for such research, but can also set future directions for cross-cultural validation of CAMH scales. Accordingly, in this paper, we present a systematic review of data from studies that have tested original and different language versions of available scales for pediatric psychopathology for cross-cultural validity.
Method
Search strategy and study selection
A literature search was conducted using the Medline, PsycINFO, Scopus, Web of Science, and Google Scholar databases to identify studies on cross-cultural measurement invariance. The general eligibility criteria were (a) the study sample included children and/or adolescents; (b) a scale that assessed an aspect of child and/or adolescent psychopathology was evaluated for measurement invariance across at least two ethnic/cultural groups; and (c) the study provided details on the method used and results of measurement invariance testing. The final search was conducted on September 30, 2014. The following search terms with their variations were used: scale name or psychopathological symptoms were combined with “measurement invariance” or “measurement equivalence” or “factorial invariance” or “factorial equivalence” or “differential item functioning” or “cross-ethnic” or “cross-racial” or “cross-cultural” or “cross-national” or “cross-country.” The age filter (children/adolescents up to 18 years) was used during the search, while no language or publication date limitations were imposed.
Two coders (DS and PJ) extracted data from selected studies including: the scale name, age group and population, country, language version, cultural/ethnic groups evaluated, measurement invariance method used, and the main outcome for cross-cultural measurement invariance. All search results were combined into a single master database and duplicates removed. Figure 1 presents the PRISMA flow diagram for the current review (Preferred Reporting Items for Systematic Reviews and Meta-Analyses [PRISMA]; Moher, Liberati, Tetzlaff, Altman, & the PRISMA Group, 2009).
Figure 1. Flow diagram of study selection.
Definition and determination of cross-cultural validity
Evaluation of cross-cultural factorial invariance of a scale is usually aimed at testing measurement invariance, that is, how the items measure the latent construct of the scale across cultural/ethnic groups, and structural invariance, how the latent factors are distributed and related in the separate populations (Dimitrov, 2010; Meredith & Teresi, 2006). Measurement invariance generally refers to the invariant operations of the items or the extent to which the content of each item is being perceived and interpreted in exactly the same way across different cultural/ethnic groups. In other words, the item does not exhibit differential item functioning (DIF) across the groups if taken from the perspective of item-response theory (IRT). Structural equivalence refers to the underlying theoretical structure of the measuring scale or the extent to which the latent factors are distributed and related in the same way in the separate populations (Dimitrov, 2010).
There are several methods for establishing measurement invariance or for testing DIF. The most frequently used methods are multigroup confirmatory factor analysis (MG-CFA) based on structural equation modeling (SEM), and DIF detection with IRT. Both approaches test for the equality of item-level and subscale-level true scores for persons with identical latent scores, irrespective of group membership (Raju, Laffitte, & Byrne, 2002). The third frequently used method is ordinal logistic regression (OLR) to detect DIF, which is based principally on observed scale scores rather than on latent scores as in the previous two. Furthermore, it is also possible to use exploratory factor analysis to assess factorial invariance (i.e., factorial similarity) using the coefficient of congruence (rc) and the salient variable similarity index (s), which is derived from the relation between pairs of factor loadings for corresponding factors and denotes shared variance between factors (Cattell, 1978; Reynolds & Carson, 2005). Finally, the multiple indicators multiple causes (MIMIC) model, a method also based on SEM, is used for measurement invariance testing (Joreskog & Goldberger, 1975). MIMIC models can test whether members of different groups vary in the probability of endorsing an item after being equated on the underlying latent trait that the item is intended to measure (Joreskog & Goldberger, 1975).
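As an illustration of the factorial similarity approach mentioned above, the following minimal Python sketch computes a coefficient of congruence between the loadings of the same items on corresponding factors in two groups. It assumes that rc denotes the standard (Tucker) congruence coefficient; the function name and the loading values are hypothetical and are not taken from any reviewed study.

```python
import math

def congruence_coefficient(loadings_a, loadings_b):
    """Congruence coefficient between two factor-loading vectors.

    Assumed formula (Tucker's coefficient):
        rc = sum(a_i * b_i) / sqrt(sum(a_i^2) * sum(b_i^2))
    """
    if len(loadings_a) != len(loadings_b):
        raise ValueError("loading vectors must cover the same items")
    cross = sum(a * b for a, b in zip(loadings_a, loadings_b))
    norm_a = math.sqrt(sum(a * a for a in loadings_a))
    norm_b = math.sqrt(sum(b * b for b in loadings_b))
    return cross / (norm_a * norm_b)

# Hypothetical loadings of five items on one factor in two cultural groups.
group_1 = [0.72, 0.65, 0.58, 0.70, 0.61]
group_2 = [0.68, 0.60, 0.62, 0.74, 0.55]

rc = congruence_coefficient(group_1, group_2)
print(f"rc = {rc:.3f}")
```

The coefficient depends only on the pattern of loadings, not on item intercepts, so a high value speaks to factorial similarity and not to the comparability of group means.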
In the present review, a systematic approach was applied to assess the strength of the evidence for the cross-cultural validity of CAMH scales, based on the methods used to establish cross-cultural measurement invariance. When MG-CFA is used, several types of measurement invariance form a nested hierarchy: dimensional, configural, metric, scalar, and error invariance (Byrne & Watkins, 2003; Gregorich, 2006). When testing one scale across different groups, dimensional invariance means that the same number of common factors is present; configural invariance means that the same items are associated with the same factors; metric (i.e., weak measurement) invariance means that the common factors have the same meaning, as indicated by equal factor loadings; scalar (i.e., strong measurement) invariance means that the intercepts or thresholds of the items are equivalent; and error (i.e., strict measurement) invariance means that the regression residual variances for all items are equal. At least strong measurement invariance must exist in order to allow latent mean comparisons across groups and to claim that a scale is suitable for cross-cultural comparisons.
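The nested hierarchy described above can be sketched as a simple decision rule. The Python snippet below is only an illustration: the function names are hypothetical, and the pass/fail outcomes are assumed to come from nested-model comparisons carried out elsewhere (e.g., in SEM software), which the sketch does not perform.

```python
# Ordered from least to most restrictive, following the hierarchy in the text.
LEVELS = ["dimensional", "configural", "metric", "scalar", "error"]

def highest_level_supported(passed):
    """Return the most restrictive level reached without skipping a step.

    `passed` maps a level name to True/False (hypothetical test outcomes).
    Because the levels are nested, evaluation stops at the first failure.
    """
    reached = None
    for level in LEVELS:
        if not passed.get(level, False):
            break
        reached = level
    return reached

def latent_means_comparable(passed):
    """Latent mean comparisons require at least scalar (strong) invariance."""
    reached = highest_level_supported(passed)
    return reached is not None and LEVELS.index(reached) >= LEVELS.index("scalar")

# Hypothetical scale: loadings are equal across groups, intercepts are not.
outcomes = {"dimensional": True, "configural": True, "metric": True, "scalar": False}
print(highest_level_supported(outcomes))   # metric
print(latent_means_comparable(outcomes))   # False
```

The sketch treats each level as all-or-nothing; partial invariance, in which only some items satisfy a given level, would need item-level bookkeeping that is not shown here.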
In situations where there is no perfect type of measurement invariance (i.e., full measurement invariance), but neither is there complete noninvariance, it is possible to talk about partial measurement invariance (Byrne & Watkins, 2003; Dimitrov, 2010). In the case of partial measurement invariance, only those items that meet criteria for a strong measurement invariance model should be included in composite measures when scores for the scales are to be compared cross-culturally (Gregorich, 2006). To claim that a scale is suitable for cross-cultural comparisons when considering the presence of DIF through common IRT or OLR parameterizations, there must be no DIF items, or only a few DIF items associated with negligible individual- or group-level impact (Meredith & Teresi, 2006). With IRT and OLR, two types of DIF, uniform and nonuniform, can be detected. Uniform DIF is evident when the difference in item response probabilities is constant across the scale. Nonuniform DIF occurs when the direction of DIF differs in different parts of the construct scale. Finally, when the coefficient of congruence is used for measurement invariance testing, an rc value of 0.90 or higher is an arbitrary but generally used threshold for indicating invariance across groups (Reynolds & Carson, 2005).
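The distinction between uniform and nonuniform DIF can be illustrated with a small Python sketch. It assumes a two-parameter logistic (2PL) item response function with hypothetical discrimination (a) and difficulty (b) parameters for a reference and a focal group; the reviewed studies may have used other parameterizations, and the classification rule below is a simplification for illustration, not a DIF detection procedure.

```python
import math

def p_endorse(theta, a, b):
    """2PL item response function: probability of endorsing an item."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def classify_dif(ref, focal, thetas, tol=1e-6):
    """Classify DIF from the sign pattern of group differences along theta.

    ref, focal: (a, b) tuples of hypothetical item parameters per group.
    Returns "none", "uniform" (the same group is favoured at every trait
    level on the grid), or "nonuniform" (the favoured group changes).
    """
    diffs = [p_endorse(t, *ref) - p_endorse(t, *focal) for t in thetas]
    signs = {1 if d > 0 else -1 for d in diffs if abs(d) > tol}
    if not signs:
        return "none"
    return "uniform" if len(signs) == 1 else "nonuniform"

thetas = [x / 2 for x in range(-6, 7)]  # latent trait grid from -3 to 3

# Same discrimination, shifted difficulty: one group favoured throughout.
print(classify_dif((1.2, 0.0), (1.2, 0.5), thetas))  # uniform
# Different discrimination: the curves cross, so the direction reverses.
print(classify_dif((1.5, 0.0), (0.7, 0.0), thetas))  # nonuniform
```

Because the rule only inspects the chosen grid, an item whose curves cross outside the range of -3 to 3 would be labelled uniform; in practice, DIF is established through statistical tests on observed responses rather than by comparing known parameters.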
Best evidence synthesis
Table 1. Level of evidence for cross-cultural validity.
Where three or more studies were available for one scale and a clear majority of them, though not all, demonstrated full/partial strong measurement invariance or no/few DIF items, the level of evidence was estimated on the basis of these studies but graded one level lower.
Results
Table 2. Main characteristics of studies evaluating cross-cultural measurement invariance.
Note. *MG-CFA = multigroup confirmatory factor analysis; EFA = exploratory factor analysis; CFA = confirmatory factor analysis; OLR = ordinal logistic regression; IRT = item-response theory; MIMIC = multiple indicators multiple causes. **The scale has items that are cross-culturally noninvariant.
For symptom-specific scales, there were moderate levels of evidence for cross-cultural validity for the Fear Survey Schedule for Children-Revised (FSSC-R) and the Multidimensional Anxiety Scale for Children (MASC) self- and parent report, Revised Children's Manifest Anxiety Scale (RCMAS) self-report, Mood and Feelings Questionnaire (MFQ) self-report, Revised Child Anxiety and Depression Scale (RCADS) self-report, and the Disruptive Behavior Rating Scale (DBRS) parent report.
Considering the Strengths and Difficulties Questionnaire (SDQ) as a scale for general psychopathology, conflicting evidence was found from six studies for measurement invariance of its self-report version, while one study reported weak evidence for the parent report and strong evidence for the teacher report. For the Child Behavior Checklist (CBCL) and Teacher Report Form (TRF), findings from all six available studies showed no evidence of measurement invariance. There was a moderate level of evidence for the Youth Self Report (YSR).
The rest of the scales showed weak, conflicting, or no evidence of cross-cultural measurement invariance (Table 2).
Discussion
Demonstrating cross-cultural measurement invariance for a scale implies generalizability of aspects of its construct validity such that the scores of that scale generalize across different cultural/ethnic groups. This is a prerequisite for cross-cultural comparisons of psychopathology, which assume that the scale measures the same theoretical construct in each culture in the same way (He & van de Vijver, 2012). In the present systematic review, we have evaluated the suitability of available CAMH scales for cross-cultural comparisons based on their documented measurement invariance.
This review found that there has been limited testing for measurement invariance across cultural/ethnic groups of scales used to assess pediatric psychopathology, either in their original or translated versions. Although about 100 different scales for child and adolescent psychopathology were published before 2006 (Verhulst & van der Ende, 2006), and certainly many more since, we could locate only 26 scales with some data about cross-cultural measurement invariance. Available studies mostly evaluated scales for anxiety, depressive, mania, and behavioral symptoms (predominantly attention deficit/hyperactivity disorder), and general psychopathology, predominantly in general populations of children and adolescents. A great majority of the data available on cross-cultural measurement invariance was for the original scales developed in the USA, with two to four ethnic groups evaluated. There were only 11 scales with measurement invariance data available from other countries using different language versions, in studies evaluating either ethnic groups in one country or cultural groups across several countries.
The overall evidence suggests that few of the pediatric psychopathology scales evaluated in the present review have strong evidence for cross-cultural validity, which suggests that cross-cultural comparison of childhood psychopathology using currently available scales should be a cautious exercise. Based on the findings of the current study, moderate levels of evidence were found for the FSSC-R, MASC, RCMAS, MFQ, RCADS, and DBRS. Fortunately, these scales measure the common child and adolescent psychopathologies like anxiety, depression, and disruptive behaviors. However, the two studies evaluating the DBRS only for symptoms of attention deficit hyperactivity disorder and oppositional defiant disorder showed some DIF items and these items should not be included in the composite measures when scores for these scales are to be compared cross-culturally (Gregorich, 2006). The same applies to the RCADS, because one study demonstrated partial strong measurement invariance. The current data for other symptom-specific scales indicate that no or weak evidence exists for their use in cross-cultural comparisons.
Considering general psychopathology scales, there was weak evidence that the Short Form Assessment for Children could be used in cross-cultural comparisons (SAC; Glisson et al., 2002). All available studies for the SDQ showed conflicting evidence for measurement invariance for its self-report, weak for the parent report, but strong for the teacher report. In particular, two well-designed studies including different language versions of the SDQ from 12 countries in total demonstrated cross-cultural measurement noninvariance of the self-report version (Essau et al., 2012; Stevanovic et al., 2014), which could be sufficient to claim that the SDQ self-report is not suitable for use in cross-cultural comparisons. Additionally, the SDQ parent and teacher report in English and Dutch were only evaluated across ethnic groups of two countries. Further testing of the two forms across several countries is needed in order to claim conclusive evidence for their use in cross-cultural comparisons. On the same note, considering the Achenbach System of Empirically Based Assessment (ASEBA; Achenbach et al., 2008), for the CBCL and TRF there was no evidence that the two could be used in cross-cultural comparisons, while there was a moderate level of evidence for the YSR. Thus, this review found that the SDQ and the ASEBA might have cross-cultural measurement noninvariance in assessing general psychopathology and their scores might not generalize across different cultural/ethnic groups or need to be considered in terms of specific population norms if such are available across different cultural/ethnic groups.
A key limitation of our review is the possibility of publication bias: we examined only articles published in English, and there may be relevant data published in other languages in local journals or in sources not available through the searches carried out. We also observed that authors frequently titled their articles or chose keywords in such a way that it may not be apparent that the study assessed measurement invariance. Additionally, we could have missed studies that would have been identified by a hand search or by accessing grey literature. Furthermore, we used an approach to assess levels of evidence for cross-cultural comparisons that has not been used previously; it needs to be evaluated further, or another method should be considered.
With regard to the included studies, a great majority of the studies dealt with measurement invariance of the scales within a single country, considering only migrant ethnic groups, and thus their findings might not generalize to ethnic minority children or adolescents in their host nations/countries. In addition, data from two or more studies were available for only eight scales. Finally, MG-CFA was the most common method used to assess measurement invariance in the reviewed papers, but there were weaknesses in the presentation of results. While the items in the reviewed instruments were all polytomously scored, the MG-CFA used in five studies (Russell et al., 2008; Steele et al., 2006; Trent et al., 2013; Varela & Biggs, 2006; Veen et al., 2011) assumed that observed items were continuous and normally distributed. This is a common mistake, which could yield incorrect results in testing measurement invariance (Kim & Yoon, 2011). Moreover, when assessing measurement invariance, configural invariance should be distinguished from dimensional invariance. Dimensional invariance means that an instrument consists of the same number of factors across groups, while configural invariance shows that each domain of interest is measured by the same set of items across groups. In three of the reviewed articles (Ivanova et al., 2010; Ivanova, Achenbach, Dumenci, et al., 2007; Ivanova, Achenbach, Rescorla, et al., 2007), the term configural invariance was used mistakenly instead of dimensional invariance. Although IRT is one of the well-known DIF detection methods, our review showed that only one study (Lambert et al., 2007) specifically used IRT in testing measurement invariance. The main reason is that IRT requires two crucial assumptions, unidimensionality and local independence, to be met in order to estimate the model parameters.
For instance, as reviewed in the present study, in testing the measurement invariance of five instruments, namely the Young Mania Rating Scale (YMRS), General Behavior Inventory (GBI), Child Mania Rating Scale (CMRS), Mood Disorders Questionnaire (MDQ), and Child Behavior Checklist (CBCL), the assumptions of unidimensionality and local independence were not strictly met (McDonnell, 2010). Therefore, OLR was used as an alternative to IRT. These issues show the extent to which the results of DIF analysis could be affected by the use of different statistical methods.
Conclusion: Directions for future research
Considering the above findings and limitations, this review suggests that there is a critical need for more cross-cultural measurement invariance studies on available scales. First, studies evaluating original scales need to include three or more different ethnic/cultural groups, because with more groups there would be greater variability in the measured construct to be detected and evaluated for cross-cultural measurement invariance. Second, studies evaluating one scale across several countries need to include the original as well as its translations. This is an important aspect of cross-cultural invariance testing, because demonstrating measurement invariance for the construct of the original scale in its country of origin does not imply that the construct carries over to its translations; the original therefore has to be tested simultaneously with all versions for cross-cultural measurement invariance. Third, studies should test all available rating forms for each scale for cross-cultural measurement invariance. Demonstrating cross-cultural invariance for a self-report version of a scale does not guarantee that its proxy report is cross-culturally invariant, although there could be measurement invariance across informant reports (Dirks et al., 2014). Fourth, there is a need to further evaluate scales for which there is some evidence of invariance in order to provide clearer data on the appropriateness of their cross-cultural use, especially for the entire SDQ and ASEBA measurement systems, which are the most frequently used worldwide. Fifth, it is important to evaluate for cross-cultural measurement invariance scales that measure other psychopathological symptoms frequently present in children and adolescents. Sixth, future studies should focus on clinical samples of children with psychopathology, considering that available studies focused almost exclusively on general populations.
Finally, methods other than MG-CFA should be used more frequently, and studies that include methods based on IRT/OLR would be of substantial importance. Given that some scales showed partial measurement invariance, we believe that more data will be gained by IRT, because only those items that meet criteria for a strong measurement invariance model should be included in composite measures when scores for the scales are to be compared cross-culturally (Gregorich, 2006).
In summary, this review showed that there has been limited testing for cross-cultural measurement invariance of scales for child and adolescent psychopathology and available data are insufficient to draw conclusions regarding their cross-cultural validity. Based on the evidence, a few of the scales showed a moderate level of measurement invariance (i.e., the Fear Survey Schedule for Children-Revised, Multidimensional Anxiety Scale for Children, Revised Child Anxiety and Depression Scale, Revised Children's Manifest Anxiety Scale, Mood and Feelings Questionnaire, Disruptive Behavior Rating Scale, and Youth Self-Report), which may make them suitable in cross-cultural comparative studies. Nevertheless, more replication studies are needed with available scales that will either consider different language versions or use more rigorous methods for measurement invariance testing. With more data available on cross-cultural measurement invariance and improved practices in methods used, it would be possible to revise available scales for cross-cultural use or to develop new ones.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
