Abstract
In this study, we illustrated issues related to measurement invariance in cross-cultural research involving instrument translation between Chinese and English. We translated and back-translated the Self-Report of Personality of the Behavior Assessment System for Children, Third Edition (BASC-3-SRP) and administered it to 1,574 youth in China and 512 youth in the United States. We found that, despite a rigorous approach to achieving linguistic equivalence and statistical evidence of acceptable internal consistency and construct validity, six of the 16 BASC-3-SRP subscales lacked measurement invariance. Constructs for three of these six subscales (i.e., Negative Attitude toward School, Negative Attitude toward Teachers, and Self-Esteem) are known to be conceptualized differently in collectivistic societies, while the other three subscales (i.e., Atypicality, Sense of Inadequacy, and Hyperactivity) lacked measurement invariance without known cultural reasons. These results highlight instrument development issues and measurement invariance issues that cross-cultural researchers must grapple with.
Introduction
Conducting cross-cultural quantitative research often requires the translation of an instrument developed and validated in one language into another (Byrne, 2016). Traditionally, linguistic equivalence is the main requirement and justification for the use of such a measure. There are now well-established protocols that guide instrument translation and adaptation such as the guidelines on test adaptations from the International Test Commission (ITC, 2017). Back-translation is considered a functional approach to achieving linguistic equivalence (Brislin, 1970; ITC, 2017). To accomplish this task, the instrument is first translated into the target language by one qualified person and then separately translated back to the source language by a different qualified person. If the back-translated version closely resembles the original version, this step is considered complete. However, some researchers have recognized that cultural differences can make this process challenging for the translated instrument to properly serve the intended purpose (van Widenfelt et al., 2005).
To accomplish linguistic equivalence requires a pragmatic application of theoretically sound translation techniques. This is particularly important when the target language (e.g., Chinese) and the source language (e.g., English) belong to two language families that are drastically different in linguistic dimension, semantics, and syntactic structure (Aaronson & Ferres, 1987). In addition, the more distant the two languages are, the more likely that the cultural systems of their respective societies are significantly different as well. For instance, Chinese (Mandarin) is one of two languages that are most distant from English (Chiswick & Miller, 2008). Correspondingly, Chinese culture and American culture stand at the respective ends of the collectivism-individualism spectrum (Chiao & Blizinsky, 2010). As such, translating and back-translating an instrument between English and Chinese require a careful consideration of the cultural appropriateness of the items.
Furthermore, even after linguistic equivalence has been established, issues pertaining to construct validity and measurement equivalence between the original instrument and the translated instrument may still be present for three main reasons (Byrne, 2016; Hui & Triandis, 1985; Peña, 2007). First, it has long been argued that behaviors are organized, shaped, and structured by a society’s cultural norms and values (King, 1978). Different societies may create different conditions that foster the development of different behavioral patterns. As such, certain behaviors may be more likely to occur among individuals in one society than in another. Accurately measuring those constructs would thus require measurement items to be operationalized accordingly. This can be a major challenge and a threat to achieving measurement invariance.
Second, the latent factors that constitute a psychological construct may not align perfectly between two different cultures (Byrne, 2016; Hulin, 1987). For instance, the concept of happiness is culturally defined and factors that contribute to happiness (or the lack thereof) are inevitably culturally bound (Uchida et al., 2004). As a result, a translated happiness measure that was developed and validated in one culture might appear to have construct validity but may contain statements that are not meaningfully aligned to measure happiness in participants of another culture, while lacking items that authentically measure happiness in that culture. Such misalignment in the underlying psychological construct may sometimes produce “false positive” construct validity results.
Third, some psychological constructs that are apparent in one culture may not be as apparent in a different culture (van de Vijver & Hambleton, 1996; van de Vijver & Leung, 2000). It is common for an established measure to have been developed with little consideration for its future cross-cultural use. Subsequent attempts at cross-cultural application may face challenges regarding construct validity and the interpretability of results (Byrne & Campbell, 1999; Korabik & van Rhijn, 2018; Peña, 2007). If a construct does not have a corresponding equivalent in the target culture, construct validity and measurement equivalence issues may arise (Korabik & van Rhijn, 2018). For instance, the concept of self-esteem does not seem to have an equivalent within the Chinese language. The closest concept may be self-respect (自尊). In this case, the problem is usually not with translation; rather it is with differences in the conceptualization of self-esteem between the two cultures (Bond, 1986). For this reason alone, testing measurement equivalence is essential to ensuring that the original and the translated versions are assessing the same construct. Otherwise, cross-cultural comparison results can become biased (Cheung & Rensvold, 2000; van de Vijver & Poortinga, 1997).
Although Peña (2007) called more than a decade ago for research in child development to use measurement and statistical strategies to ensure that instruments used in cross-cultural research have sound psychometric properties, factorial structure, construct validity, and measurement equivalence, the call has not been heeded by most existing cross-cultural studies in the field of child and adolescent behavioral health. One possible explanation of this reluctance may lie in the assumption that achieving linguistic equivalence is sufficient for cross-cultural comparisons. This assumption might be reasonable when two languages are not very different. For languages/cultures that are different in fundamental ways such as Chinese/English and interdependent culture/independent culture, linguistic equivalence may not be easy to achieve and may not be enough. Recently, Byrne (2016) further highlighted how measurement non-equivalence issues stemming from various sources might create different challenges for cross-national comparisons. In this study, we followed the same line of inquiry outlined by Byrne (2016) to illustrate some of the challenges with behavioral adjustment data that we collected from youth in the United States with the third and newest edition of the Self-Report of Personality of the Behavior Assessment System for Children (BASC-3-SRP) and youth in China with the Chinese translation of the BASC-3-SRP. Our main goal was to demonstrate the need to go beyond linguistic equivalence, internal consistency, and construct validity to test measurement invariance in cross-cultural research.
Method
Research Design
For the purpose of this study (IRB# Pro00035936), we selected the most recently released edition of a comprehensive behavioral health measure that has been commonly used in U.S. schools to evaluate students’ adjustment: the third edition of the Behavior Assessment System for Children (BASC-3). We administered the Chinese translation to a large sample of Chinese students and the original version to a large sample of American students with the main goal of empirically illustrating issues relating to achieving measurement invariance.
BASC-3-SRP
Behavior Assessment System for Children, Third Edition (BASC-3; Reynolds & Kamphaus, 2015) includes a set of measures on the behaviors and emotions of children and youth. One of the assessments in BASC-3, the Self-Report of Personality (BASC-3-SRP), assesses the behavioral and emotional problems in adolescents from 11 to 21 years old. Studies have demonstrated that the earlier versions of the BASC-SRP had good psychometric qualities for youth in the United States (Frick et al., 2010; Weis & Smenner, 2007) and South Korea (Ahn & Ebesutani, 2015; Ahn et al., 2014).
The BASC-3-SRP for youth includes 189 items (59 binary items: True/False and 130 ordinal items: Never, Seldom, Often, and Always). Of the 189 items, 151 are used to form four composite scales that included 16 subscales: School Problems (three subscales), Internalizing Problems (seven subscales), Inattention/Hyperactivity (two subscales), and Personal Adjustment (four subscales). The first three composite scales measure maladjustment, with a higher score indicating poorer adjustment. The fourth composite scale measures favorable adjustment, with a higher score indicating better adjustment. The sum of item scores for each subscale was used to reflect the youth’s self-reported adjustment.
Of the 38 items that are not used to form the four composites, 34 items are used to form three validity indices to help identify responses that may lack validity: The F Index (15 items; for example, someone wants to hurt me) is designed to detect if a respondent was overly negative (i.e., faking bad), the L Index (15 items; for example, I never get into trouble) detects if a respondent was overly positive (i.e., faking good), while the V Index (four items; for example, I have not seen a car in 6 months) detects if a respondent was non-cooperative. Two additional validity indexes, Consistency Index and Patterned Responding, are designed to detect random and insincere responding. The Consistency Index identifies dissimilar responses to items that should be responded similarly, while Patterned Responding is used to identify responses that likely indicate a lack of validity because the responses were “patterned” (e.g., selecting the same response for all questions).
Translation and Back-Translation
Recognizing that achieving linguistic equivalence between English and Chinese was challenging, we took a stringent panel approach with Chinese and English bilinguals in the translation and back-translation process, following the guidelines on test adaptations from the ITC (2017). The process of generating a Chinese translation of BASC-3-SRP took 6 months. Specifically, the translation and back-translation were conducted by two separate groups, instead of the conventional approach of relying on one group. Group 1 included one Chinese PhD student who was studying Teaching English as a Second Language in the United States and two PhD students who were studying measurement and statistics in the United States. Group 2 included one Chinese PhD student who was studying School Psychology in the United States and one Chinese PhD student who was studying Counseling Psychology and Education in China. The two groups did not know each other and were led by one of the co-authors who was systematically trained in English-Chinese translation and Psychology. We followed five steps proposed by Harkness (2003): Translation, Review, Adjudication, Pre-testing, and Documentation.
Step 1: Instrument translation and back-translation
In translating the BASC-3-SRP from English to Chinese, Nida’s (1964) dynamic equivalence theory, the recommendations by Su and Parham (2002), and the ITC (2017) guidelines were used to increase the likelihood that the translated version was not only linguistically but also culturally congruent with the Chinese youth’s experiences. The English to Chinese translation was first completed by at least one member within each group and then the back-translation was done cross-group. Based on the recommendation by Brislin (1970), within each group, the task of back-translation was performed by a student who had not seen the original instrument. The translation was done pragmatically such that all necessary translation techniques were utilized to maximize the possibility of linguistic equivalence between the Chinese version of the instrument and the original instrument. One example of pragmatic translation is the English expression “the apple of my eye.” To translate it into Chinese literally would make no sense to Chinese speakers but it makes perfect sense if it is translated pragmatically into the Chinese idiom “the pearl in the palm,” which is exactly what “the apple of my eye” means to English speakers. For the BASC-3, most of the items can be directly translated and back-translated between English and Chinese to accomplish linguistic equivalence. For several items, we used Chinese idioms or culturally relevant sayings or expressions to accurately reflect what the original statement intended to assess. For instance, Item 25 (Nothing ever goes right for me) was translated pragmatically into a Chinese statement that includes an idiom “事与愿违” (what happens is the opposite of what one wants). Overall, the translation was done systematically and pragmatically to ensure that the target measure had no issues with linguistic equivalence.
Step 2: Review of translated instrument
Collectively, the research team reviewed each group’s back-translated versions of the BASC-3-SRP assessment. The review process involved comparing the two groups’ translations item by item. Where the translations differed, the more appropriate translation was selected. The translated measure was also compared against the original instrument to ensure that the reverse-scored items were translated as such so that the scoring of the items would not be disrupted.
Step 3: Adjudication
If an issue arose (e.g., lack of clarity), the senior scholar served as the adjudicator for a final decision. The process continued until all issues were addressed and a final version was created.
Step 4: Field testing
We field-tested the instrument with 12 Chinese college students, six from a university in Southern China and six from a university in Northeastern China. Each student was provided the translated instrument, as well as a feedback form. Based on their comments, further revision was made to Item 92 (My teacher is proud of me). Feedback suggested that students in Chinese classrooms would typically not know whether their teacher is proud of them. We slightly rephrased this item and translated it into the statement “老师在课堂上表扬我” (My teacher praises me in class), to best capture what the original statement was intended to measure.
Step 5: Finalizing the translated instrument
We shared the finalized version of the BASC-3-SRP with one of the authors of BASC-3 assessment.
Overall, these procedures gave us a high degree of confidence in the linguistic equivalence between the original instrument and its Chinese translation.
Participants
The Chinese sample (N = 1,585) was recruited from middle and high schools in a middle-sized city in Western China, one university in a large city in Northern China, and one university in a large city in Southern China. In recruiting the participants, attention was paid to identify a participant pool that varies greatly in socioeconomic status (SES). Prior to data collection, teachers were contacted for cooperation to recruit participants. The teachers were provided with a brief explanation of the purpose of the project and were asked to distribute a survey link to their students. Students who were interested in participating in the study were instructed to access the survey online and were given the option to check “I do not wish to participate” or to stop during any part of the survey without any negative consequences.
The participants in the United States (N = 573) included a group of youth who were recruited from public and private schools and several universities in a southeastern state (n = 317), as well as a group of participants who were born in China but were adopted in infancy and are growing up in the United States (n = 256). In recruiting the American participants who were not adopted, close attention was paid to recruiting from schools located in communities that varied widely in SES and racial diversity. Specifically, schools in two county districts were rank-ordered according to the median household incomes of the zip codes of the schools. Then, 10 schools from each median income quartile were randomly selected to target for recruitment. With the cooperation of 34 schools, information about the study was distributed to the students by their teachers. Students were given information about the study and the informed consent form for their parents to sign if they were interested in participating in the study. Upon receiving parental consent and student assent (for those under 18 years) or students’ consent (for those who were 18 years old or older), data were collected individually at the participant’s school.
The group of American youth who were born in China but are growing up in the United States were from a larger longitudinal study on American families that had adopted children from China. In recruiting these adopted youth, an email was first sent to parents who had previously participated in the study to obtain basic information about the number of children they had and each child’s age. For families with children who were 11 years or older, the parents were asked if they would consent to a child survey prepared for their children. Surveys were requested for 385 children. A personalized survey link was then sent to the parents, along with a request for the parents to pass the link along to the child. Most of these youth were girls because the vast majority of Chinese children available for international adoption are girls.
Data Analysis Plan
Prior to data analysis, we consulted one of the authors of BASC-3-SRP (i.e., the author with whom we had shared the translated version) to obtain a recommendation on which validity indices to apply to identify potentially invalid responses. Following his recommendation, we applied the V Index (which included four items that were most likely untrue or would rarely happen; for example, I have just returned from a 9-month trip on an ocean liner). Participants who scored 3 or higher (i.e., selected “always” in one particular item or selected “True” for two or more of the other three items) were excluded. This step resulted in a usable sample of 2,086 youth: 1,574 in China (male: 514, female: 1,060) and 512 in the United States (male: 79, female: 433). Finally, less than 1.4% of the data were missing for each item, and no pattern of missingness was observed.
Internal consistency
SPSS version 22.0 was used to investigate the internal consistency of BASC-3-SRP and the translated instrument with American and Chinese samples, respectively. We evaluated the internal consistency of the 16 subscales using Cronbach’s alphas. Pairwise deletion method was adopted to handle the missing data.
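As a point of reference, the coefficient reported here can be computed from raw item scores as sketched below. This is a minimal illustration rather than the SPSS procedure used in the study: the function name `cronbach_alpha` and the toy responses are hypothetical, and for simplicity the sketch drops any respondent with a missing item (listwise deletion) instead of the pairwise deletion described above.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents' item scores.

    rows: list of equal-length lists, one per respondent; None marks a
    missing answer. Respondents with any missing item are dropped
    (listwise deletion, an assumption of this sketch).
    """
    complete = [r for r in rows if all(v is not None for v in r)]
    k = len(complete[0])  # number of items in the subscale
    # Sample variance of each item across respondents.
    item_vars = [variance([r[j] for r in complete]) for j in range(k)]
    # Sample variance of the summed subscale score.
    total_var = variance([sum(r) for r in complete])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Hypothetical responses to a three-item subscale scored 0-3.
    toy = [[0, 1, 0], [1, 1, 2], [2, 3, 2], [3, 3, 3], [1, None, 2]]
    print(round(cronbach_alpha(toy), 2))
```

A value above .70 would meet the criterion applied to the subscales in this study.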
Construct validity
To evaluate the construct validity of each subscale and the composite scale, confirmatory factor analysis (CFA) was conducted using Mplus 8 (Muthén & Muthén, 1998-2017). Pairwise deletion and robust weighted least squares estimation were adopted by default due to the categorical nature of the responses. Given that the proportion of missing data is very small (less than 1.4% per item) and no specific pattern of missingness emerged, the impact of pairwise deletion on the results is expected to be inconsequential. Three model fit indices can be used to infer construct validity: chi-square fit statistic, comparative fit index (CFI) ≥.90, and root mean square error of approximation (RMSEA) ≤.08 (Bentler & Bonett, 1980; Brown, 2006). Because chi-square fit statistics tend to be highly sensitive to large sample sizes (Kenny & McCoach, 2003), in the current analysis, we mainly relied on CFI and RMSEA to evaluate construct validity.
Measurement invariance
Measurement invariance is a prerequisite for cross-cultural comparisons (Byrne, 2016). According to Millsap (2011), measurement invariance holds when participants who are at the same level of the latent construct endorse the same response category on the observed measure, regardless of the groups that they belong to. To determine whether BASC-3-SRP items were interpreted in a conceptually similar manner between Chinese youth and American youth, we followed the recommendations of Muthén and Muthén (1998-2017) to perform a multi-group CFA (MGCFA) to determine measurement invariance with Mplus 8. A MGCFA allows researchers to examine whether participants from the two groups interpret items from the same measure in a conceptually similar way (Bialosiewicz et al., 2013). If a subscale failed to demonstrate measurement invariance, we then performed alignment tests to identify the degree of non-invariance and the source of non-invariance.
To determine measurement invariance, we constructed, for each subscale, a hierarchical set of three models: a configural (baseline) model, a metric model, and a scalar (threshold) model. The configural model is the baseline model, in which loadings and thresholds are freely estimated. It is used as a reference model for subsequent model comparisons. The configural invariance test determines if the overall factor structure of the measure fits well for different groups. The configural model assumes that the factorial structure is identical but the parameter estimates are allowed to be different between groups with minimal constraints for model identification. If the configural model exhibits a good fit (i.e., CFI ≥.90 and RMSEA <.08), configural invariance is established; that is, the patterns of factor loadings are equivalent between the Chinese youth and the American youth.
The metric model determines if the factor loadings are equivalent across the groups. It assumes that the relations between items and corresponding latent variables are identical in the two groups. In the metric models, the loadings were constrained to be equal between the two groups, while the thresholds were not constrained. If there is no difference in model fit between configural and metric models, then it can be concluded that metric invariance is established.
The scalar (threshold) model was used to determine threshold invariance between the two groups, by constraining all thresholds to be equal across the two groups. This model determines whether item thresholds (the counterpart of item intercepts for categorical items) are equivalent across groups. Similarly, if there is no difference in fit between metric models and scalar (threshold) models, then the scalar (threshold) invariance is established. Because the metric model is nested within the configural model, and the scalar (threshold) model is nested within the metric model, we applied the criteria for nested model comparisons: the more constrained model is retained when the decrease in CFI (ΔCFI) is equal to or less than .01 and the increase in RMSEA (ΔRMSEA) is equal to or less than .015 (Chen, 2007; Cheung & Rensvold, 2002). For chi-square difference tests, DIFFTEST in Mplus with the weighted least squares estimator with means and variances adjusted (WLSMV) and Delta parameterization was used because a regular chi-square difference test could not be conducted with the chi-square values obtained from the WLSMV estimation.
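The decision rules described above can be expressed compactly. The following is a minimal sketch, not the Mplus workflow used in the study: the names `Fit`, `fit_acceptable`, and `invariance_holds` are hypothetical, the absolute-fit check uses the cutoffs CFI ≥ .90 and RMSEA ≤ .08, and the example values are the self-reliance fit indices reported in the Results section.

```python
from typing import NamedTuple


class Fit(NamedTuple):
    """Fit indices of one model (hypothetical container for this sketch)."""
    cfi: float
    rmsea: float


def fit_acceptable(fit, cfi_min=0.90, rmsea_max=0.08):
    """Absolute fit check: CFI >= .90 and RMSEA <= .08."""
    return fit.cfi >= cfi_min and fit.rmsea <= rmsea_max


def invariance_holds(less_constrained, more_constrained,
                     max_cfi_drop=0.01, max_rmsea_rise=0.015):
    """Nested comparison: retain the more constrained model when CFI
    drops by no more than .01 and RMSEA rises by no more than .015."""
    # Round to three decimals so that floating-point noise in reported
    # two-decimal indices does not flip a borderline comparison.
    cfi_drop = round(less_constrained.cfi - more_constrained.cfi, 3)
    rmsea_rise = round(more_constrained.rmsea - less_constrained.rmsea, 3)
    return cfi_drop <= max_cfi_drop and rmsea_rise <= max_rmsea_rise


if __name__ == "__main__":
    # Self-reliance subscale, values as reported in the Results section.
    configural = Fit(cfi=0.96, rmsea=0.09)
    metric = Fit(cfi=0.98, rmsea=0.08)
    scalar = Fit(cfi=0.97, rmsea=0.07)
    print(fit_acceptable(configural))            # configural fit check
    print(invariance_holds(configural, metric))  # metric step
    print(invariance_holds(metric, scalar))      # scalar (threshold) step
```

Note that a negative ΔCFI or ΔRMSEA (fit improving under added constraints) passes the rule trivially, which is the situation reported for the self-reliance subscale.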
Finally, because exact invariance (i.e., all factor loadings and thresholds are identical between groups) could be too stringent for cross-cultural comparisons (van de Schoot et al., 2013), we performed alignment tests to examine approximate invariance for subscales that did not show metric or scalar (threshold) invariance. An alignment test shows the degree of non-invariance and identifies non-invariant parameters. The analysis begins by constructing a configural model with factor means set to 0 and variances to 1 for both groups, then frees the means and variances to increase alignment in the factor loadings and thresholds. Because an alignment test is based on results from the configural model, we only performed alignment analyses for subscales that showed configural invariance but not metric and scalar (threshold) invariance. A non-invariance rate of less than 25% was used as a rule of thumb for approximate measurement invariance (Asparouhov & Muthén, 2014; Flake & McCoach, 2018). That is, when a subscale had a non-invariance rate of 25% or less, we considered the construct comparable between U.S. and Chinese samples. This practice is similar to establishing partial invariance when the exact invariance is not satisfied, although approximate invariance and partial invariance are conceptually and statistically very different. Partial invariance also allows factor mean comparisons between groups by relaxing non-invariant parameters to be different between groups when there are at least a number of invariant items (Byrne, 2006).
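The 25% rule of thumb can be illustrated with a short sketch. The per-parameter flags would in practice come from the alignment output; here they are hypothetical, as are the function names, and the cutoff is treated as inclusive (25% or less), which is one of the two readings given above.

```python
def noninvariance_rate(flags):
    """Proportion of parameters flagged as non-invariant.

    flags: list of booleans, True where a loading or threshold was
    identified as non-invariant between the two groups.
    """
    return sum(flags) / len(flags)


def approximately_invariant(flags, limit=0.25):
    """Rule of thumb: approximate invariance when the non-invariance
    rate does not exceed the limit (inclusive cutoff assumed here)."""
    return noninvariance_rate(flags) <= limit


if __name__ == "__main__":
    # Hypothetical subscale with nine loadings, one flagged non-invariant.
    loading_flags = [True] + [False] * 8
    print(f"{noninvariance_rate(loading_flags):.2%}")
    print(approximately_invariant(loading_flags))
```

Loadings and thresholds are evaluated separately in this study, so the check would be applied once to each set of parameters for a given subscale.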
Results
Descriptive Statistics
Table 1 presents descriptive statistics for the 16 subscales of the BASC-3 in the Chinese and U.S. samples. The scores of each subscale in the Chinese sample were approximately normally distributed with skewness ranging from −0.76 to 0.86 and kurtosis ranging from −0.47 to 0.85. For the U.S. sample, the skewness ranged from −1.28 to 1.76 and kurtosis ranged from −0.50 to 3.29, both of which are acceptable. Because some items’ wordings were very similar, the errors of such items were correlated (e.g., item 11 “I often feel sick in my stomach” with item 56 “My stomach gets upset more than most peoples”) for participants in both countries. Cronbach’s alphas were acceptable based on the criterion α >.70, demonstrating good internal consistency of the 15 subscales for both groups, except for the subscale external locus of control (α = .64) in the Chinese sample.
Table 1. Descriptive Statistics for BASC-3-SRP Subscales for the Chinese and U.S. Samples.
Note. BASC-3-SRP = Self-Report of Personality of the Behavior Assessment System for Children.
CFA
As shown in Table 2, the CFA models for the 16 subscales revealed an acceptable model fit (i.e., CFI >.90 and RMSEA <.08), thus providing support for their construct validity.
Table 2. Model Fit for BASC-3-SRP Subscales for the Chinese and U.S. Samples.
Note. BASC-3-SRP = Self-Report of Personality of the Behavior Assessment System for Children; CFI = comparative fit index; RMSEA = root mean square error of approximation; χ²(df) = chi-square value (degrees of freedom); CI = confidence interval.
Measurement Invariance
The model fit statistics of configural, metric, and scalar (threshold) invariance models for each subscale are presented in Table 3. The configural models of seven subscales (i.e., negative attitude toward school, negative attitude toward teachers, atypicality, sense of inadequacy, hyperactivity, self-reliance, and self-esteem) failed to fit the data adequately (CFI <.90 or RMSEA >.08). Among the seven subscales, self-reliance (CFI = .96 and RMSEA = .09 for configural invariance) showed improved model fit with more restrictions (metric model: CFI = .98 and RMSEA = .08; threshold model: CFI = .97 and RMSEA = .07). Thus, we considered the self-reliance subscale to have met threshold invariance. Hence, in total, six subscales did not satisfy configural invariance.
Table 3. Summary of Measurement Invariance Testing With Configural, Metric, and Scalar Models Between the U.S. and Chinese Samples.
Note. 1: negative attitude toward school; 2: negative attitude toward teachers; 3: sensation-seeking; 4: atypicality; 5: somatization; 6: sense of inadequacy; 7: social stress; 8: external locus of control; 9: Depression; 10: Anxiety; 11: attention problems; 12: Hyperactivity; 13: self-reliance; 14: relationship quality with parent; 15: self-esteem; and 16: peer relations. The subscales that met the scalar (threshold) invariance are bolded. CFI = comparative fit index; RMSEA = root mean square error of approximation.
For the nine subscales for which configural invariance was satisfied, together with the self-reliance subscale, metric invariance was then tested. For every subscale except Somatization (p = .15), the p-value of the DIFFTEST was smaller than .05, indicating that the configural model fitted the data better than the metric model. However, based on the values of ΔCFI and ΔRMSEA between the configural and metric models, nine of the 10 subscales held metric invariance; the exception was sensation-seeking (ΔCFI = 0.02 and ΔRMSEA = −0.01). Then, scalar (threshold) invariance was examined for the nine subscales. Results showed that seven of the nine subscales demonstrated scalar (threshold) invariance; the exceptions were external locus of control (ΔCFI = 0.02 and ΔRMSEA = 0) and peer relations (ΔCFI = 0.02 and ΔRMSEA = −0.01).
Alignment tests were then conducted for sensation-seeking, external locus of control, and peer relations. The proportions of non-invariance for loadings by items of the three subscales between U.S. and Chinese groups were 11.11%, 12.50%, and 22.22%, respectively, smaller than the conventional upper limit of non-invariance (25%). Similarly, the proportions of non-invariance for thresholds by items were 20.00%, 18.75%, and 19.05%, respectively. Thus, the results of alignment tests indicated that approximate loading and threshold invariance were satisfied for the three subscales. Therefore, we considered the means of these three subscales between U.S. and Chinese samples could be meaningfully compared.
In sum, our analyses showed that of the 16 BASC-3-SRP subscales, six (i.e., negative attitude toward school, negative attitude toward teachers, atypicality, sense of inadequacy, hyperactivity, and self-esteem) did not demonstrate sufficient evidence for measurement invariance between Chinese youth and American youth. For the other 10 subscales, there was sufficient evidence that measurement invariance held for the two groups.
Discussion
The main goal of our study was to illustrate the need for an additional step of measurement invariance testing to ensure that the two versions of an instrument are measuring the same constructs in the participants from two different cultures. As demonstrated by results from our data, even after linguistic equivalence was achieved through a translation and back-translation protocol that was more stringent and rigorous than what is typically done, and after construct validity of the measures was established, an important issue pertaining to measurement invariance remained. That is, six of the 16 subscales did not show measurement invariance because they did not satisfy the baseline configural invariance. Because items for the six subscales that failed to demonstrate measurement invariance are not more difficult to translate and back-translate than the items for the 10 subscales that demonstrated measurement invariance, item translation is unlikely to be the cause of the measurement non-invariance. Rather, cultural differences and other explanations are more likely contributing to the measurement non-invariance. Among the six subscales that lacked measurement invariance, Negative Attitude toward School, Negative Attitude toward Teachers, and Self-Esteem have been documented to be strongly culturally dependent. As such, our findings corroborate existing research on cross-cultural differences in important psychological constructs that influence youth socialization. In terms of attitude toward school and teacher, teacher–student interaction is often defined by the cultural context. According to Hofstede (1986), there are differences between collectivistic societies and individualistic societies in expected teacher–student interactions because of differences in the social positions of teachers and students.
In looking at the items that the BASC-3-SRP uses to measure students’ attitudes toward school and teachers, most of the items focus on the social aspects of school (e.g., I feel safe at school; I get along with my teacher) rather than the instructional aspects of schooling and teacher–student interactions. These items might not be the most influential on Chinese youth’s attitudes toward school or teachers. Thus, how the BASC-3-SRP operationalizes students’ attitudes toward school and teachers might not be culturally appropriate for Chinese youth. Similarly, studies have documented challenges in applying Chinese translations of highly regarded self-esteem measures to Chinese participants. For instance, when the Chinese translation of the Rosenberg Self-Esteem Scale was administered to Chinese participants, Item 8 (I wish I could have more respect for myself) showed poor model fit (e.g., Bush et al., 2002; S.-T. Cheng & Hamid, 1995). In addition, contrary to the Western conceptualization that self-esteem is a unidimensional construct, some scholars have argued that in collectivistic societies, self-esteem has a positive dimension and, dialectically, a negative dimension (Farh & Cheng, 1997). Furthermore, in collectivistic societies, youth typically do not strive to develop an overly positive view of self (Heine & Lehman, 1997) and are usually comfortable with an ambivalent self-evaluation (Spencer-Rodgers et al., 2004). As such, the conceptualization of positive self-evaluation as the defining feature of self-esteem might not be consistent with a collectivistic notion of self-esteem, which emphasizes modesty, humility, and consideration for others’ feelings (Bond, 1986). In fact, some researchers have suggested that in collectivistic societies, positive self-evaluation without a balancing negative self-evaluation is a type of maladjustment (Bond, 1986).
In addition, according to Kwan et al. (2009), in both Chinese and American cultures, high self-esteem is derived from positive perception of self and others, attributional style, significant accomplishments, and favorable self-perception. However, the relative importance of these sources differs between China and the United States. As such, items in a self-esteem measure developed in the United States might not be properly balanced across these sources for use in Chinese society.
Interestingly, a lack of measurement invariance may also occur for constructs that are not (yet) known to be incongruent with collectivistic culture. In our study, we found that the subscales of Atypicality, Sense of Inadequacy, and Hyperactivity had good linguistic equivalence, internal consistency, and construct validity and yet still failed to demonstrate measurement invariance. In assessing Atypicality, the BASC-3-SRP has 10 items that focus on symptoms of paranoia (e.g., someone wants to hurt me), hallucination (e.g., I hear things that others can’t hear), and obsession (e.g., I do things over and over and can’t stop). In assessing Sense of Inadequacy, the 12 items of the BASC-3-SRP largely focus on behaviors that reflect a sense of helplessness (e.g., I want to do better but I can’t). In assessing Hyperactivity, the BASC-3-SRP includes two categories of items: difficulties with regulating the amount or timing of talking and difficulties with kinesthetic regulation (motor activity). These operationalizations do not seem to deviate much from Chinese cultural norms. Nonetheless, some have argued that because societal expectations play an important role in the conceptualization of socially undesirable behaviors such as hyperactivity, maladaptive behaviors should be understood as cultural constructs (Kirmayer & Bhugra, 2009; Timimi & Taylor, 2003). For instance, the literature indicates that hyperactive and disruptive behaviors in children differ significantly across cultures (Mann et al., 1992). This may suggest that the items used in the BASC-3-SRP do not capture the behavioral manifestations of hyperactivity among Chinese children and youth, possibly because some of the items are not as applicable to Chinese school settings as they are to Western schools.
In addition, cross-cultural differences exist in the perception of the causes of mental health problems such as sense of inadequacy and atypical cognition (e.g., hearing voices that others can’t hear) (Abdullah & Brown, 2011). These differences may have contributed to how Chinese and American youth responded to the questions measuring these constructs. Byrne (2016) highlighted that measurement invariance issues may arise from methodological biases (e.g., sample bias, differential responses to the measurement items between groups, or between-group differences in how the measure is administered), from the same items eliciting different responses from participants in different groups, and from construct biases, where the items used to measure a construct do not completely overlap with behavioral expressions of that construct in different cultures. These problems may have contributed to the lack of measurement invariance between the Chinese sample and the U.S. sample. However, examining them in depth is beyond the scope of this study. More research will be needed to determine why these constructs lack measurement invariance despite conceptual similarities and the absence of apparent cultural differences.
Finally, our analysis unexpectedly showed that although the concept of locus of control differs clearly between individualistic and collectivistic cultures (C. Cheng et al., 2013), measurement invariance was supported in our data. The idea that a construct with a well-established cultural difference can demonstrate measurement invariance has not been commonly discussed within the literature. If future research replicates our finding, this would further complicate research that focuses on cross-cultural comparisons. Within a collectivistic society, normative locus of control tends to be more external than in an individualistic society (Hamid, 1994), and an external locus of control is not perceived as negatively in collectivistic societies as in individualistic societies (C. Cheng et al., 2013). In other words, an external locus of control within the Chinese cultural context does not imply poor adjustment. In fact, Chinese individuals are more likely than people in individualistic societies to perceive and accept that external forces have control over their lives and are less likely to see that as problematic (C. Cheng et al., 2013). However, in the BASC-3-SRP, an external locus of control is conceptualized as one of seven subtypes of internalizing problems. The eight items (e.g., what I want never seems to matter) that the BASC-3-SRP uses to measure external locus of control are primarily informed by external attributional style. While this operationalization itself is conceptually problematic within the Chinese cultural context, evidence of measurement invariance raises an important question about the meaningfulness of comparing the results between Chinese youth and American youth because an external locus of control is not necessarily an indicator of maladjustment within Chinese society in the first place.
Overall, our study demonstrates that establishing linguistic equivalence and construct validity does not guarantee measurement invariance. A lack of measurement invariance may occur for constructs that are deeply rooted in one cultural system but not in another (e.g., self-esteem and attitudes toward teachers and school), as well as for constructs that do not seem to have a strong cultural difference (e.g., atypicality and sense of inadequacy). Some of the problems may lie in instrument development. For instance, in the process of generating a pool of potential items for a given construct, more attention could have been devoted to identifying items that have broader cross-cultural interpretability. Furthermore, including scholars from different cultural backgrounds in instrument development will likely help reduce the risk of producing instruments that lack sound cross-cultural utility. We recommend that future instrument development place a stronger emphasis on potential cross-cultural applications.
Implications and Limitations
To achieve measurement invariance, both the constructs and the measurement instruments need to operate equivalently across cultures (Byrne, 2016). Because Chinese culture and U.S. culture differ significantly in how the self is constructed (Bond, 1986), it may be a tall order for the BASC-3-SRP to demonstrate measurement invariance for all 16 constructs in the Chinese sample. Our finding that 10 of the 16 subscales demonstrated measurement invariance suggests that it is possible to develop measures that have cross-cultural utility. Nonetheless, the evidence that six subscales lacked measurement invariance prompts us to ask how instrument development can address the issue that some psychological constructs are deeply rooted in one cultural system but not as deeply rooted in another. Adequately measuring such constructs is not feasible without making changes to their latent factors, which raises interpretability questions for cross-cultural comparisons. This is an issue that the field of cross-cultural psychology will likely continue to grapple with. More research is urgently needed to probe deeper into how to meaningfully measure cross-cultural similarities and differences.
Findings from our study need to be interpreted in light of its limitations. Because our data were obtained from volunteers, we do not know whether the results are generalizable to other youth. In addition, we field-tested the Chinese translation of the BASC-3 with college students only; not field-testing it with younger youth could be a limitation of this study.
Footnotes
Acknowledgements
The authors would like to acknowledge Sy-Woei Hao, a PhD student in the Educational Psychology program at the University of South Florida, for her proofreading of the final version of the manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The project was supported in part by the University of South Florida Nexus Initiative Award and by the University of South Florida College of Education Mini Grant.
