Abstract
Measures of the same phenomenon should produce the same results; this principle is fundamental because it allows for replication—the basis of science. Unfortunately, measures of a psychological construct in one language can often measure something a bit different in another language (i.e., low “scale equivalence”). Historically, the problem was thought to stem from insufficient knowledge of best-practice translation procedures. Yet solutions based on this diagnosis and their widespread adoption have not resolved the issue. In this article, we suggest that an additional problem might be insufficient information about the measure being translated. If so, low scale equivalence is a problem that translators and cross-cultural psychologists cannot solve on their own. We explore the possibility that measure-specific translation guides be created by original scale builders for the most widely used measures of important psychological constructs. We describe why such guides are needed, when they are needed, what they might look like, their feasibility, and next steps, providing a complete example guide and test case in a supplement concerning the Primals Inventory. In this article, we seek to spark discussion on translation practices happening behind the scenes and how greater transparency can improve scale equivalence, in the spirit of open science.
Scale building in psychology consists of two main tasks: writing items and empirically validating them (e.g., DeVellis, 2016). Empirical validation involves activities familiar to scientists generally, such as data collection and statistical analyses, with sufficient methodological transparency to allow replicability. Item writing, in contrast, can be a messy wordsmithing exercise. It requires skills that overlap less with those of other scientists and more with those of philosophers, poets, and marketers. Although informed by standardized techniques (e.g., cognitive interviewing; Willis, 2004), item-writing decisions are ultimately subjective, and descriptions thereof can feel out of place in scientific articles. Indeed, most scale-building articles spend pages on empirical item validation and a scant line or two on item writing, if any (e.g., the Satisfaction with Life Scale: Diener et al., 1985; the Primals Inventory: Clifton et al., 2019). This opacity on item-writing decisions, despite its central importance, leads to three predictable consequences: First, item-writing skills are undervalued (e.g., Sheatsley, 1983). Second, they are harder for researchers to develop despite excellent articulations of best-practice item-writing principles (DeVellis, 2016; Sheatsley, 1983).
Third, and most relevant here, any attempt to replicate item writing in a new cultural or linguistic context involves considerable guesswork, leaving scale translators/adaptors 1 in a lurch. Although it is easy for translators to run the same empirical-validation process (e.g., the same exploratory factor analysis with the same rotation) on items written in different languages—just “plug and chug”—how to replicate an item’s communicated message in a new language, pinpointing the same phenomenon at the same item-difficulty level (i.e., good item translation), is exceedingly challenging (e.g., Matsumoto & van de Vijver, 2011). Ironically, this is precisely where original scale-building articles offer the least guidance. If scale builders do not detail the pitfalls that they took pains to avoid in the creation of items, none should be surprised when scale translators fall into these pits unwittingly. With this article, we seek to spark discussion on the item-writing and translation processes happening behind the scenes and how greater transparency can improve scale equivalence, in the spirit of open science.
The Literature Gap
Two existing guidance literatures
To improve item writing during translation, two loosely defined guidance literatures have emerged (Table 1), although sometimes a single article provides both guidance types (e.g., Hambleton & Lee, 2013). Both are written by scale-translation-methodology experts, distilling insights applicable across scales—what we call “scale-generic”—from the broader translation-methodology literature. They are meant to aid individuals taking on particular translation projects, who are often not translation-literature experts. Both types are explicitly motivated by the goal of improving scale equivalence, defined as “the level of comparability of measurement outcomes,” in cross-cultural research through improved translation practices (Matsumoto & van de Vijver, 2011, p. 8; e.g., Beaton et al., 2000).
Three Translation-Guide Literatures Aiming to Improve Scale Equivalence
Only in comparison because scale-generic content guides alert the translator to issues that may or may not be relevant.
The first type is scale-generic process guides. These provide step-by-step instructions on how to execute a standard translation process (e.g., Beaton et al., 2000; Borsa et al., 2012; International Test Commission, 2017; van de Vijver & Hambleton, 1996). They are user-friendly because the vast majority of steps are relevant to the vast majority of translation efforts. However, they provide little guidance about how to translate actual scale items (i.e., picking the best words).
To that end, the second type is scale-generic content guides (e.g., Hambleton & Zenisky, 2011; Johnson et al., 2011), which alert translators to general challenges they might encounter while translating item content. For example, Hambleton and Zenisky (2011) compiled 25 issues often faced by scale translators, such as when a colloquialism in the source language has no equivalent in the target language. 2 These guides are, comparatively, not user-friendly because principles vary in relevance to particular translation efforts. Scale-generic content guides alert translators to potential issues (e.g., idioms) but help less once an issue is identified (e.g., translating that idiom). Discussion of each principle is brief, so those of critical relevance to a particular translation often require a deeper dive into the methodology literature, which in practice may not occur.
Despite the development of these two literatures to address scale equivalence in cross-cultural research, low scale equivalence remains pervasive. For example, in a systematic review of 26 child and adolescent psychopathology assessment scales, “none of the evaluated scales have strong evidence for cross-cultural validity and suitability for cross-cultural comparison” (Stevanovic et al., 2017, p. 126). Low scale equivalence persists even though scale-generic process guides have roundly succeeded in disseminating a standardized process (Chen, 2008). A recent methodological review of 78 health instruments “revealed that cultural adaptation followed similar steps with few variations” and that “most follow the process based on Beaton et al. (2000)” (Arafat et al., 2016, p. 129). In other words, scale translators have successfully organized themselves to try a solution (although subdisciplines vary) that, seemingly, has not resolved the problem.
Diagnosing the problem
When treatment fails to improve symptoms, sometimes the problem is misdiagnosis. In this case, the value of both guidance literatures is premised on a diagnosis of the core problem as a lack of information about generic translation principles and procedures. A further cause, however, might be lack of information about the particular scale being translated. Indeed, it is a logical leap to expect that scale-generic advice can satisfactorily resolve equivalence issues if most scale-building efforts present both common and unusual challenges. 3 This notion is axiomatic throughout the scale-creation literature, in which generic advice about writing new items is consciously brief (Sheatsley, 1983). As DeVellis (2016) wrote, “Different variables call for different assessment strategies” (p. 7). If so, instead of more widespread adoption of scale-generic translation guides, researchers might try something different.
Current practice
In some ways, we propose simply to extend the way construct- and scale-specific advice is currently transferred from scale creators to translators, making it more formal, detailed, accessible, and incentivized. A caveat is that no empirical metascience research has surveyed translators to outline common practice, and considerable variation in practice exists. However, when researchers discover a new scale in a foreign language relevant to their outcomes of interest, they often start by emailing original scale builders for advice (e.g., Stahlmann et al., 2020), as is encouraged by most scale-generic process guides (including Beaton et al., 2000). What typically happens next has been mentioned only in passing in the literature (Herdman et al., 1998; Wild et al., 2005), but we expect most readers with translation experience will recognize the patterns. Our characterization is based on the anecdotal experience of the coauthors and colleagues around the world. We suspect it is typical of translation efforts in social psychology, personality psychology, positive psychology, and, to a lesser extent, clinical psychology.
In practice, the most common result of initial translator inquiries is no response from the original scale authors, followed in frequency by a brief, informal email suggesting a handful of ideas that occur in the moment (e.g., “Be careful how you translate X”; B. Arnold, personal communication, February 14, 2022). After multiple translator inquiries, more conscientious scale builders begin informally aggregating their notes into “frequently asked questions.” Although helpful, notes are often spotty. They are rarely the product of a careful and thorough review of challenges that emerged during original scale creation or previous translation processes. Original item writing usually happened years prior, and unlike the thoughtfully written, published record of empirical validation efforts, a detailed account of original item-writing choices is unlikely to exist. Many insights are simply forgotten. For example, when asked why a particular phrase was used, original scale creators may remember discussing the phrase with colleagues but not the content of those discussions, leading to responses such as, “I’m not sure why we said it that way, but I think we were probably concerned that . . .” Furthermore, scale builders may not have made clear-eyed item-writing choices to begin with. Herdman and colleagues (1998) noted that translator inquiries often motivate scale creators to “clarify the aims of their original version” (p. 328).
On rarer occasions, in response to numerous inquiries, informal scale-specific translation guidance documents can emerge, with newly remembered details accumulating such that each additional translation team is given some new pieces of advice. Documents are never subject to peer review, although it might lead to increased utility, clarity, or comprehensiveness regarding the most critical translation issues. Unbiased feedback from translators is infeasible, sometimes because of power dynamics between scale creators and scale translators, accentuated when the former is Western and, on average, more direct in communication style (e.g., Sanchez-Burks et al., 2003). Translators often cite personal communication with original scale creators, but reviewers cannot judge whether private guidance was followed. Even at their most developed, such documents remain spartan and, lacking a publication outlet and basic standards for format or rigor, are informally shared among colleagues or self-published on personal websites (e.g., ResearchGate). This is true even for scales translated dozens of times that serve as the foundation for large and important literatures on theoretically universal psychological processes.
For example, the Moral Foundations Questionnaire (MFQ; Graham et al., 2009) measures the five values of moral-foundations theory, a prominent theory in social psychology. According to moralfoundations.org, the MFQ has been translated into 39 languages (checked June 2021), the basis for a large literature on presumably universal values. Anticipating future translations, Haidt (2010) wrote a roughly one-page blog post on MFQ translation issues. It mainly describes how translators might consider the shades of meaning associated with a few terms, such as “respect,” “authority,” and “purity,” and might avoid strict translation. This is both much more guidance than typically provided and much less than we propose. For example, no subscale- or item-specific advice was included. (The link is also now inactive, underscoring access issues.) Iurino and Saucier (2020) recently tested measurement invariance of the MFQ across 27 countries. They concluded, “We were not able to replicate Graham et al.’s (2009) results indicating that a five-factor model is a suitable approach to modeling the moral foundations” (p. 370). All MFQ translations were created via standard translation processes by translators familiar with scale-generic translation-guidance literature (Iurino & Saucier, 2020). It is unclear whether the low scale equivalence resulted from genuine differences in latent variables across cultures or from mistranslation (see Note 3).
Another emerging example involves the Primals Inventory (Clifton et al., 2019). The Primals Inventory measures basic beliefs about the world called “primal world beliefs” (or “primals”), such as the beliefs the world is dangerous, interesting, abundant, and so forth, that are thought to shape many clinical and personality variables (e.g., depression and curiosity). In this case, when original scale authors (including first and senior authors on this article) began receiving inquiries from prospective translators, the same type of short, informal, accumulating ad hoc guidance document described above emerged. The first few translation processes suggested translators were falling into predictable traps that were construct- and scale-specific (described in the Supplemental Material available online) and could evade detection during back-translation. After 12 translators had reached out and substantial international measurement work on primals became likely, the value of detailed, systematic guidance coauthored by original scale creators and initial translation teams became clear.
Motivation to produce a guide, however, was low. As empirical scientists, unmasking numerous subjective decisions central to scale validity is an uncomfortable prospect, let alone providing those considerations as a basis for recommendation. There was also little professional incentive given the lack of quality publication outlet (confirmed afterward by manuscript rejections). Indeed, to our knowledge, no detailed scale-specific translation guide has ever been published by any peer-reviewed psychology journal. Finally, we had little idea what a rigorous, high-quality, translator-friendly, scale-specific guide might look like.
The proposed third guidance literature
In sum, a gap currently exists in the methodological literature: a lack of accessible, quality information on scales being translated. Translators typically cannot reliably learn (a) how and why items were written as they were, (b) which issues from the broader scale-translation literature are most relevant, and (c) how relevant issues might be addressed for that particular scale. To fill this gap, in this article, we discuss how the two existing types of scale-generic translation guides (Table 1) might be supplemented by a third type of scale-specific translation guide. Instead of translation-methodology experts, original scale creators would primarily author these guides, perhaps with input from translation experts and teams. Scale-specific guides would be user-friendly, highlighting only relevant insights from the broader scale-translation methodology literature and walking the translator step-by-step through scale content (including a comprehensive table of items with translation-relevant annotations, as is currently created by translation experts in some medical research; B. Arnold, personal communication, February 14, 2022). Guides would focus on the most critical item-writing decisions for replicating item meaning in a new language. As a rule, these guides would not provide generic translation guidance (e.g., Beaton et al., 2000) but supplemental construct-, scale-, subscale-, and item-specific insights only, which original scale creators are best positioned to provide. In the balance of this article, we discuss (a) conditions justifying the creation of scale-specific translation guides, (b) the form these guides might take (a suggested five-part format is illustrated by a full-length exemplar in the Supplemental Material), (c) the feasibility of creating these guides, and (d) next steps to explore their value empirically. We conclude by categorizing our approach not as a “best practice” but a “better one.” We encourage others to posit testable alternatives, so a best practice, empirically shown to improve scale equivalence, may eventually emerge.
Advice for Creating Scale-Specific Translation Guides
Eight conditions warranting a scale-specific guide
It is neither feasible nor necessary to create scale-specific translation guides for all psychological scales. Likely some or all of the following conditions should be met:
The construct is considered significant (e.g., clinical diagnosis) or universal (requiring cross-cultural research).
Multiple translation efforts are expected.
The construct is atypical conceptually (e.g., novel, abstract, easily conflated with other constructs).
Items involve difficult language, such as atypical or culture-specific diction (e.g., idioms) or troublesome grammar (e.g., reverse-scored items employing double negatives or counterfactuals).
The scale is multidimensional, especially when item-writing decisions were not uniform across dimensions or when translators wish to explore dimensionality in the target culture (Clifton, 2020; Haidt, 2010).
Risk of failing reliability benchmarks (α < .70) is high if just one item underperforms (e.g., subscales with few items or opposite-scored items, which—although important—often misbehave; Tay & Jebb, 2018).
Item-difficulty issues are present (e.g., extensive use of intensifiers such as “very” to achieve normally distributed responses).
Specific translation activities are needed (e.g., piloting, back-translation). Translators must sometimes prioritize, and scale creators can inform prioritization decisions.
In sum, scale-specific translation guides are most needed when a widely used scale measuring a major psychological construct involves scale-specific translation challenges that are unusual and important for scale equivalence.
Suggested format of scale-specific guides
Although various formats are plausible, we suggest scale-specific information be presented in the following five parts:
Scale background
Construct-specific issues
Item-specific issues
Initial translation efforts
Priority concerns
This format was developed through an iterative process of translator review and refinement. Each section is described alongside insights from the full-length exemplar guide to translating the Primals Inventory (see Supplemental Material). These details illustrate the presence of numerous, critical translation challenges specific to the Primals Inventory that scale-generic guides do not address.
I. Scale background
Guides would begin with a brief introduction to the construct and scale, a justification for scale-specific guidance based on the eight conditions above, and a reminder that the guide should supplement, not replace, existing scale-generic guidance. For example, the exemplar defines primal world beliefs (“primals”). It specifies the scale under consideration (the Primals Inventory) and notes that although subjectivity is inherent to item writing, many decisions have basis in piloting or empirical item-response characteristics.
II. Construct-specific issues
Next, a theory-oriented section would highlight the main difficulties the translator will face repeatedly across items (and subscales) given the peculiarities of the construct. Example topics include construct definition, important or repeated words, critical translation activities because of the construct, measurement strategy/philosophy, and empirically confirmed problems to watch for (e.g., skewed response patterns for particular subscales)—all issues that generic guidance cannot reasonably address. The exemplar guide examines eight such issues for the Primals Inventory (see the Supplemental Material):
how to reference “the world”;
navigating repeated use of unusual syntax (e.g., “It feels like . . . ”);
consulting atypical experts;
prioritizing item piloting;
item difficulty problems that vary across subscales;
translating additional items because of unusually short subscales;
including reverse-scored items;
calibrating translation goals given that primals are underexplored.
These eight issues concern both the process of translating items (e.g., Item 3) and the content of items being translated (e.g., Item 1), although the focus is mostly on content. To illustrate why these issues are critical for Primals Inventory scale equivalence but cannot be feasibly addressed by scale-generic guides, we highlight two examples below.
Referring to “the world.”
A major construct-specific issue the exemplar discusses is how to translate the notion of “the world”—the object of belief—which must be referenced in all 99 items but is referred to in more than 35 different ways. For many belief scales, the object of belief is clear enough that a single, familiar term can be used for all items (e.g., “chemotherapy”). But “the world” relevant to primal world beliefs is not simply the planet or a geographic location. It is the broadest psychologically meaningful habitat, which is defined by the respondent. Because all words, including “the world,” imperfectly capture this meaning, systematic error is avoided by referring to the world in many ways, tailored to each item. Piloting confirmed this variety as necessary. For example, the term “world” was adequate when measuring the idea that many things and situations are dangerous (e.g., “On the whole, the world is dangerous”) but not when trying to measure the belief that many things and situations reflect intention (hypothetically, “On the whole, the world is intentional”) because this evokes specifically human connotations. Therefore, a better term for “the world” is more specific to nature (hypothetically, “The universe is intentional”). Table S1 in the Supplemental Material provides translators with all terms used for “the world,” such as “life” and “most things and situations,” and recommends translators develop and reference their own similar bank during Primals Inventory translation.
Translating additional items
Another scale-specific issue the exemplar discusses proved critical in early translation efforts (and latter efforts during which advice was not followed). Compared with other scales in psychology, the Primals Inventory is unforgiving of translation miscalibration primarily because most subscales involve just four or five items. Subscale brevity means that “if even one translated item performs poorly, entire subscales can easily fail reliability benchmarks” (i.e., α < .70; see Supplemental Material). Thus, the guide strongly recommends that for each subscale, a few additional high-performing items from the original item pool be translated, administered, and culled as needed during item analysis. Even when following high-quality translation practices, an item may prove untenable in a new linguistic context and must be dropped. The exemplar specifies recommended additional items from the original item pool based on empirical item characteristics.
III. Item-specific issues
After construct discussions would come a fully pragmatic section discussing each subscale and item in an annotated table, allowing item-specific challenges to remain top-of-mind throughout translation. Table content might include:
(sub)scale concept definition(s);
original (sub)scale items;
item-specific annotations, especially highlighting item difficulty, colloquialisms, and translation alternatives;
additional notes highlighting key challenges.
To illustrate, Table 2 provides information for the subscale measuring the belief the world is “improvable.” Although we group item-specific comments by subscale in the exemplar, separate discussions of each item may typically be preferable.
Excerpt From Table S2 of the Supplemental Material Available Online: Annotated Table of Primals Inventory Items for Translation Purposes
Reverse-scored items.
IV. Initial translation efforts
Each time a scale is translated, much is learned, especially the first few times. A fourth section would describe initial experiences so major challenges can be surfaced. Written primarily by initial translators, this section can facilitate peer dialogue and demonstrate how issues encountered in one effort led to empirically demonstrated improvement in the next. For example, after the first Primals Inventory effort (German) encountered internal reliability problems, steps taken in the second (Italian) ameliorated the concern. Note that this section of the exemplar also broadens a guide’s focus beyond idiosyncrasies in the source language such that translation comments can be more culturally neutral (Acquadro et al., 2018; van de Vijver & Leung, 2021).
V. Priority concerns
A final section would comment on priorities. The time and financial resources of independent translation efforts vary widely. In response to early drafts of the exemplar guide, translators suggested ending with a sense of how to prioritize the many suggestions provided. Thus, the exemplar ends by noting the three most important takeaways: (a) carefully translating “the world,” (b) calibrating item difficulty, and (c) initially administering one or two additional items for some subscales. Without the initial German and Italian translations, these priorities would have been different.
Feasibility Requires Publishability
The main suggestion of this article is that to improve scale equivalence, translators may require increased access to scale-specific information. This may be achieved via the creation of a third translation-guidance literature. On the basis of our experience, in the previous section, we specified a form such documents might take; however, any number of formats might work. Assuming such guides improve scale equivalence (empirical next steps discussed in closing), the more pressing issue is likely feasibility.
A major strength of the proposed approach is that scale-specific translation guides are stand-alone documents that can be written after the original scale-creation effort. They do not depend or impinge on the initial effort to achieve validity in a particular culture. Some worthy proposals for improving cross-cultural scale equivalence have involved writing original scale items that are easy to translate from the start (Acquadro et al., 2018; van de Vijver & Leung, 2021). For example, in creating the Social Axioms Survey II, psychologists from 10 countries generated intentionally “pancultural” scale items and established reliability and validity for those items across countries (Leung et al., 2012, p. 836). Such approaches can be invaluable, but most new scales are never widely used in their initial cultural context, let alone translated for others (e.g., SCImago, n.d.). Thus, most original scale creators best serve the psychology-research community by focusing on creating the most valid measure they can in the original language. In fact, the measurement of some latent variables, such as religious thoughts (e.g., about the “soul”) or political attitudes (e.g., “liberalism”), may require culturally based language. For these reasons, a strength of scale-specific guides is that they offer an avenue to increase scale equivalence once international literatures become likely.
Another strength is that the proposed approach echoes concurrently emerging efforts in some nonpsychology disciplines. For example, even though optimizing original scale items for translation from the outset has become widespread in the international-medical-research community over the past 10 years (S. Eremenco, personal communication, February 14, 2022), scale-specific guidance is still considered a “necessary foundation to improve the semantic and conceptual equivalence of the translations” (Eremenco et al., 2017, p. 7; see also Herdman et al., 1998). Therefore, some large, collaborative, international projects examining the efficacy of medical treatments have created detailed scale-specific documents, similar in many respects to Section III in the guide proposed above (List of Adult Measures, 2021; The WHOQOL Group, 1994). Others, incentivized by regulators such as the U.S. Food and Drug Administration, create proprietary translations including scale-definition documents (e.g., ICON Language Services, D. Weiss, personal communication, February 24, 2022).
One reason these practices remain relatively rare in psychology may be a lack of clarity regarding format and content—hence our specification and exemplar. Another reason, however, concerns the lack of professional incentives. These guides are significant undertakings; the exemplar concerns a long scale, is 50 pages, and remains far from exhaustive. Indeed, the main reason the proposed third literature has not already emerged may be that, absent central planning or regulation as in the medical field, psychologists have little incentive to produce scale-specific translation guides.
Some ways of encouraging the creation of scale-specific guides are obvious. For example, newly published scale-generic guides could recommend consultation of scale-specific guides where available. Other ideas can be readily dismissed, such as appending scale-specific guides to original scale-creation articles (for reasons discussed above) or simply asking original scale builders to write better scale-specific guides and continue publishing them informally (e.g., on ResearchGate). The latter is untenable because researchers are motivated to produce publishable materials; peer-reviewed publication is not just the basis for professional advancement but also for shaping the conversation and ensuring its quality. Slightly better might be to include scale-specific guides in test manuals. However, not all disciplines use test manuals or subject them to peer review.
Therefore, we suggest that the most feasible way to encourage the development of the proposed third literature is to make scale-specific translation guides publishable via a peer-reviewed outlet (journal, journal special section, or edited book series). Reviewers might include construct experts, translation-literature experts, and prospective translators. The next section considers the two main potential objections to this proposal.
Two Potential Objections
Readership
The first main objection to creating a peer-reviewed outlet for scale-specific translation guides is limited readership. We estimate the target audience for a scale-specific guide (translators of that scale) is unlikely to exceed 50 to 100 researchers. However, each one would likely cite the guide, providing a citation floor exceeding the mean for psychology articles (< 1 citation annually; even prominent journals have a 50-year mean of 2.1 annually; Cho et al., 2012; Harzing, 2010).
Secondary audiences are also likely. The article most akin to—although different from—the sort of scale-specific translation guide proposed is an edited chapter titled “Guide to Constructing Self-Efficacy Scales” by Bandura (2006b). It was not created for adapting self-efficacy scales into new languages but new competency domains (e.g., self-efficacy at maintaining an exercise routine) and is therefore not an ideal comparison. Still, it reads much like the exemplar guide, highlighting construct-level issues (e.g., how self-efficacy is different from self-esteem and locus of control), item-wording issues (e.g., “items should be phrased in terms of can do rather than will do” to emphasize capacity; Bandura, 2006b, p. 308), and activities to prioritize (e.g., piloting). Readers not interested in measuring self-efficacy might understandably consider this minutia because, similar to our exemplar in the Supplemental Material, Bandura’s target audience is scale creators for one construct.
Nevertheless, this chapter became one of the most cited articles in self-efficacy (cited 7,483 times, Google Scholar, June 2021). This is poorly explained by outlet prestige. The book had 15 chapters; Bandura’s guide has been cited 6 times more than the next most-cited chapter and 30 times more than the rest. Heavy citations are also poorly explained by authorship because by chance, the next most-cited chapter was also written by Bandura (2006a) and for a wider target audience (developmental psychologists interested in agency). To better understand, we randomly selected 100 citing articles and categorized them by primary reason for citation (4% cited erroneously or were unavailable), as follows:
Forty-six percent cited the guide to discuss conceptual insights into the nature, definition, or function of self-efficacy, for example, “Bandura (2006) understood self-efficacy as . . .”
Thirty-one percent cited the guide to create a new domain-specific self-efficacy scale (target audience).
Sixteen percent cited the guide to evaluate a previously created self-efficacy scale.
Three percent cited the guide to apply Bandura’s methodological insights to other scales or constructs.
In sum, two thirds of citations (65%) were by secondary audiences, especially researchers interested not in measurement but in construct definition. Although we do not suggest that scale-specific translation guides will receive as many citations overall, we can expect some citations outside the target audience from researchers (a) discussing the construct, (b) evaluating previous translations, or (c) applying methodological insights to other scales. If guides are created on touchstone measures of important constructs (i.e., two of the eight specified conditions justifying a guide are met), they would provide deep insight from a top theorist (e.g., Haidt on the MFQ). Impact factors could rise further as outlets become clearinghouses for only those scales editors and reviewers deem worthy of and ready for translation, with translators increasingly avoiding scales without published guides.
Subjectivity
The second main objection to establishing a publication outlet for scale-specific translation guides might be subjectivity. Indeed, various discussions in these guides will be based on pilot evidence, scale-builder speculation, initial translation efforts that may not be representative, and so forth. This concern is reasonable and reflects an appropriately high standard for scientific claims, but it is misplaced. To borrow a nautical analogy, scale-specific guides are not navigational charts claiming to map objective reality, but pragmatic advice from one sailor to the next on what to look out for when navigating a tricky waterway. The task is subjective yet unavoidable. Each translation process involves hundreds if not thousands of decisions about what words to use in the target language. The question is not whether subjectivity is inherent in the field’s scientific methodology but whether psychologists can bring themselves to openly discuss it.
Open discussion would advance the three values of openness, integrity, and reproducibility identified by the Center for Open Science (Nosek, 2017). Increased openness would promote scale-creator integrity by counteracting questionable practices, such as obfuscating or omitting key item-writing decisions (Ferguson, 2015), as well as translator integrity; when decisions are based on “personal communication,” readers cannot discern the advice given, whether it was followed, or how adherence might affect score interpretation. Indeed, 16% of authors citing Bandura’s guide did so to critique the scales and findings of other researchers. Critically, low cross-cultural equivalence can be considered a replication issue in which insufficient item-writing detail precludes reliable replication of item meaning in new languages. If so, addressing it is a step toward ameliorating the broader “replication crisis” (Maxwell et al., 2015). Finally, we add a fourth value: researcher skill. Good item writing and translation are difficult, they take skill, and skill development is aided by a community of experts who openly discuss their actions and rationale.
Next Steps: Testing the Approach Empirically
Before any incentivizing, however, it remains an open empirical question whether such guides would meaningfully improve scale equivalence. More scale-specific information is theorized to be helpful (e.g., Eremenco et al., 2017), but this has not been shown. We suggest research proceed on dual fronts. First, to define the problem further, metascience studies could survey (a) scale translators on their access to scale-specific translation guidance (related to Katan, 2009) and (b) original scale creators on whether such materials were produced. Second, to determine causation, researchers might (a) conduct experiments in which translators adapt a scale with or without a scale-specific translation guide (inspired by PACTE Group, 2008) and (b) review the long-term impact (if any) of guides on the cross-cultural equivalence of translated scales (similar to Stevanovic et al., 2017). The Primals Inventory translation guide may serve as an imperfect test case for the latter.
Concluding Remarks
A perennial difficulty in the social sciences is balancing nomothetic approaches (focusing on averages and general laws) with idiographic concerns (the characteristics of particular cases). Since Freud, psychology has tilted nomothetic (Robinson, 2011). Efforts to resolve low scale equivalence, including translation guides, reflect that leaning. In cases in which those efforts have not worked, however, new approaches should be discussed. Our suggestion is to try something more idiographic. We detailed a particular approach that we believe has been invaluable to our research area, could be helpful to others, and seems feasible (detailed, publishable, peer-reviewed, five-part, scale-specific translation guides). However, the specificity provided should not be interpreted as a recommendation of best practice but, rather, a spark for other approaches and testing. We hope that in due course, the empirical literature matures sufficiently to warrant recommendation of best practice.
In the meantime, our overarching argument is for the value of increased translator access to scale-specific information, in whatever form. Scale building involves two main tasks: writing items and empirically validating them. Both tasks require methodological description, with the necessary level of detail determined by what makes replication possible (e.g., Kallet, 2004). Yet original scale articles typically spend pages describing empirical validation and a sentence or two on item writing. The logical consequence is difficulty reproducing item meaning in new contexts. If scale translation is the effort to replicate items’ received message in a new linguistic landscape (Egger et al., 2003), cross-cultural research is the arena in which the consequences of insufficient item-writing detail should be most pronounced. Of course, some translators succeed anyway, but whenever guesswork misses the mark, scale equivalence suffers. Expecting otherwise is akin to believing experimental studies can forego detailed method sections without replication failures because generic guidance exists for how to conduct scientific experiments. Particularities matter. Not all can be discussed; important ones can be.
Supplemental Material
sj-docx-1-pps-10.1177_17456916221119396 – Supplemental material for Improving Scale Equivalence by Increasing Access to Scale-Specific Information
Supplemental material, sj-docx-1-pps-10.1177_17456916221119396 for Improving Scale Equivalence by Increasing Access to Scale-Specific Information by Alicia B. W. Clifton, Alexander G. Stahlmann, Jennifer Hofmann, Alice Chirico, Rive Cadwallader and Jeremy D. W. Clifton in Perspectives on Psychological Science
Footnotes
Acknowledgements
We are grateful to Louis Tay, Jamie DeCoster, David Watson, Lee Anna Clark, and especially Richard Lerner for advice; Sonya Eremenco, Ben Arnold, and Dana Weiss, who provided comments on translation in the medical research field; and Primals Inventory translators who awakened us to scale-specific translation issues.
Transparency
Action Editor: Tina M. Lowrey
Editor: Klaus Fiedler
ORCID iDs
Notes
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
