Improving Scale Equivalence by Increasing Access to Scale-Specific Information

Abstract

Measures of the same phenomenon should produce the same results; this principle is fundamental because it allows for replication—the basis of science. Unfortunately, measures of a psychological construct in one language can often measure something a bit different in another language (i.e., low “scale equivalence”). Historically, the problem was thought to stem from insufficient knowledge of best-practice translation procedures. Yet solutions based on this diagnosis and their widespread adoption have not resolved the issue. In this article, we suggest that an additional problem might be insufficient information about the measure being translated. If so, low scale equivalence is a problem that translators and cross-cultural psychologists cannot solve on their own. We explore the possibility that measure-specific translation guides be created by original scale builders for the most widely used measures of important psychological constructs. We describe why such guides are needed, when they are needed, what they might look like, their feasibility, and next steps, providing a complete example guide and test case in a supplement concerning the Primals Inventory. In this article, we seek to spark discussion on translation practices happening behind the scenes and how greater transparency can improve scale equivalence, in the spirit of open science.

Keywords

measurement, translation, adaptation, replication, scale equivalence, primal world beliefs

Scale building in psychology consists of two main tasks: writing items and empirically validating them (e.g., DeVellis, 2016). Empirical validation involves activities familiar to scientists generally, such as data collection and statistical analyses, with sufficient methodological transparency to allow replicability. Item writing, in contrast, can be a messy wordsmithing exercise. It requires skills that overlap less with those of other scientists and more with those of philosophers, poets, and marketers. Although informed by standardized techniques (e.g., cognitive interviewing; Willis, 2004), item-writing decisions are ultimately subjective, and descriptions thereof can feel out of place in scientific articles. Indeed, most scale-building articles spend pages on empirical item validation and a scant line or two on item writing, if any (e.g., the Satisfaction with Life Scale: Diener et al., 1985; the Primals Inventory: Clifton et al., 2019). This opacity on item-writing decisions, despite its central importance, leads to three predictable consequences: First, item-writing skills are undervalued (e.g., Sheatsley, 1983). Second, they are harder for researchers to develop despite excellent articulations of best-practice item-writing principles (DeVellis, 2016; Sheatsley, 1983).

Third, and most relevant here, any attempt to replicate item writing in a new cultural or linguistic context involves considerable guesswork, leaving scale translators/adaptors¹ in the lurch. It is easy for translators to run the same empirical-validation process (e.g., the same exploratory factor analysis with the same rotation) on items written in different languages (just “plug and chug”). In contrast, replicating an item’s communicated message in a new language, pinpointing the same phenomenon at the same item-difficulty level (i.e., good item translation), is exceedingly challenging (e.g., Matsumoto & van de Vijver, 2011). Ironically, this is precisely where original scale-building articles offer the least guidance. If scale builders do not detail the pitfalls that they took pains to avoid in the creation of items, no one should be surprised when scale translators fall into these pits unwittingly. With this article, we seek to spark discussion on the item-writing and translation processes happening behind the scenes and how greater transparency can improve scale equivalence, in the spirit of open science.

The Literature Gap

Two existing guidance literatures

To improve item writing during translation, two loosely defined guidance literatures have emerged (Table 1), although sometimes a single article provides both guidance types (e.g., Hambleton & Lee, 2013). Both are written by scale-translation-methodology experts, distilling insights applicable across scales—what we call “scale-generic”—from the broader translation-methodology literature. They are meant to aid individuals taking on particular translation projects, who are often not translation-literature experts. Both types are explicitly motivated by the goal of improving scale equivalence, defined as “the level of comparability of measurement outcomes,” in cross-cultural research through improved translation practices (Matsumoto & van de Vijver, 2011, p. 8; e.g., Beaton et al., 2000).

Table 1.

Three Translation-Guide Literatures Aiming to Improve Scale Equivalence

	Scale-generic process guides	Scale-generic content guides	Scale-specific guides (proposed)
Goal	Prescribe a standard scale translation process, step-by-step	Note issues that sometimes arise when translating any given scale	Note issues that usually arise when translating a particular scale
Equivalence-problem diagnosis	Lack of information about general translation processes	Lack of information about general item-translation principles	Lack of information about scale being translated
Main authors	Translation experts	Translation experts	Original scale creators
Primary audience	All translators	All translators	Translators of one scale
Secondary audience	None	None	Researchers interested in the construct being measured
User-friendliness	High	Low^a	High
Current size of literature	Small	Large	Missing

^a Only in comparison, because scale-generic content guides alert the translator to issues that may or may not be relevant.

The first type is scale-generic process guides. These provide step-by-step instructions on how to execute a standard translation process (e.g., Beaton et al., 2000; Borsa et al., 2012; International Test Commission, 2017; van de Vijver & Hambleton, 1996). They are user-friendly because the vast majority of steps are relevant to the vast majority of translation efforts. However, they provide little guidance about how to translate actual scale items (i.e., picking the best words).

To that end, the second type is scale-generic content guides (e.g., Hambleton & Zenisky, 2011; Johnson et al., 2011), which alert translators to general challenges they might encounter while translating item content. For example, Hambleton and Zenisky (2011) compiled 25 issues often faced by scale translators, such as when a colloquialism in the source language has no equivalent in the target language.² These guides are, comparatively, not user-friendly because principles vary in relevance to particular translation efforts. Scale-generic content guides alert translators to potential issues (e.g., idioms) but help less once an issue is identified (e.g., translating that idiom). Discussion of each principle is brief, so those of critical relevance to a particular translation often require a deeper dive into the methodology literature, which in practice may not occur.

Despite the development of these two literatures to address scale equivalence in cross-cultural research, low scale equivalence remains pervasive. For example, in a systematic review of 26 child and adolescent psychopathology assessment scales, “none of the evaluated scales have strong evidence for cross-cultural validity and suitability for cross-cultural comparison” (Stevanovic et al., 2017, p. 126). Low scale equivalence persists even though scale-generic process guides have roundly succeeded in disseminating a standardized process (Chen, 2008). A recent methodological review of 78 health instruments “revealed that cultural adaptation followed similar steps with few variations” and that “most follow the process based on Beaton et al. (2000)” (Arafat et al., 2016, p. 129). In other words, scale translators have successfully organized themselves to try a solution (although subdisciplines vary) that, seemingly, has not resolved the problem.

Diagnosing the problem

When treatment fails to improve symptoms, sometimes the problem is misdiagnosis. In this case, the value of both guidance literatures is premised on a diagnosis of the core problem as a lack of information about generic translation principles and procedures. A further cause, however, might be lack of information about the particular scale being translated. Indeed, it is a logical leap to expect that scale-generic advice can satisfactorily resolve equivalence issues if most scale-building efforts present both common and unusual challenges.³ This notion is axiomatic throughout the scale-creation literature, in which generic advice about writing new items is consciously brief (Sheatsley, 1983). As DeVellis (2016) wrote, “Different variables call for different assessment strategies” (p. 7). If so, instead of more widespread adoption of scale-generic translation guides, researchers might try something different.

Current practice

In some ways, we propose simply to extend the way construct- and scale-specific advice is currently transferred from scale creators to translators, making it more formal, detailed, accessible, and incentivized. A caveat is that no empirical metascience research has surveyed translators to outline common practice, and considerable variation in practice exists. However, when researchers discover a new scale in a foreign language relevant to their outcomes of interest, they often start by emailing original scale builders for advice (e.g., Stahlmann et al., 2020), as is encouraged by most scale-generic process guides (including Beaton et al., 2000). What typically happens next has been mentioned only in passing in the literature (Herdman et al., 1998; Wild et al., 2005), but we expect most readers with translation experience will recognize the patterns. Our characterization is based on the anecdotal experience of the coauthors and colleagues around the world. We suspect it is typical of translation efforts in social psychology, personality psychology, positive psychology, and, to a lesser extent, clinical psychology.

In practice, the most common result of initial translator inquiries is no response from the original scale authors, followed in frequency by a brief, informal email suggesting a handful of ideas that occur in the moment (e.g., “Be careful how you translate X”; B. Arnold, personal communication, February 14, 2022). After multiple translator inquiries, more conscientious scale builders begin informally aggregating their notes into “frequently asked questions.” Although helpful, notes are often spotty. They are rarely the product of a careful and thorough review of challenges that emerged during original scale creation or previous translation processes. Original item writing usually happened years prior, and unlike the thoughtfully written, published record of empirical validation efforts, a detailed account of original item-writing choices is unlikely to exist. Many insights are simply forgotten. For example, when asked why a particular phrase was used, original scale creators may remember discussing the phrase with colleagues but not the content of those discussions, leading to responses such as, “I’m not sure why we said it that way, but I think we were probably concerned that . . .” Furthermore, scale builders may not have made clear-eyed item-writing choices to begin with. Herdman and colleagues (1998) noted that translator inquiries often motivate scale creators to “clarify the aims of their original version” (p. 328).

On rarer occasions, in response to numerous inquiries, informal scale-specific translation guidance documents can emerge, with newly remembered details accumulating such that each additional translation team is given some new pieces of advice. These documents are never subject to peer review, although such review might increase their utility, clarity, or comprehensiveness regarding the most critical translation issues. Unbiased feedback from translators is infeasible, sometimes because of power dynamics between scale creators and scale translators, accentuated when the former are Western and, on average, more direct in communication style (e.g., Sanchez-Burks et al., 2003). Translators often cite personal communication with original scale creators, but reviewers cannot judge whether private guidance was followed. Even at their most developed, such documents remain spartan and, lacking a publication outlet and basic standards for format or rigor, are informally shared among colleagues or self-published on personal websites (e.g., ResearchGate). This is true even for scales translated dozens of times that serve as the foundation for large and important literatures on theoretically universal psychological processes.

For example, the Moral Foundations Questionnaire (MFQ; Graham et al., 2009) measures the five values of moral-foundations theory, a prominent theory in social psychology. According to moralfoundations.org, the MFQ has been translated into 39 languages (checked June 2021), the basis for a large literature on presumably universal values. Anticipating future translations, Haidt (2010) wrote a roughly one-page blog post on MFQ translation issues. It mainly describes how translators might consider the shades of meaning associated with a few terms, such as “respect,” “authority,” and “purity,” and might avoid strict translation. This is both much more guidance than typically provided and much less than we propose. For example, no subscale- or item-specific advice was included. (The link is also now inactive, underscoring access issues.) Iurino and Saucier (2020) recently tested measurement invariance of the MFQ across 27 countries. They concluded, “We were not able to replicate Graham et al.’s (2009) results indicating that a five-factor model is a suitable approach to modeling the moral foundations” (p. 370). All MFQ translations were created via standard translation processes by translators familiar with scale-generic translation-guidance literature (Iurino & Saucier, 2020). It is unclear whether the low scale equivalence resulted from genuine differences in latent variables across cultures or from mistranslation (see Note 3).

Another emerging example involves the Primals Inventory (Clifton et al., 2019). The Primals Inventory measures basic beliefs about the world called “primal world beliefs” (or “primals”), such as the beliefs the world is dangerous, interesting, abundant, and so forth, that are thought to shape many clinical and personality variables (e.g., depression and curiosity). In this case, when original scale authors (including first and senior authors on this article) began receiving inquiries from prospective translators, the same type of short, informal, accumulating ad hoc guidance document described above emerged. The first few translation processes suggested translators were falling into predictable traps that were construct- and scale-specific (described in the Supplemental Material available online) and could evade detection during back-translation. After 12 translators had reached out and substantial international measurement work on primals became likely, the value of detailed, systematic guidance coauthored by original scale creators and initial translation teams became clear.

Motivation to produce a guide, however, was low. For empirical scientists, unmasking numerous subjective decisions central to scale validity is an uncomfortable prospect, let alone providing those considerations as a basis for recommendation. There was also little professional incentive given the lack of a quality publication outlet (confirmed afterward by manuscript rejections). Indeed, to our knowledge, no detailed scale-specific translation guide has ever been published by any peer-reviewed psychology journal. Finally, we had little idea what a rigorous, high-quality, translator-friendly, scale-specific guide might look like.

The proposed third guidance literature

In sum, a gap currently exists in the methodological literature: a lack of accessible, quality information on scales being translated. Translators typically cannot reliably learn (a) how and why items were written as they were, (b) which issues from the broader scale-translation literature are most relevant, and (c) how relevant issues might be addressed for that particular scale. To fill this gap, in this article, we discuss how the two existing types of scale-generic translation guides (Table 1) might be supplemented by a third type of scale-specific translation guide. Instead of translation-methodology experts, original scale creators would primarily author these guides, perhaps with input from translation experts and teams. Scale-specific guides would be user-friendly, highlighting only relevant insights from the broader scale-translation methodology literature and walking the translator step-by-step through scale content (including a comprehensive table of items with translation-relevant annotations, as is currently created by translation experts in some medical research; B. Arnold, personal communication, February 14, 2022). Guides would focus on the most critical item-writing decisions for replicating item meaning in a new language. As a rule, these guides would not provide generic translation guidance (e.g., Beaton et al., 2000) but supplemental construct-, scale-, subscale-, and item-specific insights only, which original scale creators are best positioned to provide. In the balance of this article, we discuss (a) conditions justifying the creation of scale-specific translation guides, (b) the form these guides might take (a suggested five-part format is illustrated by a full-length exemplar in the Supplemental Material), (c) the feasibility of creating these guides, and (d) next steps to explore their value empirically. We conclude by categorizing our approach not as a “best practice” but a “better one.” We encourage others to posit testable alternatives, so a best practice, empirically shown to improve scale equivalence, may eventually emerge.

Advice for Creating Scale-Specific Translation Guides

Eight conditions warranting a scale-specific guide

It is neither feasible nor necessary to create scale-specific translation guides for all psychological scales. Likely some or all of the following conditions should be met:

The construct is considered significant (e.g., clinical diagnosis) or universal (requiring cross-cultural research).

Multiple translation efforts are expected.

The construct is atypical conceptually (e.g., novel, abstract, easily conflated with other constructs).

Items involve difficult language, such as atypical or culture-specific diction (e.g., idioms) or troublesome grammar (e.g., reverse-scored items employing double negatives or counterfactuals).

The scale is multidimensional, especially when item-writing decisions were not uniform across dimensions or when translators wish to explore dimensionality in the target culture (Clifton, 2020; Haidt, 2010).

Risk of failing reliability benchmarks (α < .70) is high if just one item underperforms (e.g., subscales with few items or opposite-scored items, which—although important—often misbehave; Tay & Jebb, 2018).

Item-difficulty issues are present (e.g., extensive use of intensifiers such as “very” to achieve normally distributed responses).

Specific translation activities are needed (e.g., piloting, back-translation). Translators must sometimes prioritize, and scale creators can inform prioritization decisions.

In sum, scale-specific translation guides are most needed when a widely used scale measuring a major psychological construct involves scale-specific translation challenges that are unusual and important for scale equivalence.

Suggested format of scale-specific guides

Although various formats are plausible, we suggest scale-specific information be presented in the following five parts:

Scale background

Construct-specific issues

Item-specific issues

Initial translation efforts

Priority concerns

This format was developed through an iterative process of translator review and refinement. Each section is described alongside insights from the full-length exemplar guide to translating the Primals Inventory (see Supplemental Material). These details illustrate the presence of numerous, critical translation challenges specific to the Primals Inventory that scale-generic guides do not address.

I. Scale background

Guides would begin with a brief introduction to the construct and scale, a justification for scale-specific guidance based on the eight conditions above, and a reminder that the guide should supplement, not replace, existing scale-generic guidance. For example, the exemplar defines primal world beliefs (“primals”). It specifies the scale under consideration (the Primals Inventory) and notes that although subjectivity is inherent to item writing, many decisions have a basis in piloting or empirical item-response characteristics.

II. Construct-specific issues

Next, a theory-oriented section would highlight the main difficulties the translator will face repeatedly across items (and subscales) given the peculiarities of the construct. Example topics include construct definition, important or repeated words, critical translation activities because of the construct, measurement strategy/philosophy, and empirically confirmed problems to watch for (e.g., skewed response patterns for particular subscales)—all issues that generic guidance cannot reasonably address. The exemplar guide examines eight such issues for the Primals Inventory (see the Supplemental Material):

1. how to reference “the world”;

2. navigating repeated use of unusual syntax (e.g., “It feels like . . . ”);

3. consulting atypical experts;

4. prioritizing item piloting;

5. item difficulty problems that vary across subscales;

6. translating additional items because of unusually short subscales;

7. including reverse-scored items;

8. calibrating translation goals given that primals are underexplored.

These eight issues concern both the process of translating items (e.g., Issue 3) and the content of items being translated (e.g., Issue 1), although the focus is mostly on content. To illustrate why these issues are critical for Primals Inventory scale equivalence but cannot be feasibly addressed by scale-generic guides, we highlight two examples below.

Referring to “the world”

A major construct-specific issue the exemplar discusses is how to translate the notion of “the world”—the object of belief—which must be referenced in all 99 items but is referred to in more than 35 different ways. For many belief scales, the object of belief is clear enough that a single, familiar term can be used for all items (e.g., “chemotherapy”). But “the world” relevant to primal world beliefs is not simply the planet or a geographic location. It is the broadest psychologically meaningful habitat, which is defined by the respondent. Because all words, including “the world,” imperfectly capture this meaning, systematic error is avoided by referring to the world in many ways, tailored to each item. Piloting confirmed this variety as necessary. For example, the term “world” was adequate when measuring the idea that many things and situations are dangerous (e.g., “On the whole, the world is dangerous”) but not when trying to measure the belief that many things and situations reflect intention (hypothetically, “On the whole, the world is intentional”) because this evokes specifically human connotations. Therefore, a better term for “the world” is more specific to nature (hypothetically, “The universe is intentional”). Table S1 in the Supplemental Material provides translators with all terms used for “the world,” such as “life” and “most things and situations,” and recommends translators develop and reference their own similar bank during Primals Inventory translation.

Translating additional items

Another scale-specific issue the exemplar discusses proved critical in early translation efforts (and later efforts during which advice was not followed). Compared with other scales in psychology, the Primals Inventory is unforgiving of translation miscalibration primarily because most subscales involve just four or five items. Subscale brevity means that “if even one translated item performs poorly, entire subscales can easily fail reliability benchmarks” (i.e., α < .70; see Supplemental Material). Thus, the guide strongly recommends that for each subscale, a few additional high-performing items from the original item pool be translated, administered, and culled as needed during item analysis. Even when following high-quality translation practices, an item may prove untenable in a new linguistic context and must be dropped. The exemplar specifies recommended additional items from the original item pool based on empirical item characteristics.
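The sensitivity of short subscales to a single weak item can be illustrated with a small calculation. The sketch below is not based on Primals Inventory data. It uses the standardized form of Cronbach's alpha, which depends only on the number of items and their mean inter-item correlation, and it assumes hypothetical correlations of .40 among well-functioning items and .10 for any pair that includes a poorly translated item. Function names and all numeric values are illustrative assumptions.

```python
from itertools import combinations


def standardized_alpha(corr):
    """Standardized Cronbach's alpha from a square inter-item correlation matrix.

    alpha = k * mean_r / (1 + (k - 1) * mean_r), where mean_r is the mean of
    the off-diagonal (pairwise) correlations and k is the number of items.
    """
    k = len(corr)
    pairs = list(combinations(range(k), 2))
    mean_r = sum(corr[i][j] for i, j in pairs) / len(pairs)
    return k * mean_r / (1 + (k - 1) * mean_r)


def make_corr(k, r_good, poor_item=None, r_poor=0.0):
    """Build a k x k correlation matrix (hypothetical values).

    Every pair of items correlates at r_good, except pairs that include
    poor_item (an index, or None), which correlate at r_poor.
    """
    corr = [[1.0] * k for _ in range(k)]
    for i, j in combinations(range(k), 2):
        r = r_poor if poor_item in (i, j) else r_good
        corr[i][j] = corr[j][i] = r
    return corr


# Hypothetical scenarios; .40 and .10 are assumed, not observed, correlations.
alpha_five_ok = standardized_alpha(make_corr(5, 0.40))
alpha_five_one_poor = standardized_alpha(make_corr(5, 0.40, poor_item=4, r_poor=0.10))
alpha_twenty_one_poor = standardized_alpha(make_corr(20, 0.40, poor_item=19, r_poor=0.10))
# Seven items translated (five plus two additional), weak item culled: six remain.
alpha_six_after_cull = standardized_alpha(make_corr(6, 0.40))

if __name__ == "__main__":
    print(f"5 items, all good:        {alpha_five_ok:.2f}")
    print(f"5 items, one poor:        {alpha_five_one_poor:.2f}")
    print(f"20 items, one poor:       {alpha_twenty_one_poor:.2f}")
    print(f"6 items after culling:    {alpha_six_after_cull:.2f}")
```

Under these assumed values, a five-item subscale falls from about .77 to about .66 when one item underperforms, which is below the .70 benchmark, whereas a 20-item scale with one weak item stays above .90. Translating two additional items and culling the weak one leaves six items with an alpha of .80, which is the logic behind the recommendation to translate additional items.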

III. Item-specific issues

After construct discussions would come a fully pragmatic section discussing each subscale and item in an annotated table, allowing item-specific challenges to remain top-of-mind throughout translation. Table content might include:

(sub)scale concept definition(s);

original (sub)scale items;

item-specific annotations, especially highlighting item difficulty, colloquialisms, and translation alternatives;

additional notes highlighting key challenges.

To illustrate, Table 2 provides information for the subscale measuring the belief the world is “improvable.” Although we group item-specific comments by subscale in the exemplar, separate discussions of each item may typically be preferable.

Table 2.

Excerpt From Table S2 of the Supplemental Material Available Online: Annotated Table of Primals Inventory Items for Translation Purposes

Improvable (5 items; 2 additional items)
Definition: Improvable (vs. too hard to improve) is the belief that most things can be readily changed for the better.
Original Items	Intensifier (bolded)	Downtoner (italicized)	Colloquial Phrase (underlined)
It’s possible to significantly improve basically anything encountered in life.	X	–	–
In most situations, making things way better is absolutely possible.	X	–	–
Most things and situations are responsive, workable, and totally possible to improve.	X	–	–
Most situations seem really difficult if not impossible to improve.*	X	–	–
No matter who you are, you can significantly improve the world you live in.	X	–	–
Recommended Additional Items
Life is full of stubborn problems, situations, and issues that just can’t be solved.*	X	–	X
Though sometimes hard, it feels totally possible to change things and make them much better.	X	–	–
Translation Notes
Definition: The opposite of Improvable is not unchangeable, which would relate too much conceptually to Stable (vs. fragile) or static (vs. Changing). Instead, this primal captures the general difficulty level of making things better.
Language: Avoid connotations that make the respondent think too much about big things, such as governments, poverty, climate, etc.; of course such big things are hard to change. Instead, the phrases “most things and situations,” “life,” and “things” (rather than “the world”) indicate typical malleability, mundane or not. In addition, items should avoid implying whether things are easy or hard to change for the respondent personally. Items should not measure beliefs about one’s own competence or ability to change things but how difficult things generally are to change for any given agent (person, creature, or even natural force), hence the importance of phrases like “no matter who you are.” This is not a general self-efficacy measure, which is a belief about the self.
Item Difficulty: Both forward and reverse-scored Improvable items require intensifiers because otherwise they would be too easy to agree with.

* Reverse-scored items.

IV. Initial translation efforts

Each time a scale is translated, much is learned, especially the first few times. A fourth section would describe initial experiences so major challenges can be surfaced. Written primarily by initial translators, this section can facilitate peer dialogue and demonstrate how issues encountered in one effort led to empirically demonstrated improvement in the next. For example, after the first Primals Inventory effort (German) encountered internal reliability problems, steps taken in the second (Italian) ameliorated the concern. Note that this section of the exemplar also broadens a guide’s focus beyond idiosyncrasies in the source language such that translation comments can be more culturally neutral (Acquadro et al., 2018; van de Vijver & Leung, 2021).

V. Priority concerns

A final section would comment on priorities. The time and financial resources of independent translation efforts vary widely. In response to early drafts of the exemplar guide, translators suggested ending with a sense of how to prioritize the many suggestions provided. Thus, the exemplar ends by noting the three most important takeaways: (a) carefully translating “the world,” (b) calibrating item difficulty, and (c) initially administering one or two additional items for some subscales. Without the initial German and Italian translations, these priorities would have been different.

Feasibility Requires Publishability

The main suggestion of this article is that to improve scale equivalence, translators may require increased access to scale-specific information. This may be achieved via the creation of a third translation-guidance literature. On the basis of our experience, in the previous section, we specified a form such documents might take; however, any number of formats might work. Assuming such guides improve scale equivalence (empirical next steps discussed in closing), the more pressing issue is likely feasibility.

A major strength of the proposed approach is that scale-specific translation guides are stand-alone documents that can be written after the original scale-creation effort. They do not depend or impinge on the initial effort to achieve validity in a particular culture. Some worthy proposals for improving cross-cultural scale equivalence have involved writing original scale items that are easy to translate from the start (Acquadro et al., 2018; van de Vijver & Leung, 2021). For example, in creating the Social Axioms Survey II, psychologists from 10 countries generated intentionally “pancultural” scale items and established reliability and validity for those items across countries (Leung et al., 2012, p. 836). Such approaches can be invaluable, but most new scales are never widely used in their initial cultural context, let alone translated for others (e.g., SCImago, n.d.). Thus, most original scale creators best serve the psychology-research community by focusing on creating the most valid measure they can in the original language. In fact, the measurement of some latent variables, such as religious thoughts (e.g., about the “soul”) or political attitudes (e.g., “liberalism”), may require culturally based language. For these reasons, a strength of scale-specific guides is that they offer an avenue to increase scale equivalence once international literatures become likely.

Another strength is that the proposed approach echoes concurrently emerging efforts in some nonpsychology disciplines. For example, even though optimizing original scale items for translation from the outset has become widespread in the international-medical-research community over the past 10 years (S. Eremenco, personal communication, February 14, 2022), scale-specific guidance is still considered a “necessary foundation to improve the semantic and conceptual equivalence of the translations” (Eremenco et al., 2017, p. 7; see also Herdman et al., 1998). Therefore, some large, collaborative, international projects examining the efficacy of medical treatments have created detailed scale-specific documents, similar in many respects to Section III in the guide proposed above (List of Adult Measures, 2021; The WHOQOL Group, 1994). Others, incentivized by regulators such as the U.S. Food and Drug Administration, create proprietary translations including scale-definition documents (e.g., ICON Language Services, D. Weiss, personal communication, February 24, 2022).

One reason these practices remain relatively rare in psychology may be a lack of clarity regarding format and content—hence our specification and exemplar. Another reason, however, concerns the lack of professional incentives. These guides are significant undertakings; the exemplar concerns a long scale, is 50 pages, and remains far from exhaustive. Indeed, the main reason the proposed third literature has not already emerged may be that, absent central planning or regulation as in the medical field, psychologists have little incentive to produce scale-specific translation guides.

Some ways of encouraging the creation of scale-specific guides are obvious. For example, newly published scale-generic guides could recommend consultation of scale-specific guides where available. Other ideas can be readily dismissed, such as appending scale-specific guides to original scale-creation articles (for reasons discussed above) or simply asking original scale builders to write better scale-specific guides and continue publishing them informally (e.g., on ResearchGate). The latter is untenable because researchers are motivated to produce publishable materials; peer-reviewed publication is the basis not just for professional advancement but also for shaping the conversation and ensuring its quality. Slightly better might be to include scale-specific guides in test manuals. However, not all disciplines use test manuals or subject them to peer review.

Therefore, we suggest that the most feasible way to encourage the development of the proposed third literature is to make scale-specific translation guides publishable via a peer-reviewed outlet (journal, journal special section, or edited book series). Reviewers might include construct experts, translation-literature experts, and prospective translators. The next section considers the two main potential objections to this proposal.

Two Potential Objections

Readership

The first main objection to creating a peer-reviewed outlet for scale-specific translation guides is limited readership. We estimate the target audience for a scale-specific guide (translators of that scale) is unlikely to exceed 50 to 100 researchers. However, each one would likely cite the guide, providing a citation floor exceeding the mean for psychology articles (< 1 citation annually; even prominent journals have a 50-year mean of 2.1 annually; Cho et al., 2012; Harzing, 2010).
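The comparison above can be made concrete with a short calculation. The sketch below is illustrative only: the 50 to 100 translators and the two benchmark rates come from the text, whereas the 10-year horizon and the assumption that each translator cites the guide exactly once are hypothetical choices made for the example.

```python
# Benchmarks quoted in the text.
FIELD_MEAN_PER_YEAR = 1.0      # upper bound: "< 1 citation annually"
PROMINENT_MEAN_PER_YEAR = 2.1  # 50-year mean for prominent journals

# Hypothetical assumption (not from the article): all translators
# cite the guide once, spread evenly over this many years.
HORIZON_YEARS = 10


def annual_rate(total_citations: float, years: float) -> float:
    """Average citations per year over the chosen horizon."""
    return total_citations / years


def break_even_years(total_citations: float, benchmark_rate: float) -> float:
    """Horizon at which the guide's average rate falls to the benchmark."""
    return total_citations / benchmark_rate


for translators in (50, 100):
    rate = annual_rate(translators, HORIZON_YEARS)
    limit = break_even_years(translators, PROMINENT_MEAN_PER_YEAR)
    print(
        f"{translators} translators: {rate:.1f} citations/year over "
        f"{HORIZON_YEARS} years; matches the prominent-journal mean "
        f"only if spread over {limit:.1f} years"
    )
```

Under these assumptions, the lower bound of the estimate stays above both benchmarks unless the citations are spread over several decades.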

Secondary audiences are also likely. The article most akin to, although different from, the sort of scale-specific translation guide proposed is an edited chapter titled “Guide for Constructing Self-Efficacy Scales” by Bandura (2006b). It was created not for adapting self-efficacy scales into new languages but for adapting them to new competency domains (e.g., self-efficacy at maintaining an exercise routine) and is therefore not an ideal comparison. Still, it reads much like the exemplar guide, highlighting construct-level issues (e.g., how self-efficacy is different from self-esteem and locus of control), item-wording issues (e.g., “items should be phrased in terms of can do rather than will do” to emphasize capacity; Bandura, 2006b, p. 308), and activities to prioritize (e.g., piloting). Readers not interested in measuring self-efficacy might understandably consider this minutiae because, similar to our exemplar in the Supplemental Material, Bandura’s target audience is scale creators for one construct.

Nevertheless, this chapter became one of the most cited articles in self-efficacy (cited 7,483 times, Google Scholar, June 2021). This is poorly explained by outlet prestige. The book had 15 chapters; Bandura’s guide has been cited 6 times more than the next most-cited chapter and 30 times more than the rest. Heavy citations are also poorly explained by authorship because by chance, the next most-cited chapter was also written by Bandura (2006a) and for a wider target audience (developmental psychologists interested in agency). To better understand, we randomly selected 100 citing articles and categorized them by primary reason for citation (4% cited erroneously or were unavailable), as follows:

Forty-six percent cited the guide to discuss conceptual insights into the nature, definition, or function of self-efficacy, for example, “Bandura (2006) understood self-efficacy as . . .”

Thirty-one percent cited the guide to create a new domain-specific self-efficacy scale (target audience).

Sixteen percent cited the guide to evaluate a previously created self-efficacy scale.

Three percent cited the guide to apply Bandura’s methodological insights to other scales or constructs.

In sum, two thirds of citations (65%) were by secondary audiences, especially researchers interested not in measurement but in construct definition. Although we do not suggest that scale-specific translation guides will receive as many citations overall, we can expect some citations outside the target audience from researchers (a) discussing the construct, (b) evaluating previous translations, or (c) applying methodological insights to other scales. If guides are created on touchstone measures of important constructs (i.e., two of the eight specified conditions justifying a guide are met), they would provide deep insight from a top theorist (e.g., Haidt on the MFQ). Impact factors could rise further as outlets become clearinghouses for only those scales editors and reviewers deem worthy of and ready for translation, with translators increasingly avoiding scales without published guides.
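The tally behind the 65% figure can be checked with a few lines of arithmetic. This is a minimal sketch using only the percentages reported above; the dictionary labels are paraphrases introduced for the example, and the second ratio (excluding the 4% that could not be classified) is an additional calculation not reported in the article.

```python
# Shares of the 100 sampled citing articles, by primary reason for
# citation, as reported in the text (percentages of the full sample).
TARGET = "building a new domain-specific scale (target audience)"
UNCLASSIFIED = "cited erroneously or unavailable"

reasons = {
    "conceptual discussion of self-efficacy": 46,
    TARGET: 31,
    "evaluating an existing self-efficacy scale": 16,
    "applying the methods to other scales or constructs": 3,
    UNCLASSIFIED: 4,
}

total = sum(reasons.values())
classifiable = total - reasons[UNCLASSIFIED]

# Secondary audiences: every classified reason other than the target audience.
secondary = sum(
    share for label, share in reasons.items()
    if label not in (TARGET, UNCLASSIFIED)
)

print(f"total sampled: {total}%")
print(f"secondary audiences, share of full sample: {secondary}%")
print(
    "secondary audiences, share of classifiable citations: "
    f"{secondary / classifiable:.1%}"
)
```

The categories sum to the full sample, and the secondary-audience share is close to two thirds whether or not the unclassifiable citations are counted in the denominator.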

Subjectivity

The second main objection to establishing a publication outlet for scale-specific translation guides might be subjectivity. Indeed, various discussions in these guides will be based on pilot evidence, scale-builder speculation, initial translation efforts that may not be representative, and so forth. This concern is reasonable and reflects an appropriately high standard for scientific claims, but it is misplaced. To borrow a nautical analogy, scale-specific guides are not navigational charts claiming to map objective reality, but pragmatic advice from one sailor to the next on what to look out for when navigating a tricky waterway. The task is subjective yet unavoidable. Each translation process involves hundreds if not thousands of decisions about what words to use in the target language. The question is not whether subjectivity is inherent in the field’s scientific methodology but whether psychologists can bring themselves to openly discuss it.

Open discussion would advance the three values of openness, integrity, and reproducibility identified by the Center for Open Science (Nosek, 2017). Increased openness would promote scale-creator integrity by counteracting questionable practices, such as obfuscating or omitting key item-writing decisions (Ferguson, 2015), as well as translator integrity; when decisions are based on “personal communication,” readers cannot discern the advice given, whether it was followed, or how adherence might affect score interpretation. Indeed, 16% of authors citing Bandura’s guide did so to critique the scales and findings of other researchers. Critically, low cross-cultural equivalence can be considered a replication issue in which insufficient item-writing detail precludes reliable replication of item meaning in new languages. If so, addressing it is a step toward ameliorating the broader “replication crisis” (Maxwell et al., 2015). Finally, we add a fourth value: researcher skill. Good item writing and translation are difficult, they take skill, and skill development is aided by a community of experts who openly discuss their actions and rationale.

Next Steps: Testing the Approach Empirically

Before any incentivizing, however, it remains an open empirical question whether such guides would meaningfully improve scale equivalence. More scale-specific information is theorized to be helpful (e.g., Eremenco et al., 2017), but this has not been shown. We suggest research proceed on dual fronts. First, to define the problem further, metascience studies could survey (a) scale translators on their access to scale-specific translation guidance (related to Katan, 2009) and (b) original scale creators on whether such materials were produced. Second, to determine causation, researchers might (a) conduct experiments in which translators adapt a scale with or without a scale-specific translation guide (inspired by PACTE Group, 2008) and (b) review the long-term impact (if any) of guides on the cross-cultural equivalence of translated scales (similar to Stevanovic et al., 2017). The Primals Inventory translation guide may serve as an imperfect test case for the latter.

Concluding Remarks

A perennial difficulty in the social sciences is balancing nomothetic approaches (focusing on averages and general laws) with idiographic concerns (the characteristics of particular cases). Since Freud, psychology has tilted nomothetic (Robinson, 2011). Efforts to resolve low scale equivalence, including translation guides, reflect that leaning. In cases in which those efforts have not worked, however, new approaches should be discussed. Our suggestion is to try something more idiographic. We detailed a particular approach that we believe has been invaluable to our research area, could be helpful to others, and seems feasible (detailed, publishable, peer-reviewed, five-part, scale-specific translation guides). However, the specificity provided should not be interpreted as a recommendation of best practice but, rather, a spark for other approaches and testing. We hope that in due course, the empirical literature matures sufficiently to warrant recommendation of best practice.

In the meantime, our overarching argument is for the value of increased translator access to scale-specific information, in whatever form. Scale building involves two main tasks: writing items and empirically validating them. Both tasks require methodological description, with the necessary level of detail determined by what makes replication possible (e.g., Kallet, 2004). Yet original scale articles typically spend pages describing empirical validation and a sentence or two on item writing. The logical consequence is difficulty reproducing item meaning in new contexts. If scale translation is the effort to replicate items’ received message in a new linguistic landscape (Egger et al., 2003), cross-cultural research is the arena in which the consequences of insufficient item-writing detail should be most pronounced. Of course, some translators succeed anyway, but whenever guesswork misses the mark, scale equivalence suffers. Expecting otherwise is akin to believing experimental studies can forgo detailed method sections without replication failures because generic guidance exists for how to conduct scientific experiments. Particularities matter. Not all can be discussed; important ones can be.

Supplemental Material


Supplemental material, sj-docx-1-pps-10.1177_17456916221119396 for Improving Scale Equivalence by Increasing Access to Scale-Specific Information by Alicia B. W. Clifton, Alexander G. Stahlmann, Jennifer Hofmann, Alice Chirico, Rive Cadwallader and Jeremy D. W. Clifton in Perspectives on Psychological Science

Footnotes

Acknowledgements

We are grateful to Louis Tay, Jamie DeCoster, David Watson, Lee Anna Clark, and especially Richard Lerner for advice; Sonya Eremenco, Ben Arnold, and Dana Weiss, who provided comments on translation in the medical research field; and Primals Inventory translators who awakened us to scale-specific translation issues.

Transparency

Action Editor: Tina M. Lowrey

Editor: Klaus Fiedler

ORCID iDs

Alicia B. W. Clifton

Alexander G. Stahlmann

Alice Chirico

Rive Cadwallader

Jeremy D. W. Clifton

Supplemental Material

Additional supporting information can be found at


References

Acquadro, Patrick, D. L., Eremenco, Martin, M. L., Kuliś, Correia, & Conway; on behalf of the International Society for Quality of Life Research (ISOQOL) Translation and Cultural Adaptation Special Interest Group (TCA-SIG). (2018). Emerging good practices for translatability assessment (TA) of patient-reported outcome (PRO) measures. Journal of Patient-Reported Outcomes, 2(1), Article 8. https://doi.org/10.1186/s41687-018-0035-8

Arafat, Chowdhury, Qusar, & Hafez (2016). Cross cultural adaptation and psychometric validation of research instruments: A methodological review. Journal of Behavioral Health, 5, 129–136.

Bandura (2006a). Adolescent development from an agentic perspective. In Pajares & Urdan (Eds.), Self-efficacy beliefs of adolescents (Vol. 5, pp. 1–43). Information Age.

Bandura (2006b). Guide for constructing self-efficacy scales. In Pajares & Urdan (Eds.), Self-efficacy beliefs of adolescents (Vol. 5, pp. 307–337). Information Age.

Beaton, D. E., Bombardier, Guillemin, & Ferraz, M. B. (2000). Guidelines for the process of cross-cultural adaptation of self-report measures. Spine, 25, 3186–3191.

Borsa, J. C., Damásio, B. F., & Bandeira, D. R. (2012). Cross-cultural adaptation and validation of psychological instruments: Some considerations. Paidéia (Ribeirão Preto), 22, 423–432.

Chen, F. F. (2008). What happens if we compare chopsticks with forks? The impact of making inappropriate comparisons in cross-cultural research. Journal of Personality and Social Psychology, 95, 1005–1018.

Cho, K. W., Tse, C. S., & Neely, J. H. (2012). Citation rates for experimental psychology articles published between 1950 and 2004: Top-cited articles in behavioral cognitive psychology. Memory & Cognition, 40(7), 1132–1161.

Clifton, J. D. W. (2020). Managing validity versus reliability trade-offs in scale-building decisions. Psychological Methods, 25, 259–270.

Clifton, J. D. W., Baker, J. D., Park, C. L., Yaden, D. B., Clifton, A. B. W., Terni, Miller, J. L., Zeng, Giorgi, Schwartz, H. A., & Seligman, M. E. P. (2019). Primal world beliefs. Psychological Assessment, 31, 82–99.

DeVellis, R. F. (2016). Scale development: Theory and applications (Vol. 26). Sage.

Diener, E. D., Emmons, R. A., Larsen, R. J., & Griffin (1985). The satisfaction with life scale. Journal of Personality Assessment, 49, 71–75.

Egger, J. I., De Mey, H. R., Derksen, J. J., & van der Staak, C. P. (2003). Cross-cultural replication of the five-factor model and comparison of the NEO-PI-R and MMPI-2 PSY-5 scales in a Dutch psychiatric sample. Psychological Assessment, 15(1), 81–88.

Eremenco, Pease, Mann, & Consortium’s Process Subcommittee. (2017). Patient-reported outcome (PRO) consortium translation process: Consensus development of updated best practices. Journal of Patient-Reported Outcomes, 2(1), Article 12. https://doi.org/10.1186/s41687-018-0037-6

Ferguson, C. J. (2015). “Everybody knows psychology is not a real science”: Public perceptions of psychology and how we can improve our relationship with policymakers, the scientific community, and the general public. American Psychologist, 70(6), 527–543.

Graham, Haidt, & Nosek, B. A. (2009). Liberals and conservatives rely on different sets of moral foundations. Journal of Personality and Social Psychology, 96(5), 1029–1046.

Haidt (2010, January 1). How to translate the MFQ. YourMorals Blog. https://ymblog.jonathanhaidt.org/2010/01/how-to-translate-the-mfq/comment-page-1/

Hambleton, R. K., & Lee, M. K. (2013). Methods for translating and adapting tests to increase cross-language validity. In Saklofske, D. H., Reynolds, C. R., & Schwean, V. L. (Eds.), The Oxford handbook of child psychological assessment (pp. 172–181). Oxford University Press.

Hambleton, R. K., & Zenisky, A. L. (2011). Translating and adapting tests for cross-cultural assessments. In Matsumoto & van de Vijver, F. J. R. (Eds.), Cross-cultural research methods in psychology (pp. 46–74). Cambridge University Press.

Harzing, A. W. (2010). The publish or perish book: Your guide to effective and responsible citation analysis. Tarma Software Research Pty Ltd.

Herdman, Fox-Rushby, & Badia (1998). A model of equivalence in the cultural adaptation of HRQoL instruments: The universalist approach. Quality of Life Research, 7(4), 323–335.

International Test Commission. (2017). The ITC guidelines for translating and adapting tests (2nd ed.). https://www.intestcom.org/files/guideline_test_adaptation_2ed.pdf

Iurino & Saucier (2020). Testing measurement invariance of the Moral Foundations Questionnaire across 27 countries. Assessment, 27(2), 365–372.

Johnson, T. P., Shavitt, & Holbrook, A. L. (2011). Survey response styles across cultures. In Matsumoto & van de Vijver, F. J. R. (Eds.), Cross-cultural research methods in psychology (pp. 130–175). Cambridge University Press.

Kallet, R. H. (2004). How to write the methods section of a research paper. Respiratory Care, 49(10), 1229–1232.

Katan (2009). Occupation or profession: A survey of the translators’ world. Translation and Interpreting Studies, 4, 187–209.

Leung, Lam, B. C. P., Bond, M. H., Conway, L. G., Gornick, L. J., Amponsah, Boehnke, Dragolov, Burgess, S. M., Golestaneh, Busch, Hofer, Espinosa, del. C. D., Fardis, Ismail, Kurman, Lebedeva, Tatarko, A. N., Sam, D. L., . . . Zhou (2012). Developing and evaluating the social axioms survey in eleven countries: Its relationship with the five-factor model of personality. Journal of Cross-Cultural Psychology, 43(5), 833–857.

List of Adult Measures. (2021). Patient-reported outcomes measurement information system. U.S. Department of Health and Human Services. https://www.healthmeasures.net/explore-measurement-systems/promis/intro-to-promis/list-of-adult-measures

Matsumoto & van de Vijver, F. J. R. (Eds.). (2011). Cross-cultural research methods in psychology. Cambridge University Press.

Maxwell, S. E., Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does “failure to replicate” really mean? American Psychologist, 70(6), 487–498.

Nosek, B. A. (2017). Center for open science: Strategic plan. OSF Preprints. https://doi.org/10.31219/osf.io/x2w9h

PACTE Group: Beeby, Fernández, Fox, Hurtado Albir, Kozlova, Kuznik, Neunzig, Rodríguez, & Romero (2008). First results of a translation competence experiment: ‘Knowledge of translation’ and ‘Efficacy of the translation process.’ In Kearns (Ed.), Translator and interpreter training. Issues, methods and debates (pp. 104–126). Bloomsbury.

Robinson, O. C. (2011). The idiographic/nomothetic dichotomy: Tracing historical origins of contemporary confusions. History & Philosophy of Psychology, 13(2), 32–39.

Sanchez-Burks, Lee, Choi, Nisbett, Zhao, & Koo (2003). Conversing across cultures: East-West communication styles in work and nonwork contexts. Journal of Personality and Social Psychology, 85(2), 363–372.

SCImago. (n.d.). SJR — SCImago Journal & Country Rank [Portal]. http://www.scimagojr.com

Sheatsley, P. B. (1983). Questionnaire construction and item writing. In Rossi, P. H., Wright, J. D., & Anderson, A. B. (Eds.), Handbook of survey research (pp. 195–230). Academic Press.

Stahlmann, A. G., Hofmann, Ruch, Heinz, & Clifton, J. D. W. (2020). The higher-order structure of primal world beliefs in German-speaking countries: Adaptation and initial validation of the German Primals Inventory (PI-66-G). Personality and Individual Differences, 163, Article 110054. https://doi.org/10.1016/j.paid.2020.110054

Stevanovic, Jafari, Knez, Franic, Atilola, Davidovic, Bagheri, & Lakic (2017, February). Can we really use available scales for child and adolescent psychopathology across cultures? A systematic review of cross-cultural measurement invariance data. Transcultural Psychiatry, 54(1), 125–152. https://doi.org/10.1177/1363461516689215

Tay & Jebb, A. T. (2018). Establishing construct continua in construct validation: The process of continuum specification. Advances in Methods and Practices in Psychological Science, 1, 375–388.

van de Vijver, F. J., & Hambleton, R. K. (1996). Translating tests. European Psychologist, 1, 89–99.

van de Vijver, F. J., & Leung (2021). Methods and data analysis for cross-cultural research (Vol. 116). Cambridge University Press.

The WHOQOL Group. (1994). The development of the World Health Organization Quality of Life Assessment Instrument. In Orley & Kuyken (Eds.), Quality of life assessment: International perspectives (pp. 41–60). Springer.

Wild, Grove, Martin, Eremenco, McElroy, Verjee-Lorenz, & ISPOR Task Force for Translation Cultural Adaptation. (2005). Principles of good practice for the translation and cultural adaptation process for patient-reported outcomes (PRO) measures. Value in Health, 8(2), 94–104.

Willis, G. B. (2004). Cognitive interviewing: A tool for improving questionnaire design. Sage.
