Abstract
To investigate the effect of using negatively oriented items, we wrote semantic reversals of the items in the Rosenberg Self Esteem Scale, UCLA Loneliness Scale, and the General Belongingness Scale and used them to create four experimental conditions. Participants (N = 2,019) were recruited through Amazon’s Mechanical Turk. Data were assessed for dimensionality, item functioning, instrument properties, and associations with other variables. Regarding dimensionality, although a two-factor model (positively vs. negatively oriented factors) exhibits better fit than a unidimensional model across all conditions, bifactor indices were used to argue that a unidimensional interpretation of the data can be employed. With respect to item functioning, factor loadings were found to be nearly invariant across conditions, but thresholds were not. Concerning instrument properties, inclusion of negatively oriented items results in lower mean scores and higher score variances. Instruments with both positively and negatively oriented items demonstrated lower reliability estimates than those with only one orientation. For associations with other variables, path coefficients in a model where loneliness mediates the effects of belongingness on life satisfaction and self-esteem were found to vary across conditions. Findings suggest that negatively oriented items have minor impact on instrument quality, but influence measurement model and path coefficients.
Whether tis nobler in a measure to suffer
The slings and arrows of methodological variance,
Or to take arms against negative item orientation
And by opposing stop its use.
Researchers consider many elements when writing items during instrument development, including construct relevance, difficulty of endorsement, and comprehensibility. Typically, item-writing consists of writing positively oriented items that capture the presence of the construct of interest; however, many researchers also include negatively oriented items. The use of negatively oriented items is often justified as a remedy to response style errors, specifically acquiescence and extreme response bias (Barnette, 2000; Sauro & Lewis, 2011). Despite the inclusion of negatively oriented items being a common practice, their usefulness has been questioned (K. L. Cole et al., 2019; Sliter & Zickar, 2014). Furthermore, negatively oriented items are often included uncritically; as stated by Dalal and Carter (2014), “Including negatively worded items has become so commonplace, that many academics, practitioners, and students include negatively worded items in their scales without specifying why, as if to suggest that the practice is so well known that describing their rationale is not necessary” (p. 114).
Psychometric research has long provided evidence for differences between positive and negative item orientation, including effects on means, variances, reliabilities, and factor structures (Wang et al., 2018). Additionally, increased comprehension difficulty and reduced item discrimination of negatively oriented items relative to positively oriented items have been evidenced (Barnette, 2000; Sliter & Zickar, 2014). This growing body of research suggests that the use of negatively oriented items, although prevalent, varies in methodological purpose and can cause unexpected or unintended psychometric issues resulting in poor measurement of the constructs of interest (e.g., Dalal & Carter, 2014). Although including negatively oriented items may be well-intentioned, the psychometric value of this practice may not be well-founded.
The purpose of this study is to further examine orientation effects on dimensionality, item properties, instrument properties, and relationships with external variables in a manner that has not been done previously. Our aim is to extend the current research by experimentally manipulating the use of positively and negatively oriented items, contributing findings that better explain the validity of negatively oriented items, thus promoting psychometrically sound instrument development and item writing practices.
Controversial Value of Negative Item Orientation
The phrase “negatively oriented” indicates that endorsement of an item is associated with lower levels of the measured trait (Bandalos, 2018, p. 90). This property is sometimes referred to as “negatively worded” (DeVellis, 2016), “reverse-coded” (DiStefano & Motl, 2006), or “negatively keyed” (Lindwall et al., 2012). Because scores on such items have a negative correlation with the trait being measured, they are typically recoded before being included in the construction of a total scale score. For example, in the General Belongingness Scale (GBS; Malone et al., 2012), the item “I feel as if people do not care about me” is negatively oriented because higher scores on this item indicate lower levels of sense of belonging. The negative orientation was accomplished through use of the word “not;” items which feature a negative particle are referred to as “negatively phrased” (Bandalos, 2018, p. 93). It is also possible for a negatively oriented item to convey the opposite meaning of a construct without use of negation. For example, in the UCLA Loneliness Scale (UCLALS; Russell et al., 1980), endorsement on the item “I feel part of a group of friends” is negatively related to loneliness and yet the item stem contains no negative particle. Such items are referred to as “reverse worded” since the semantics of the wording results in an inverse relationship with the construct of interest (van Sonderen et al., 2013).
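The recoding step described above can be sketched in a few lines. This is a minimal illustration, not the scoring procedure of any particular instrument: the function names and the two item labels are hypothetical, and the sketch assumes an integer response format with known minimum and maximum.

```python
def reverse_code(score, minimum, maximum):
    """Recode a response so that higher values indicate more of the trait."""
    if not minimum <= score <= maximum:
        raise ValueError("score outside the response format")
    return minimum + maximum - score


def total_score(responses, negatively_oriented, minimum, maximum):
    """Sum item responses after recoding the negatively oriented items.

    responses: dict mapping item label -> raw response
    negatively_oriented: set of item labels that must be recoded
    """
    return sum(
        reverse_code(r, minimum, maximum) if item in negatively_oriented else r
        for item, r in responses.items()
    )


# Hypothetical responses on a 7-point format (1 = strongly disagree).
example = {"accepted": 6, "people_do_not_care": 2}
print(total_score(example, {"people_do_not_care"}, 1, 7))
```

On a 7-point format, a raw response of 2 to the negatively oriented item is recoded to 6 before it enters the total.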
Dalal and Carter (2014) conducted a comprehensive literature review of the psychometric issues associated with negative item orientation. They concluded, from previous studies as well as psychometric theory, that positively oriented items and negatively oriented items do not measure the same construct. Furthermore, Dalal and Carter found that systematic method variance resulted when negatively oriented items were used in an instrument, calling into question the validity of the data collected from instruments using negative item orientation. This method variance can artificially inflate correlations with scores from other instruments containing negatively oriented items or artificially deflate correlations with other instruments not consisting of negatively oriented items. Most important, they concluded that negatively oriented items caused measurement error, primarily evidenced by loading onto a separate unintended factor within an exploratory factor analysis (EFA). In other words, participants’ patterns of responding to the negatively oriented items generated an independent factor representing an artifact of the language of the item rather than its underlying construct. This can cause researchers to inappropriately describe data as being multidimensional when the underlying construct is in fact unidimensional.
Key studies support these psychometric concerns with negative item orientation. For example, Sliter and Zickar (2014) used item response theory (IRT) to examine whether negatively oriented items provided the same psychometric properties as positively oriented items when used in a single self-report instrument. They found evidence that negatively oriented items provide less information, have lower discriminations, and generally degrade measurement quality. They advocated that careful consideration should be taken when choosing “whether to include negatively worded items in self-report scales, because negatively and positively worded items simply do not function equivalently” (p. 223). Zhang et al. (2016) expanded on the work of Sliter and Zickar by experimentally comparing the performance of reverse worded and negatively phrased items. They found negatively phrased items and reverse worded items had different properties and sources of common variance beyond the intended substantive factor. Zhang et al. (2016) recommend careful consideration and investigation of the specific form a negatively oriented item takes.
Other researchers have examined the existence of a method factor in data related to item orientation in specific self-report instruments. For example, Gu et al. (2015) applied a bifactor model to data collected from a commonly used instrument to demonstrate the presence of a specific factor related to the negative orientation and the implications this has for interpreting data from this instrument as unidimensional. The authors found that some item commonality could be explained by a separate factor modeling the wording of negatively oriented items rather than any substantive construct (Gu et al., 2015). Similar findings from models employing only a negative method factor have been demonstrated by a number of other scholars (e.g., DiStefano & Motl, 2006). Other researchers have suggested that common methodological variance can be attributed to both positively and negatively oriented items, and therefore suggest employing both positive and negative method factors for mixed orientation instruments (Wang et al., 2018).
Several studies have approached the investigation of item orientation effects by manipulating existing self-report instruments. For example, Greenberger et al. (2003) manipulated the Rosenberg Self-Esteem Scale (RSES; Rosenberg, 1965). Their experimental approach suggested that the original RSES with both positively and negatively oriented items was multidimensional. However, when the RSES was manipulated to consist of only positively or negatively oriented versions, the researchers found evidence for unidimensionality. Several other researchers have followed this line of inquiry with different psychological instruments. Salazar (2015) adapted the Keyes Subjective Well-Being Scale into a positively oriented version and two additional versions that combined positively and negatively oriented items differently. Results from this study indicated that the combined versions had lower estimates of coefficient alpha (α) and that responses to the positively oriented instrument may have been contaminated by acquiescence response bias. Zhang et al. (2016) followed a procedure similar to that established by Greenberger et al. (2003) using the Need for Cognition Scale (Cacioppo & Petty, 1982). They reported findings of unidimensionality for positively and negatively oriented versions, as well as poor unidimensional fit for the original Need for Cognition Scale that had items with mixed orientation. The findings from Zhang et al. aligned with those found by Greenberger et al. (2003). Likewise, K. L. Cole et al. (2019) compared four versions of the Perceived Stress Scale (Cohen et al., 1983) and found that the unidimensional model exhibited the best fit and highest overall reliability (α) in the positively oriented condition.
This collection of experimental studies indicates a number of possible problems that utilizing negatively oriented items can incur: introduction of additional sources of common variance, diminished quality of model fit, lower item information which reduces reliability, and increased susceptibility to acquiescent responding. Accordingly, many authors have suggested that the controversial use of negatively oriented items requires further investigation, with many self-report instruments being potentially susceptible to similar unintended psychometric consequences.
Review of Common Self-Report Instruments With Positively and Negatively Oriented Items
In addition to the RSES, this study investigates the UCLALS (Russell et al., 1980), which measures loneliness, and the GBS (Malone et al., 2012), which measures general sense of belonging. The instruments chosen for the current study (a) include both positively and negatively oriented items; (b) have been investigated extensively to provide validity evidence based on internal (dimensional) structure and relationships with other variables (e.g., convergent and discriminant validity; American Educational Research Association et al., 2014); and (c) measure theoretically related, but distinct, constructs (see, e.g., Baumeister & Leary, 1995; Mellor et al., 2008).
Rosenberg Self-Esteem Scale (Rosenberg, 1965)
The use of the RSES has been investigated across a wide range of contexts (e.g., Chao et al., 2017; Mane, 2016) and languages (e.g., Abdel-Khalek, 2011; Konaszewski & Sosnowski, 2017; C. H. Wu, 2008). Previous studies on the internal structure of data from the RSES suggested findings that include one-factor (Shevlin et al., 1995), two-factor (Supple et al., 2013), and bifactor (McKay et al., 2015) solutions, where positively and negatively oriented items form separate factors in the multidimensional models. A recurring finding is that the model which best fits data from the RSES consisted of a single trait factor with either positively or negatively phrased items loading onto a latent method factor (Gana et al., 2013; Marsh et al., 2010). However, the question of whether the RSES measures one or two constructs is not settled (McKay et al., 2015; Supple et al., 2013).
UCLA Loneliness Scale (Russell et al., 1980)
The UCLALS has been investigated across a variety of samples, including high school (Dussault et al., 2009) and university (Durak & Senol-Durak, 2010) contexts. Studies of the UCLALS have produced solutions ranging from one-factor (e.g., Hartshorne, 1993) to two-factor (e.g., Wilson et al., 1992) and three-factor (e.g., Shevlin et al., 2015) models. The two-factor solution suggested negatively oriented items load onto a different factor than positively oriented items. The three-factor solution suggested that all negatively oriented items loaded onto one factor and the positively phrased items were divided among two separate factors. In many cases (e.g., Durak & Senol-Durak, 2010; Shevlin et al., 2015) in which a multidimensional factor model was employed, an overall total scale score was used.
General Belongingness Scale (Malone et al., 2012)
As the most recent of the instruments in this study, the GBS has had limited psychometric evaluation. Two existing studies examined the factor structure of data collected from the GBS. Malone et al. (2012) fit a two-factor confirmatory factor analysis (CFA) model in which the six positively oriented items loaded on the acceptance factor and the six negatively oriented items loaded on the rejection factor, which produced tenable fit. Despite identifying two factors, Malone et al. (2012) argued for use of a unidimensional total observed score to represent general sense of belonging because of high observed interfactor correlation. Yildiz (2017) adapted the GBS into a Turkish version and found the same two-factor solution using EFA.
Relationships Among Self-esteem, Loneliness, and Belonging
The psychological instruments identified for this survey experiment provide useful information independently, but in relation to each other offer the opportunity for further analyses, including explanatory capabilities through structural relationships (i.e., path analysis). Researchers (e.g., Baumeister & Leary, 1995; Mellor et al., 2008) have hypothesized relationships among our main study variables and life satisfaction. Specifically, these studies suggest that loneliness mediates the effects of belongingness on life satisfaction and self-esteem. A widely used measure of life satisfaction is the Satisfaction with Life Scale (SWLS; Diener et al., 1985) which consists entirely of positively oriented items.
Purpose of Study
Considering the extant literature addressing measurement artifacts due to orientation effects and reported factor analyses of the self-report instruments identified for this study, we investigated the value of negative item orientation on popular psychological instruments. Specifically, we sought to extend the experimental conditions of Greenberger et al. (2003) with the UCLALS and GBS to investigate if identical dimensionality, measurement invariance, statistical properties, and structural invariance exist across different item orientation conditions.
Our investigation approaches the issue of orientation effects through an experimental survey design extending previous research (e.g., K. L. Cole et al., 2019; Greenberger et al., 2003; Salazar, 2015; Zhang et al., 2016). Our experiment presents four conditions: the original instrument (original condition), an inversion of the original instrument items (reversal condition, which was not studied by Greenberger et al., 2003), all items positively oriented (all positive condition), and all items negatively oriented (all negative condition). Explained in detail in our Method section, these four conditions will allow us to discern the internal structure of the constructs of interest as they exist within our study sample. In comparison with previous studies (Benson & Hocevar, 1985; Marsh, 1996; Sliter & Zickar, 2014), our ability to conduct a more rigorous analysis using modern techniques builds on existing knowledge about the use of positively and negatively oriented items. Furthermore, by using multiple self-report instruments measuring different constructs, we will be able to more thoroughly investigate the effect of item orientation on relationships with other variables. In this study, we aim to evaluate four primary research questions:
Method
Design and Procedure
Experimental conditions expanding on the methodology of previous studies (e.g., Greenberger et al., 2003; Salazar, 2015; Zhang et al., 2016) were used to investigate the value of negative item orientation on the 10-item RSES (5 positively, 5 negatively oriented), 20-item UCLALS (10 positively, 10 negatively oriented), and 12-item GBS (6 positively, 6 negatively oriented). In the original condition, items were presented unaltered as they appeared on the original instrument. The reversal condition consisted of a semantic reversal of all the original items such that positively oriented items were restructured using negatively oriented language and vice versa. Using original and reversed items, the all positive condition presented only positively oriented items, whereas the all negative condition used only negatively oriented items. The text of all items across all conditions can be found in the online supplemental materials, available at https://doi.org/10.17605/OSF.IO/WQF9A. In addition to the psychological instruments with positively and negatively oriented items, the SWLS was included in all conditions to aid in providing validity evidence based on relationships with other variables.
Data were collected using Qualtrics, which presented the survey to participants in a web-based format and stored responses in a secure online management system. The mean length of time spent completing the survey was 6.95 min (SD = 5.22 min).
Participants were randomly assigned to one of four conditions which included:
Original condition: 21 positively oriented items, 21 negatively oriented items;
Reversal condition: 21 negatively oriented items, 21 positively oriented items;
All positive condition: 42 items, all positively oriented;
All negative condition: 42 items, all negatively oriented.
The 5-item, all positively oriented SWLS was included in all four conditions, so that each participant responded to a total of 47 items. All items across all instruments were presented in a random order to combat order effects, which can manifest as local dependency. The institutional review board of the author’s primary institution approved this study (IRB no. 45271) and all participants provided electronic approval to participate prior to completing the survey. We report how we determined our sample size, all data exclusions, all manipulations, and all measures in the study. The data collected for this project as well as all analysis files can be found in the online supplemental materials.
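The composition of the four conditions can be verified with simple arithmetic. The sketch below is illustrative only: the condition labels and function name are ours, and the per-instrument item counts are taken from the instrument descriptions in the Instruments section (RSES, 10 items; UCLALS, 20 items; GBS, 12 items).

```python
# (positively oriented, negatively oriented) item counts in the original
# version of each manipulated instrument, per the instrument descriptions.
ORIGINAL = {"RSES": (5, 5), "UCLALS": (10, 10), "GBS": (6, 6)}
SWLS_ITEMS = 5  # all positively oriented and identical in every condition


def condition_counts(condition):
    """Return (positive, negative) counts for the manipulated items."""
    pos = sum(p for p, _ in ORIGINAL.values())
    neg = sum(n for _, n in ORIGINAL.values())
    if condition == "original":
        return pos, neg
    if condition == "reversal":
        return neg, pos  # every item swaps orientation
    if condition == "all_positive":
        return pos + neg, 0
    if condition == "all_negative":
        return 0, pos + neg
    raise ValueError(f"unknown condition: {condition}")


for name in ("original", "reversal", "all_positive", "all_negative"):
    pos, neg = condition_counts(name)
    print(name, pos, neg, pos + neg + SWLS_ITEMS)
```

Each condition contains 42 manipulated items, and adding the SWLS gives the 47 items answered by every participant.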
Participants
Participants were recruited anonymously through Amazon’s Mechanical Turk (MTurk). In total, 2,019 participants who met the inclusion criteria (English-speaking from the United States or Canada, 18 years old and high school educated at minimum) completed the study. Participants included 704 males and 1,299 females, with an average age of 38.68 years (SD = 12.43 years). Participants completed a version of each instrument including demographic questions. For each condition, the number of participants ranged from 503 to 507. The sample demographics (see Table 1) indicate the sample is reasonably representative of the adult population, despite having more females, older participants, and participants with higher levels of education compared with the typical MTurk participant pools recruited for online surveys (Keith et al., 2019).
Table 1. Description of Study Participants by Experimental Condition.
Note. Participants recruited using Amazon’s Mechanical Turk following guidelines recommended by Follmer et al. (2017). N = number of participants.
Guidance presented by Follmer et al. (2017) regarding attention checks and procedural details in setting up the MTurk panel was followed to ensure that our sample reflected typical participants on these instruments sourced from traditional methods (e.g., college-aged or community-based samples). Three attention check questions were used to verify legitimate participation (i.e., one verification of human participant [not a robot], and two items in which participants were directed to choose a predetermined response). Participants who failed any attention check question were routed to the end of the survey without compensation. Participants aged 18 years or older who passed all attention check questions were compensated $0.25 for their time.
Instruments
The original instruments are described below and estimates of ordinal coefficient α computed using polychoric correlations are provided from our study. Appendix A provides the text of all items used in all conditions.
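As a rough illustration of how coefficient α follows from a correlation matrix, the sketch below applies the standard α formula to a matrix of inter-item correlations. It assumes the polychoric correlations have already been estimated elsewhere; estimating them is not shown, and the function name and example matrix are hypothetical rather than taken from our analysis files.

```python
def alpha_from_correlations(corr):
    """Coefficient alpha computed from a k x k inter-item correlation matrix.

    When `corr` holds polychoric correlations, the result is the ordinal
    alpha; with Pearson correlations it is the standardized alpha.
    """
    k = len(corr)
    if k < 2 or any(len(row) != k for row in corr):
        raise ValueError("corr must be a square matrix with at least 2 items")
    total = sum(sum(row) for row in corr)        # sum of all matrix entries
    trace = sum(corr[i][i] for i in range(k))    # equals k for correlations
    return (k / (k - 1)) * (1 - trace / total)


# Hypothetical matrix: three items with all inter-item correlations at .50.
example = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
print(round(alpha_from_correlations(example), 2))
```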
Rosenberg Self-Esteem Scale (Rosenberg, 1965)
The RSES measures individual self-esteem and uses 10 items and a 4-point Likert-type response format ranging from 1 (strongly disagree) to 4 (strongly agree). Five of the items are positively oriented, including “On the whole I am satisfied with myself.” The other five items are negatively oriented and reverse coded, including “At times I think I am no good at all.” Higher scores are an indication of higher self-esteem (αOriginal = .94, αReversed = .91, αAll Positive = .94, αAll Negative = .95).
UCLA Loneliness Scale (Russell et al., 1980)
The UCLALS is a 20-item instrument assessing global loneliness. The UCLALS uses a 4-point Likert-type response format ranging from 1 (never) to 4 (always). Ten items are positively oriented, including “I am unhappy being so withdrawn.” The other 10 items are negatively oriented and reverse coded; an example of a negatively oriented item is “I feel in tune with the people around me.” Higher scores reflect greater loneliness (αOriginal = .96, αReversed = .97, αAll Positive = .97, αAll Negative = .97).
The General Belongingness Scale (Malone et al., 2012)
The GBS is a 12-item instrument assessing global belongingness. The GBS uses a 7-point Likert-type response format ranging from 1 (strongly disagree) to 7 (strongly agree). Six items are positively oriented, including “I feel accepted by others.” The other six items are negatively oriented and reverse coded; an example of a negatively oriented item is “I feel as if people do not care about me.” Higher responses reflect greater sense of belonging (αOriginal = .96, αReversed = .95, αAll Positive = .96, αAll Negative = .97).
Satisfaction With Life Scale (Diener et al., 1985)
The SWLS measures individual satisfaction with one’s life and uses five items and a 7-point Likert-type response format ranging from 1 (strongly disagree) to 7 (strongly agree). All five items are positively oriented; an example item is “In most ways my life is close to my ideal.” Higher scores are an indication of higher satisfaction (α = .92).
Item Revisions for Experimental Manipulation
In order to construct the surveys for the four experimental conditions, it was necessary to reverse the orientation of all items on the original instruments. We began by developing a process for creating reversed orientation of items and evaluating these reversals. First, items were rewritten to relay the same content with opposite meaning of the original version. Additionally, negative particles (e.g., no, not) and affixal morphemes (e.g., un-, non-, dis-, -less) were avoided when possible, as direct negation in items is widely believed to increase cognitive load on the responder and result in lesser quality of measurement (Gnambs & Schroeders, 2020; Marsh, 1996; Sliter & Zickar, 2014). Instead, the use of antonyms to reverse orientation was preferred. For example, the UCLALS item “I am an outgoing person” was given the reversal “I am a shy person.” Finally, the research team considered the cognitive demand and complexity of the revision, ensuring that each item was comprehensible despite reversal based on these guidelines.
The research team created reversals of the original items on the UCLALS and GBS based on these criteria through a multistage writing process. First, each team member suggested oppositely oriented versions of all items. Then, the team came to consensus on the reversal of each item. Following this initial step, the research team reviewed all items and flagged items which did not meet our criteria. Flagged items were then revised and reassessed until all team members came to consensus about the quality of all item reversals. The reversed items of the RSES published by Greenberger et al. (2003) were used by the research team to allow for robust comparisons with prior research.
Expert Reviews
It was crucial that the reversed items were equivalent to the original items, so once the team agreed on the phrasing of the revised items, the item reversals were subjected to expert review. Expert review provides a degree of validity evidence demonstrating the items fully and clearly represent the construct of interest. Also, the expert review process was used to ensure that each item reversal was an accurate revision of the original item. Following the recommendations of Rubio et al. (2003), six individuals were chosen for expert review including advanced graduate and faculty researchers in psychology, linguistics, and applied linguistics.
Each expert completed a review of the original items and the items’ reverse orientations that included three main questions: (a) How well does the inverse of the original item assess the opposite meaning of the original item? (b) What proportion of adults will be able to comprehend the meaning of the opposite of the original item and respond as intended? and (c) If the opposite of the original item is not satisfactory, how would you recommend this item to be phrased? Experts responded to the first item using a 4-point Likert-type response format ranging from 1 to 4 (poorly, adequately, almost exactly, exactly). The second item included a response format ranging from 1 to 3 (some participants, most participants, almost all of the participants). The results of the expert review were then examined by the team, and any items that were rated as “poorly,” “adequately,” or “some participants” by two or more expert reviewers were selected for further review, reworded, and resent to experts until consensus was reached. Items were modified based on the recommendations of the experts. In total three items were flagged as “adequately” and subsequently modified by the team using feedback from the expert reviewers and group consensus.
Data Analyses
Power Analysis
Prior to participant recruitment, a power analysis was performed in Mplus 8.4 (L. Muthén & Muthén, 2019) using Monte Carlo simulation methods. The power analysis was performed to assure adequate sample size to address the second research question regarding measurement invariance testing in a CFA framework. Based on literature reviewed in the introduction, standardized factor loadings for a unidimensional model are expected to be approximately 0.7 or larger. Therefore, a population model of 10 polytomous items with standardized factor loadings of 0.7 was used. A sample size of 450 per condition was determined to be adequate to achieve at least 80% power for detecting a standardized factor loading difference of 0.10 on oppositely oriented items following the analytic strategy for measurement invariance testing described below. Additionally, it was found that a sample size of 400 per condition is sufficient to achieve at least 80% power for detecting a small threshold shift (0.2 standard deviations of the person distribution) on oppositely oriented items. Accordingly, a sample size of 500 per condition was deemed appropriate.
Data Cleaning
The data cleaning process was conducted in R (R Core Team, 2018) and included screening for outliers. Respondents who did not pass MTurk validation checks were excluded from the final data set; this criterion effectively eliminated participants who did not complete a substantial majority of the survey. Mahalanobis distances were used to identify multivariate outliers, whereas QQ plots and box plots were used to identify univariate outliers and participants whose responses did not fit well with the distribution of responses. Participants identified as outliers were subjected to sensitivity analysis; all analyses conducted in this project were performed both including and excluding the set of outliers. As no result was meaningfully different when including or excluding outliers, the outlying cases were retained for analyses. Missing data in the final data set was minimal at the item level, with levels of missingness below 1%.
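The multivariate screening step can be illustrated with a small, dependency-free sketch of squared Mahalanobis distances. This is not the R code used in the study; the function names and the example data are hypothetical, and the cutoff for flagging a case (not stated here) is left to the analyst.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows):
    """Sample covariance matrix (denominator n - 1)."""
    n, means = len(rows), mean_vector(rows)
    p = len(means)
    cov = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - m for x, m in zip(row, means)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += d[i] * d[j] / (n - 1)
    return cov


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def squared_mahalanobis(rows):
    """Squared distance of each case from the sample centroid."""
    means, cov = mean_vector(rows), covariance_matrix(rows)
    out = []
    for row in rows:
        d = [x - m for x, m in zip(row, means)]
        out.append(sum(di * zi for di, zi in zip(d, solve(cov, d))))
    return out


# Hypothetical scores on two variables; the last case is unusual.
data = [(1, 2), (2, 1), (3, 4), (4, 3), (10, 1)]
d2 = squared_mahalanobis(data)
print(d2.index(max(d2)))  # index of the most distant case
```

Cases with the largest distances are candidates for the sensitivity analysis described above, in which analyses are run with and without them.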
Question 1: Dimensionality
The first set of analyses performed was designed to address the question of whether the internal structure of data from each instrument will be the same across conditions. Although the original versions of the instruments in this study have been subjected to factor analyses, the modified versions have not. Therefore, an EFA was conducted on the polychoric correlation matrix using FACTOR 10.8 (Lorenzo-Seva & Ferrando, 2018) in order to use the eigenvalues to determine dimensionality using both the Hull method (Lorenzo-Seva et al., 2011) and parallel analysis (k = 1,000 replications; Cho et al., 2009; Horn, 1965) as verifications of the number of factors to extract as determined by visual analysis of the scree plot.
Although methodologists argue against performing both EFA and CFA on the same data (Fokkema & Greiff, 2017), it is appropriate in our study since the results of the EFA are only used to suggest the number of dimensions to extract, not to determine and then confirm the CFA models examined. Instead, we conducted CFAs on data collected from each instrument across conditions to directly compare the fit of unidimensional, correlated two-factor, and bifactor models (i.e., a general factor and two orientation effect method factors), which were suggested by literature reviewed in the introduction.
Due to the categorical nature of item responses, CFAs were estimated using the weighted least squares with mean and variance correction estimator based on polychoric correlations. Guidelines proposed by Asparouhov and Muthén (2018) and Maydeu-Olivares and Joe (2014) were used to evaluate exact fit and approximate fit of the CFA models. Exact fit was assessed using the χ2 statistic and a claim of exact fit was deemed tenable if χ2 was not significant (p > .05). If exact fit did not hold, then the standardized root mean square residual (SRMR) was used to assess approximate fit. Approximate fit was concluded if SRMR < .05 (Maydeu-Olivares & Joe, 2014) and absolute residual correlations were small (Asparouhov & Muthén, 2018). Based on recommendations by Kline (2016), small residual correlations may be defined as having an absolute value below .10.
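The sequential fit decision described above can be written as a short rule. The sketch below is a hedged illustration: the function name and return labels are ours, and it assumes that "small" residual correlations means that every absolute residual correlation falls below .10.

```python
def classify_fit(chi2_p, srmr, residual_correlations,
                 alpha=0.05, srmr_cutoff=0.05, residual_cutoff=0.10):
    """Classify model fit as 'exact', 'approximate', or 'poor'.

    chi2_p: p value of the model chi-square test
    srmr: standardized root mean square residual
    residual_correlations: iterable of residual correlations
    """
    # Exact fit is tenable when the chi-square test is not significant.
    if chi2_p > alpha:
        return "exact"
    # Otherwise fall back to approximate fit: SRMR below the cutoff and
    # (our assumption) all absolute residual correlations below .10.
    largest = max((abs(r) for r in residual_correlations), default=0.0)
    if srmr < srmr_cutoff and largest < residual_cutoff:
        return "approximate"
    return "poor"


# Hypothetical fit results for three models.
print(classify_fit(0.21, 0.03, [0.02, -0.04]))
print(classify_fit(0.001, 0.04, [0.02, -0.08]))
print(classify_fit(0.001, 0.04, [0.02, -0.12]))
```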
If a bifactor model was able to represent the data structure for a given instrument, then bifactor indices were used to further examine dimensionality of the data on each instrument across conditions. Explained common variance (ECV), omega (ω), omega hierarchical (ωH), and average absolute relative parameter bias (ARPB) were calculated using an Excel-based bifactor indices calculator (Dueber, 2017). Based on guidelines suggested by Rodriguez et al. (2016), interpreting a unidimensional latent variable is reasonable when ECV > .70 (Bonifay et al., 2015) and when ARPB < .10 to .15 (B. Muthén et al., 1987), as relative bias between the bifactor general factor loadings and unidimensional factor loadings will be minimal. Following Nunnally and Bernstein (1994), Rodriguez et al. (2016) recommend interpreting a unidimensional total score when ωH is greater than .80. If these conditions are not met, then the two-factor correlated traits model is preferred.
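For readers who wish to see how these indices follow from standardized loadings, the sketch below computes ECV, ω, ωH, and ARPB using the standard formulas summarized by Rodriguez et al. (2016). It is not the Excel-based calculator used in the study; the function names, the example loadings, and the choice of .15 as the ARPB cutoff are assumptions made for illustration.

```python
def bifactor_indices(general, specific, unidimensional):
    """Compute ECV, omega, omega hierarchical, and ARPB.

    general: standardized general-factor loadings, one per item
    specific: one dict per specific factor, mapping item index -> loading
    unidimensional: loadings of the same items in a one-factor model
    """
    n_items = len(general)
    if len(unidimensional) != n_items:
        raise ValueError("loading vectors must have the same length")

    general_sq = sum(g ** 2 for g in general)
    specific_sq = sum(l ** 2 for f in specific for l in f.values())

    # Squared sums of loadings: total-score variance due to each factor.
    general_var = sum(general) ** 2
    specific_var = sum(sum(f.values()) ** 2 for f in specific)

    # Unique variance of each item is one minus its communality.
    unique_var = 0.0
    for i, g in enumerate(general):
        communality = g ** 2 + sum(f.get(i, 0.0) ** 2 for f in specific)
        unique_var += 1.0 - communality

    total_var = general_var + specific_var + unique_var
    bias = [abs((u - g) / g) for u, g in zip(unidimensional, general)]
    return {
        "ECV": general_sq / (general_sq + specific_sq),
        "omega": (general_var + specific_var) / total_var,
        "omega_h": general_var / total_var,
        "ARPB": sum(bias) / n_items,
    }


def supports_unidimensional(ix, ecv_min=0.70, arpb_max=0.15, omega_h_min=0.80):
    """Apply the three decision rules; the ARPB cutoff here is an assumption."""
    return (ix["ECV"] > ecv_min and ix["ARPB"] < arpb_max
            and ix["omega_h"] > omega_h_min)


# Hypothetical loadings: four items, two orientation method factors.
ix = bifactor_indices(
    general=[0.7, 0.7, 0.7, 0.7],
    specific=[{0: 0.3, 1: 0.3}, {2: 0.3, 3: 0.3}],
    unidimensional=[0.72, 0.72, 0.72, 0.72],
)
print({k: round(v, 3) for k, v in ix.items()}, supports_unidimensional(ix))
```

In this hypothetical case ECV and ARPB meet their criteria but ωH does not, so the three rules taken together would not support a unidimensional interpretation.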
Following dimensionality assessment, a preferred model was chosen according to the decision rules described in the data analysis plan and used for further analysis.
Question 2: Item Functioning
To test whether positively and negatively oriented versions of the same item share psychometric properties, measurement invariance testing was performed by applying increasingly restrictive constraints to model parameters across groups using procedures demonstrated in Pendergast et al. (2017). In the event that measurement invariance was not found tenable, measurement nonequivalence effect size indices of Nye and Drasgow (2011) were computed using the dmacs package (Dueber, 2019) for R.
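To give a sense of what a measurement nonequivalence effect size captures, the following sketch computes a dMACS-type index for the simpler linear (continuous-indicator) case, in which the expected item score is an intercept plus a loading times the latent score. This is an illustrative assumption and not the categorical model estimated in this study, nor the implementation in the dmacs package. The function and argument names are hypothetical. The index is taken as the root of the expected squared difference in expected item scores over the focal group's latent distribution, divided by the pooled item standard deviation.

```python
from math import sqrt


def dmacs_linear(nu_ref: float, lam_ref: float,
                 nu_foc: float, lam_foc: float,
                 foc_mean: float, foc_var: float,
                 pooled_sd: float) -> float:
    """Effect size for one item, assuming linear item response functions."""
    a = nu_ref - nu_foc      # intercept difference between groups
    b = lam_ref - lam_foc    # loading difference between groups
    # E[(a + b * eta)^2] for eta ~ Normal(foc_mean, foc_var), in closed form.
    expected_sq_diff = (a + b * foc_mean) ** 2 + b ** 2 * foc_var
    return sqrt(expected_sq_diff) / pooled_sd


# Made-up parameters: equal loadings, intercepts half a point apart.
print(dmacs_linear(2.5, 0.8, 2.0, 0.8, 0.0, 1.0, 1.0))  # 0.5
```

With equal loadings the index reduces to the intercept difference in pooled standard deviation units, which is why it can be read on roughly the same scale as a standardized mean difference.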
Guided by the multigroup factor analysis procedure presented by Pendergast et al. (2017), the research team conducted a series of CFAs on the RSES, UCLALS, and GBS across all conditions to further investigate measurement invariance. The model selected while addressing the first research question was used as the configural model on which measurement invariance testing was performed. Measurement invariance testing consists of subjecting the configural model to increasing restrictions on the equality of item parameter estimates across all four conditions simultaneously and testing for differences in the quality of model fit. As H. Wu and Estabrook (2016) detail, polytomous multigroup factor analysis models can be specified with 10 different sets of cross-group equality restrictions. For the purposes of this study, three different levels of measurement invariance testing were deemed appropriate. First, a configural model with no restrictions on item parameters across groups was fit and tested for exact and approximate fit. Second, item factor loadings (λ) and item residual variances (θ) were constrained to be equal across groups to test for invariance of internal covariance structure, termed weak measurement invariance. Finally, item thresholds (τ) and item intercepts (ν) were also constrained to be equal across groups to test for invariance of internal mean structure, termed strict measurement invariance. The following criteria were used to compare changes in model fit as restrictions were increased (Stark et al., 2006). Model comparisons were conducted using the χ2 difference test as performed using DIFFTEST in Mplus. Models were deemed invariant if the DIFFTEST was not significant (p > .05), although, following Cheung and Rensvold (2002), a significant result was tolerated in large samples.
Additionally, measurement invariance was concluded if the difference in the root mean square error of approximation (ΔRMSEA) was less than or equal to .015 (Chen, 2007) and the difference in the comparative fit index (ΔCFI) was greater than or equal to −.002 (Meade et al., 2008).
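The change-in-fit criteria can be expressed as a small check. The function name and argument names are assumptions for illustration. The differences are taken as the more restricted model minus the less restricted model, and the fit values in the usage line are made up.

```python
def invariance_supported(rmsea_restricted: float, rmsea_baseline: float,
                         cfi_restricted: float, cfi_baseline: float,
                         max_delta_rmsea: float = 0.015,
                         min_delta_cfi: float = -0.002) -> bool:
    """Apply the delta-RMSEA and delta-CFI cutoffs to a pair of nested models."""
    delta_rmsea = rmsea_restricted - rmsea_baseline
    delta_cfi = cfi_restricted - cfi_baseline
    return delta_rmsea <= max_delta_rmsea and delta_cfi >= min_delta_cfi


# Made-up fit indices for a configural model and a more restricted model.
print(invariance_supported(0.050, 0.045, 0.990, 0.991))  # True
```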
In the event that measurement invariance at any step was found untenable, partial measurement invariance was investigated. The goal of partial measurement invariance is to identify specific items which are responsible for the lack of invariance, and to support a claim of invariance of the remaining items. To accomplish this, modification indices are used to identify items which are suspected of not being invariant. Then cross-group equality constraints on parameter estimates for these items are relaxed across all groups, and the model is retested for change in fit using the established criteria. Items are selected in this way until the change-in-fit criteria are satisfied. In the resulting model, items with parameter estimates which are constrained across groups are deemed invariant, whereas other items are considered noninvariant.
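The iterative search for noninvariant items can be sketched as a loop. Model fitting itself is outside the scope of this sketch, so it is represented by a hypothetical callback `refit` that returns whether the change-in-fit criteria are met and a modification index for each item that is still constrained. The toy callback and its values are made up purely to exercise the loop.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical callback type: takes the set of freed items and returns
# (criteria_met, modification index for each still-constrained item).
Refit = Callable[[Set[str]], Tuple[bool, Dict[str, float]]]


def partial_invariance_search(items: List[str], refit: Refit) -> Set[str]:
    """Free one item at a time until the change-in-fit criteria are met."""
    freed: Set[str] = set()
    while len(freed) < len(items):
        criteria_met, mod_indices = refit(freed)
        if criteria_met:
            break
        candidates = {i: m for i, m in mod_indices.items() if i not in freed}
        if not candidates:
            break
        # Relax the equality constraint for the most suspect item.
        freed.add(max(candidates, key=candidates.get))
    return freed


# Toy stand-in for a model-fitting routine (values are made up).
TOY_MI = {"item1": 1.2, "item2": 9.5, "item3": 0.4, "item4": 14.1}


def toy_refit(freed: Set[str]) -> Tuple[bool, Dict[str, float]]:
    remaining = {i: m for i, m in TOY_MI.items() if i not in freed}
    # Pretend the criteria are met once no remaining index exceeds 5.
    return all(m <= 5 for m in remaining.values()), remaining


noninvariant = partial_invariance_search(list(TOY_MI), toy_refit)
print(sorted(noninvariant))  # ['item2', 'item4']
```

Items returned by the search are treated as noninvariant; all others keep their cross-group equality constraints.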
Question 3: Instrument Properties
To test whether positively and negatively oriented items produced data with the same qualities, various statistical properties of total scale scores for RSES, UCLALS, and GBS were assessed. For each instrument in each condition, the mean score, standard deviation of scores, Cronbach’s α, and ωH were computed.
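As an illustration of these summary statistics, the sketch below computes the mean and standard deviation of total scores and coefficient α from a respondents-by-items matrix of observed scores. The function name and the toy data are assumptions. The α computed here is the classical covariance-based coefficient; a variant based on polychoric correlations would require additional estimation steps not shown.

```python
from statistics import mean, stdev, variance
from typing import Dict, List


def scale_summary(scores: List[List[float]]) -> Dict[str, float]:
    """Mean, SD, and coefficient alpha for total scores.

    scores: one row per respondent, one column per item.
    """
    k = len(scores[0])
    totals = [sum(row) for row in scores]
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # alpha = k / (k - 1) * (1 - sum of item variances / total-score variance)
    alpha = (k / (k - 1)) * (1 - sum(item_vars) / variance(totals))
    return {"mean": mean(totals), "sd": stdev(totals), "alpha": alpha}


# Made-up responses from three respondents to two items.
summary = scale_summary([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
print({name: round(value, 3) for name, value in summary.items()})
```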
Question 4: Relationships With Other Variables
Finally, to test whether relationships among scores from study variables were constant across conditions, we inspected the invariance of the structural parameters by analyzing an observed path model that suggests loneliness mediates the effects of belongingness on life satisfaction and self-esteem (Figure 1; Baumeister & Leary, 1995; Mellor et al., 2008). It should be noted that we consider this model not to make causal claims, but rather to investigate the potentially biasing effect of differing item orientations on parameter estimation in a complex model. While simple correlations can be used to provide validity evidence based on relationships with other variables, small biases in correlations can compound to create larger biases in a complex model (see D. A. Cole and Preacher [2014] for a discussion of this phenomenon in the context of measurement error). Thus, testing a multivariate path model may reveal issues that would remain undetected in a simpler model such as a bivariate correlation.

Proposed mediation model among observed study variables.
Structural invariance testing was performed using Bayesian estimation methods as discussed by Asparouhov and Muthén (2017) and B. Muthén and Muthén (2013). In Asparouhov and Muthén’s (2017) framework, parameters are assessed for differences across groups by imposing small-variance priors and performing the significance tests in the posterior distribution (B. Muthén & Asparouhov, 2013). A significance test is provided for each parameter estimate from each group to see whether that estimate is different from the average estimate of that parameter. Consider, for example, the GBS → UCLA path in Figure 1. A prior distribution for this path coefficient across groups is imposed with a variance of .0025, indicating that we expect some small amount of variability in this coefficient across groups. Following estimation of the posterior distribution for the model pictured in Figure 1, Mplus (L. Muthén & Muthén, 2019) computes expected values for the GBS → UCLA path coefficient in each group and performs a significance test to see whether the estimated coefficient for each group differs from the average estimated coefficient across groups.
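To make the size of this prior concrete, the following sketch converts the variance of .0025 into a standard deviation and a central 95% interval. It assumes a zero-mean normal prior on cross-group differences in the coefficient; the text states only the variance, so the distributional form is an assumption here.

```python
from math import sqrt
from statistics import NormalDist

prior_variance = 0.0025
prior_sd = sqrt(prior_variance)

# Assumed form: differences across groups ~ Normal(0, prior_variance).
prior = NormalDist(mu=0.0, sigma=prior_sd)
half_width = prior.inv_cdf(0.975)

print(round(prior_sd, 3), round(half_width, 3))  # 0.05 0.098
```

Under this assumption, the prior places about 95% of its mass on cross-group differences smaller than roughly 0.10 in absolute value, which is the sense in which only a small amount of variability is expected.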
A Bayesian approach was chosen for estimating these models due to its ability to simultaneously test invariance of each parameter in each group (B. Muthén & Asparouhov, 2013). To perform all of these tests within a maximum likelihood estimation framework would require estimation of many different models and would be greatly complicated by attempting to account for noninvariance of one parameter while testing another. Bayesian estimation allowed for a simpler estimation and testing procedure. However, it should be noted that our interpretation of this model is from a frequentist perspective; this use of Bayesian estimation techniques to support analysis from a frequentist perspective is consistent with the usage of Asparouhov and Muthén (2017) and B. Muthén and Muthén (2013).
Unless noted otherwise, all data analyses were conducted in Mplus 8.4 (L. Muthén & Muthén, 2019).
Results
Question 1: Dimensionality
To assess whether item orientation affects the internal structure of data collected from an instrument, EFA and CFA techniques were employed. Inspection of the eigenvalues using parallel analysis and the Hull method suggested extracting one factor across all instruments and conditions. The number of factors to extract was further supported by visual inspection of the scree plots (see Figure 2 for an exemplar).

Observed scree plot for the Rosenberg Self-Esteem Scale (RSES) by experimental condition.
CFA results showed generally improving model fit for more complicated models (see Table 2). Notably, all conditions across all measures showed a significant chi-square test of model fit. SRMR was acceptably low (<.05) for all measures except UCLALS. Unidimensional UCLALS models for all conditions and the two-factor model for the all negative condition exhibited SRMR > .05, but SRMR was acceptable for the two-factor model in other conditions and for bifactor models for UCLALS in all conditions. Several residual correlations for all UCLALS models were above .10, although none were as high as .20. Local fit for the UCLALS bifactor models showed closer to acceptable levels of residual correlations. Similarly, the GBS all negatively oriented condition had a single large residual correlation in all three models. Across all other instruments and conditions, the bifactor model showed approximate fit. Accordingly, the bifactor models will be analyzed further.
RSES, UCLALS, and GBS Confirmatory Factor Analysis Exact and Approximate Fit Indices by Experimental Condition.
Note. All chi-square tests of model fit were statistically significant (p < .001). RSES = Rosenberg Self-Esteem Scale; UCLALS = UCLA Loneliness Scale; GBS = General Belongingness Scale; df = degrees of freedom; RMSEA = root mean square error of approximation; CFI = comparative fit index; SRMR = standardized root mean square residual; One = unidimensional model; Two = two factor model.
ECV across all instruments and orientation conditions was larger than .80, suggesting that over 80% of the common variance among items is explained by the general factor for the RSES, UCLALS, and GBS. In addition, the ARPB across all conditions was at or below .10, which provides evidence for equivalence between the bifactor general factor loadings and the unidimensional factor loadings for the RSES, UCLALS, and GBS. Finally, ωH was larger than .85 across all instruments and conditions, which suggests that total scores can be treated as reflecting a unidimensional construct for the RSES, UCLALS, and GBS. All results from the calculations of the bifactor indices across conditions and instruments are reported in Table 3.
RSES, UCLALS, and GBS Bifactor Indices by Experimental Condition.
Note. RSES = Rosenberg Self-Esteem Scale; UCLALS = UCLA Loneliness Scale; GBS = General Belongingness Scale; ECV = explained common variance; ω = omega; ωH = omega hierarchical; ARPB = average relative parameter bias.
Based on these results, further analysis of all instruments across all conditions will be performed using a unidimensional model and interpretation. While the unidimensional measurement models exhibit more misfit, they enable simpler interpretation and analyses. Furthermore, the high ECV and low ARPB indices indicate that the latent variable of the unidimensional model will behave similarly to the general factor of the bifactor model. Likewise, high ωH indicates that observed total scores are largely explained by a single general factor, representing the construct of interest. Therefore, despite the additional misfit, a unidimensional interpretation of each instrument was deemed appropriate.
Question 2: Item Functioning
To test whether psychometric properties of an item are influenced by item orientation, measurement invariance of data from each instrument across conditions was tested within a unidimensional model, per the results of dimensionality testing reported in Table 4. The results of measurement noninvariance testing can be found in Table 5. For RSES, the weak invariance model showed worse fit than the configural model according to the chi-square difference test, CFI, and RMSEA. Items 5 and 8 were identified as having noninvariant factor loadings, which were then freed in a partial weak invariance model. This model showed worse fit than the configural model according to the chi-square difference test, while the changes in CFI and RMSEA were within the acceptable range. For UCLA and GBS, weak invariance was found to be tenable according to change in CFI and RMSEA despite a significant chi-square difference test. For all three instruments, the strict invariance model demonstrated substantially worse fit than the weak invariance model. Across instruments, no short list of noninvariant items could be identified; instead, most items exhibited noninvariant thresholds.
Fit Indices for Measurement Invariance Testing.
Note. RSES = Rosenberg Self-Esteem Scale; UCLA = UCLA Loneliness Scale; GBS = General Belongingness Scale; df = degrees of freedom; Δp = p value associated with chi-square difference test; RMSEA = root mean square error of approximation; CFI = comparative fit index; SRMR = standardized root mean square residual.
Measurement Nonequivalence Effect Size Indices.
Note. RSES = Rosenberg Self-Esteem Scale; UCLA = UCLA Loneliness Scale; GBS = General Belongingness Scale; Scale Δmean = expected difference in total score between versions due to measurement noninvariance; Item Δmean = average expected difference in item score due to measurement noninvariance; Mean dMACS = average measurement nonequivalence effect size; P → N = items were positively oriented in the original condition but are negatively oriented in the all negative condition; N → P = items were negatively oriented in the original condition but are positively oriented in the all positive condition. Numbers in parentheses represent standard deviations across items.
In general, changing items from positively oriented to negatively oriented resulted in lower average item scores. Contrariwise, changing items from negatively oriented to positively oriented resulted in higher average item scores. Average absolute change in item mean from switching from positive to negative orientation or negative to positive orientation ranged from 0.17 to 0.61, and the average measurement nonequivalence effect size ranged from 0.29 to 0.71. Thus, item orientation is expected to have a substantial effect on total (average) scores.
Question 3: Instrument Properties
Statistical properties for the RSES, UCLALS, and GBS were then assessed across conditions to determine if positively and negatively oriented items produced data from instruments with the same qualities (see Table 6). As reported previously (see Table 3), the ωH values across instruments were above .85, suggesting high reliability based on the bifactor CFA models. Cronbach’s α reliability estimates across conditions ranged from .93 to .99 for RSES and from .92 to .99 for GBS, and were consistently .96 for UCLALS. For the original and reversal conditions, ωH values were noticeably lower than α; however, for the all positive and all negative conditions, the α and ωH estimates were much closer to each other.
RSES, UCLALS, and GBS Summary Statistics by Experimental Condition.
Note. RSES = Rosenberg Self-Esteem Scale; UCLALS = UCLA Loneliness Scale; GBS = General Belongingness Scale; α = ordinal alpha; ωH = Omega hierarchical.
Overall, observed means varied substantially based on the number of negatively oriented items. For UCLALS and GBS, each negatively oriented item decreased the mean score by approximately 0.5 observed score points, while for RSES, this decrease was below 0.2, reflecting the results of measurement nonequivalence testing. The difference between mean scores in the all positive orientation and all negative orientation conditions was 1.77 (0.28 pooled standard deviations) for RSES, 9.72 (0.73 pooled standard deviations) for UCLALS, and 6.4 (0.38 pooled standard deviations) for GBS. For instance, the mean score for UCLALS in the all positive condition was 62.51 while the mean score for the UCLALS in the all negative condition was 52.79; as participants were randomly assigned to conditions, this difference in average scores of 9.72 is expected to be due to the difference in item orientation.
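The standardized differences reported above are raw mean differences divided by a pooled standard deviation. The sketch below reproduces the arithmetic for the UCLALS example using the two means given in the text. The pooled standard deviation of about 13.3 is back-calculated from the reported values (9.72 and 0.73) and is not itself reported; the function names and the inputs to `pooled_sd` are hypothetical.

```python
from math import sqrt


def pooled_sd(sd_a: float, n_a: int, sd_b: float, n_b: int) -> float:
    """Pooled standard deviation of two independent groups."""
    return sqrt(((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2)
                / (n_a + n_b - 2))


def standardized_difference(mean_a: float, mean_b: float,
                            sd_pooled: float) -> float:
    """Mean difference expressed in pooled standard deviation units."""
    return (mean_a - mean_b) / sd_pooled


# UCLALS means from the text: all positive vs. all negative condition.
raw_diff = 62.51 - 52.79
# Back-calculated from the reported 0.73 pooled SDs (not a reported value).
implied_sd = raw_diff / 0.73

print(round(raw_diff, 2), round(implied_sd, 1))  # 9.72 13.3
```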
Question 4: Relationships With Other Variables
A Bayesian multigroup path analysis of the model depicted in Figure 1 was fit to the data to assess the influence of item orientation on structural relationships (path coefficients) between related constructs. Table 7 reports the path coefficient estimates for the tested observed mediation model by experimental condition. Estimates for observed path coefficients significantly vary across conditions for all parameters except the direct path from UCLA to SWLS. The standard deviation of a parameter across conditions varies between 0.01 and 0.07. For each parameter, the estimate from each group was tested for being different from the average estimate of that parameter across groups. In total, 5 of the 20 estimates were found to differ significantly from the mean. While most of these differences would not result in substantially different interpretations, in two cases, a path coefficient is not statistically different from zero for one condition but is statistically different from zero for the other conditions. Namely, path coefficients are not significantly different from zero for UCLA → RSES in the reversed item orientation condition and GBS → SWLS in the all negatively oriented condition.
Path Coefficient Estimates for Mediation Model by Experimental Condition.
Note. All path coefficients are significantly different from zero except for UCLA → RSES in the reversed item orientation condition and GBS → SWLS in the all negatively oriented condition. SWLS = Satisfaction with Life Scale; RSES = Rosenberg Self-Esteem Scale; UCLA = UCLA Loneliness Scale; GBS = General Belongingness Scale; SD = standard deviation of estimate across the four conditions.
Denotes the parameter estimate is significantly different from estimates of the same parameter in other conditions.
Discussion
Our study used data collected from multiple instruments commonly used in psychology and education to investigate the influence of item orientation on several measurement traits.
Question 1: Dimensionality
In regard to internal structure and dimensionality, findings indicated that treating the data from each instrument as unidimensional was appropriate, and this finding was further supported by analysis of eigenvalues from EFA. Although bifactor indices indicate the appropriateness of using a unidimensional interpretation of the data, our analyses did not specifically prohibit a multidimensional solution. Indeed, the superior fit of the two-factor model over the one-factor model for all three instruments in the original and reversed conditions allows for a multidimensional interpretation in the presence of both positively and negatively oriented items. Therefore, despite our interpretation of the data as unidimensional, our results do not contradict previous findings of multidimensionality (Gnambs & Schroeders, 2020; Roszkowski & Soven, 2010; see also literature on RSES, UCLALS, and GBS reviewed in the Introduction section herein) or poor fit of unidimensional models when both positively and negatively oriented items are included (Greenberger et al., 2003; Salazar, 2015; Sliter & Zickar, 2014).
For the purposes of dimensionality, items of different orientations induce noticeable but largely insubstantial amounts of multidimensionality. Therefore, we join with Horan et al. (2003) in recommending fitting latent variable models including separate method factors for positively and negatively oriented items (Wang et al., 2018). These models can be analyzed to determine whether a single instrument score is interpretable (Rodriguez et al., 2016) or whether methodological variance due to item orientation needs to be accounted for (Weijters et al., 2013).
Decisions about dimensionality should be driven by substantive concerns (Zickar, 2020); if it is the case that positively and negatively oriented items are presupposed by theoretical considerations or prior empirical evidence to measure different constructs, then they can and should be modeled as separate dimensions. However, the superior fit of a two-factor model over a one-factor model does not predicate the use of a multidimensional interpretation (Reise, 2012). Instead, the strong general factor in the bifactor models we fit provides evidence that a unidimensional interpretation may still be appropriate (Rodriguez et al., 2016). Psychometric analyses can only speak to the appropriateness of using a particular interpretation of the data; dimensionality of a construct is an inherently theoretical concern.
Question 2: Item Functioning
Concerning invariance of item parameter estimates across orientation conditions, findings from measurement invariance testing indicated that factor loadings were invariant across conditions for UCLALS and GBS, but not for RSES. This means that the positive and negative orientations of each item measure the single underlying construct with the same strength of association. One of the RSES items found not to have an invariant factor loading across conditions was given a questionable reversal by Greenberger et al. (2003). Specifically, their reversal of “I wish I could have more respect for myself” was “I think I have more respect for myself;” this reversal is both grammatically unsound and loses the original item intent of comparing amount of self-respect to a desired amount of self-respect. “I believe I have enough respect for myself” would be a more proper reversal. The lack of invariance of factor loadings for RSES may be a problem of item reversal quality for RSES items rather than an issue with positive versus negative item orientation. Item writing, and item revising, is a tremendously difficult and time-consuming task for which both internal and expert review, coupled with linguistic expert review, are necessary. To provide continuity with Greenberger et al. (2003), we purposely did not change this reversal.
Given prior research findings that negatively oriented items produce lower item discriminations, standardized factor loadings, or average interitem correlations (K. L. Cole et al., 2019; Sliter & Zickar, 2014; Zhang et al., 2016), finding that factor loadings were invariant across conditions was surprising. However, several researchers have found that negative item orientation effects are substantially less pronounced for high ability readers (Gnambs & Schroeders, 2020; Kam & Fan, 2018). Therefore, given the high education level of our sample, it is reasonable that negatively oriented items elicited the target construct from participants just as effectively as positively oriented items.
Item thresholds, on the other hand, were consistently noninvariant across all instruments, with item thresholds being generally higher for positively oriented items compared with negatively oriented items. This finding is congruent with the research of Kamoen et al. (2013), Salazar (2015), and Sliter and Zickar (2014), all of whom found that negatively oriented items exhibited lower average item scores than roughly equivalent positively oriented items.
Question 3: Instrument Properties
Statistical properties of the total scores of the three instruments varied substantially across conditions. Means of total scores vary across conditions, with more negatively oriented items being associated with lower mean scores. This is consistent with, and indeed follows directly from, the results of measurement invariance testing which found generally higher thresholds for the negatively oriented version of each item. This difference is less pronounced with the RSES than with the UCLALS and GBS. Combined with varying standard deviations of total scores across conditions, these results raise concerns about the relative ability of positively and negatively oriented items to equivalently measure changes in mean scores across groups or time, although this could not be tested herein.
Additionally, while the different types of reliabilities were fairly similar to each other in the all positive and all negative conditions, ωH was consistently lower than alpha in the original and reversed conditions. We suggest two possible interpretations.
In the first interpretation, we assume that positively and negatively oriented items measure the same construct and any multidimensionality induced by the presence of both positively and negatively oriented items is methodological. In this case, reliability estimates in the all positive and all negative conditions include reliable variance due to common method variance (Podsakoff et al., 2003), whereas ωH from the bifactor model in the original and reversed conditions allows common method variance due to item orientation to be appropriately removed from consideration of reliability. This explanation provides a link between our findings and Gu et al.’s (2017) findings from simulated data that failing to account for methodological variance (e.g., as in computing α) will result in reliability estimates which are overestimates of a squared correlation between observed total scores and the substantive construct of interest because total scores reflect additional, methodological, latent constructs. Thus, we join generations of methodological researchers (Green et al., 1977; Hogan et al., 2000; Jaeger, 1991; McNeish, 2018; Sijtsma, 2009) in recommending that great care be taken by applied researchers and methodologists in selecting and interpreting reliability indices.
In the second interpretation, we assume that positively and negatively oriented items measure slightly different constructs. In this case, the items in the all positive and all negative orientation conditions measure a single construct, so alpha and ωH provide similar estimates of reliability. On the other hand, the original and reversal conditions measure two separate constructs, with a total score possibly reflecting a single higher order construct. In these conditions, α is higher than ωH due to alpha’s unfortunate ability to account for all reliable variance in total scores, even that which is due to the presence of multidimensionality rather than the single construct of interest (McNeish, 2018; Sijtsma, 2009). In this interpretation, use of both positively and negatively oriented items is required for measuring both the positive and negative forms of the construct of interest.
Question 4: Relationships With Other Variables
Next, when we consider the effects of item orientation on structural relationships, the multigroup Bayesian path analysis indicates statistically significant but, for the most part, substantively minimal differences in path coefficients across conditions. These findings are consistent with Greenberger et al. (2003) who found that RSES had relatively stable correlations with external variables (except depression) across item orientation conditions. As three of the five significantly different coefficients are found in the reversal condition, it is possible that imperfect item reversals are to blame.
However, two of the twenty path coefficients (UCLA → RSES in the reversal condition and GBS → SWLS in the all negative condition) are meaningfully different from their values in the other conditions. Moreover, both coefficients are not significantly different from zero in the condition in which they differ, which could lead to different conclusions in applied research. When considered across a corpus of research concerning a construct, in which a variety of instruments are used to measure that construct, some heterogeneity of results is expected. Thus, although differences in these two path coefficients are concerning, they are not fatally so.
The largely manageable differences in path coefficients across conditions combined with the findings of weak invariance (partial for RSES) are preliminary statistical evidence that differences in item orientation do not substantially alter the internal structure of these constructs or the relationships between them. However, it is critical that we add a disclaimer to this finding: given that neither measurement nor structural invariance was exact, item orientation clearly has some impact on the measurement of psychological constructs. Thus, replication is needed with other instruments, samples, and analyses to determine whether this finding is specific to our study.
For the Applied Researcher
Applied researchers can use the results of this study to help their decision making, since previous work has cautioned against the use of negatively oriented items, and the literature does not clearly direct instrument developers. Overall, findings based on an experimental design suggest that negatively oriented items provide minimal disturbances to the psychometrics of data from instruments used to measure the three psychological constructs based on the approach used herein. However, we also warn against blindly interpreting scores from instruments with positively and negatively oriented items. Instead, appropriate psychometric models should be employed and the extent of bias due to item orientation assessed. Although our results are favorable toward the inclusion of negatively oriented items, Dalal and Carter (2014) suggest careful reflection on the intended purpose of negatively oriented items before determining inclusion. Based on the minimal issues identified with the use of negatively oriented items, heeding their sage advice when using instruments with negatively oriented items may mitigate unintended challenges.
In this study, our research team was able to conduct a thorough psychometric investigation on several instruments, a luxury that applied researchers may not have due to a number of limitations (e.g., scope of study, sample characteristics, timing, sample size). Expecting that all applied research would undertake a robust analysis of the instrument would create unnecessary tension between simulated and applied research. As applied researchers design and propose studies, the intentionality justified above could be integrated early in the research process through prestudy practices that include justification for negatively oriented items and descriptions of a priori strategies to mitigate any anticipated measurement effects. Additionally, collecting validity evidence and pilot testing could be incorporated as part of the overall timeline, rather than being left for examination after data collection. Moreover, qualitative approaches, such as cognitive interviewing, are recommended as a complement to improve validity (American Educational Research Association et al., 2014). This type of due diligence and transparency during the prestages of research will build confidence toward implementation, collection, analysis, and interpretation of data from negatively oriented items. Purposeful use or avoidance of negatively oriented items does not limit research, but rather challenges applied researchers to think more critically about how the target sample would respond to either.
Additionally, we would like to point out that our results may not generalize to all instruments given the measures we selected and the approach we used for writing item reversals. That is, issues with negatively oriented items depend on the item wording (i.e., use of reverse wording vs. negation), the instrument, the construct, and the care taken to develop all of these. In this project, we utilized a number of practices to minimize possible ill effects of item orientation:
We specifically avoided use of negative particles when revising items into negative orientation.
We worked with an applied linguist to write item reversals.
Our items were reviewed by both subject matter and linguistic experts.
Items were presented in a completely random order to participants; items of similar orientation were not grouped together, nor were items from the same instrument.
The measures we selected for inclusion all included a 50:50 mix of positively and negatively oriented items in the original; these constructs are already known to be suitable for measurement with negatively oriented items.
Additionally, we do not want researchers to think that item orientation or the type of negative orientation does not matter based on our findings. Instead, we urge researchers to employ appropriate measurement tools to examine effects of item orientation on scale properties and substantive results.
Finally, researchers should take care when interpreting results from different measures of the same construct and realize that different conclusions may occur due to the differences in item wording across instruments. Researchers should consider if the consistencies are occurring across studies with the same or different measures coupled with how items are phrased. We contend that findings will be stronger if consistency can be found when measuring a construct with different item orientation, particularly when best practices are followed in the item writing. In general, researchers should use caution when making broad conclusions from newly developed instruments or studies using only a single instrument to measure a construct.
Limitations
The present study has several limitations. Two items on the RSES may contribute to potential issues. RSES7 was accidentally included with positive orientation in all conditions. Additionally, RSES9 exhibited a negative residual variance in the bifactor model for the all negatively phrased items condition. The loading of RSES9 on its specific factor was fixed to zero in that model.
Despite our best efforts, the methodology of collecting data using MTurk resulted in convenience sampling. Specifically, our dimensionality results in the “original” condition differed from previous research, which tended to show a greater extent of multidimensionality than was found in our study. Also, our sample, like those of studies before ours, is relatively highly educated, and our results may not apply to samples that are less educated. Finally, the between-persons design does not allow us to directly address how a given individual would respond differently to the same item presented with different polarities.
Future Avenues for Research
In this study, we employed techniques common to the CFA framework for data analysis; another option would be to use techniques of the IRT framework. IRT provides a different set of tools for assessing common factor models, including properties such as invariance of item parameters. For example, in the IRT framework the influence of item noninvariance on total scores is often separately estimated for individuals in different score ranges, whereas in the CFA framework differential effects across the latent construct continuum are not typically considered. The IRT framework could be used to test hypothesized relationships between item orientation and participant responses, as it emphasizes conditional standard errors of measurement. Despite these potential benefits of using an IRT perspective, we employed a CFA framework, as the research surrounding the use of bifactor indices for dimensionality assessment (Rodriguez et al., 2016) has focused on the CFA context and our power analyses for invariance testing were based on CFA analyses.
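As a brief illustration of what a conditional standard error of measurement looks like in the IRT framework, consider a sketch under the two-parameter logistic model. This model is an assumption chosen for simplicity and is not a model fitted in this study; Likert-type items such as those analyzed here would typically call for a polytomous model (e.g., the graded response model). The symbols $a_j$ (item discrimination), $b_j$ (item location), and $\theta$ (latent trait level) are introduced here only for illustration.

```latex
P_j(\theta) = \frac{1}{1 + \exp[-a_j(\theta - b_j)]}, \qquad
I(\theta) = \sum_{j=1}^{J} a_j^{2}\, P_j(\theta)\,[1 - P_j(\theta)], \qquad
SE(\hat{\theta} \mid \theta) \approx \frac{1}{\sqrt{I(\theta)}}
```

Because the test information $I(\theta)$ varies with $\theta$, measurement precision differs across score ranges, so a shift in item parameters associated with item orientation could affect respondents at different trait levels unequally.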
This study could also be expanded on by modifying the context of the research. That is, all of the research cited herein employed self-report instruments, whereas item orientation may have different effects when the respondent is a rater or judge of others. Additionally, given our finding of approximate but not exact structural invariance, further invariance testing of different path models using other instruments would help clarify this issue. Future research should strive for a more diverse sample that is more representative of the general population rather than just the highly educated and those able and willing to complete online surveys.
Finally, the psycholinguistic and emotional impact that positively and negatively oriented items have on participants is an open empirical question that we encourage our colleagues in related fields to explore more deeply. For instance, innovative fMRI and electroencephalogram studies may shed light on how humans process negatively and positively oriented survey items.
Conclusion
In comparison with previous studies (Benson & Hocevar, 1985; K. L. Cole et al., 2019; Greenberger et al., 2003; Marsh, 1996; Salazar, 2015; Sliter & Zickar, 2014), this large-scale randomized survey experiment and the modern analytical techniques implemented purposefully and uniquely explored our research question about item orientation. Applications of the results are widespread, as negative orientation is used across fields and constructs. Understanding the dimensionality and measurement invariance of both orientations within the same data set can provide more definitive results about the usefulness of this strategy as a cross-disciplinary psychometric tool.
Researchers are encouraged to reflect on the purpose of negatively oriented items. Although negatively oriented items (as well as negatively worded, negatively keyed, or reverse-coded items) are included to mitigate expected response styles, unintended (albeit negligible in our study) consequences may also occur, causing issues such as induced multidimensionality. By making purposeful design decisions and communicating justifications with transparency, researchers and readers can responsibly engage in measurement and interpretation of the data collected. Ultimately, the value given to negatively oriented items relies on the expertise of the researcher and the conditions of the applied context of interest. In conclusion, we ask that the results of this study, and other research into item-writing practices, be put into practice. Because of the breadth of this body of research, we advocate for including on every research team a psychometrician or other item-writing expert (e.g., applied linguist) who is able to synthesize this body of research and apply it in future studies.
Footnotes
Author Note
A previous version of this work was presented at the APA 2019 conference.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
