Abstract
Individual response style behaviors, unrelated to the latent trait of interest, may influence responses to ordinal survey items. Response style can introduce bias in the total score with respect to the trait of interest, threatening valid interpretation of scores. Despite claims of response style stability across scales, there has been little research into stability across multiple scales from the beneficial perspective of item response trees (IRTrees). This study examines an extension of the IRTree methodology to include mixed item formats, providing an empirical example of responses to three scales measuring perceptions of social media, climate change, and medical marijuana use. Results show that extreme and midpoint response styles were not stable across scales within a single administration and that 5-point Likert-type items elicited higher levels of extreme response style than the 4-point items. Estimation of the latent trait of interest varied across response style models, particularly at the lower end of the score distribution, demonstrating that an appropriate response style model is important for adequate trait estimation using Bayesian Markov chain Monte Carlo estimation.
Self-reported measures are common in education, the social sciences, and psychology, with instruments often using Likert-type items to measure attitudes and personalities of respondents (Böckenholt, 2017). Self-report Likert items vary in their format; some have midpoint responses (i.e., odd numbers of response categories) and some do not (i.e., even numbers of response categories). Despite their prevalence of use, Likert items and instrument scores using Likert items are prone to response style bias. A response style can be thought of as the tendency to respond to items in some systematic way other than the trait of interest (TOI) or as an interaction of personal disposition and situational factors (Plieninger & Meiser, 2014).
Bolt and Johnson (2009) have argued that response style may reflect a respondent’s attempt to reduce the cognitive demand of distinguishing between differing levels of agreement or disagreement. For example, respondents may have differing opinions over what constitutes strong agreement—it is possible that they endorse a construct equally, but only differ in their interpretation of the response anchors. Because response option anchors are not on a continuum, respondents can also vary in their interpretation of the distance between response categories. This further obfuscates accurate interpretations of survey results as differing scale scores may be due, in part, to response style bias.
There are varying kinds of response styles that manifest depending on item format or the personality and background of respondents (Clarke, 2001; Harzing, 2006; Paulhus, 1991). Some of the response styles commonly cited in the literature are acquiescence response style (ARS), disacquiescence response style (DRS), extreme response style (ERS), and midpoint response style (MRS). ARS and DRS are characterized by the propensity to agree or disagree, respectively, without regard to the construct being measured. MRS is the tendency of respondents to pick the midpoint or neutral response on a scale with an odd number of categories while ERS is characterized by respondents endorsing the extreme options at the endpoints of scales. These response styles are considered content-independent, trait-like constructs that are stable over time and across subscales during a single test administration (Austin et al., 2006; Baumgartner & Steenkamp, 2001; Jin & Wang, 2014; Wetzel et al., 2013).
The presence of response styles in self-reported measures is a concern, as the validity of instruments using these measures is threatened by unaccounted-for response styles. Respondents’ preference for extreme responses can bias measurements of traits of interest and can confound interpretations of scores, thus interfering with the psychometric study of an instrument. For example, items may appear to function differently in their measurement of the TOI when respondents vary only in response style (Bolt & Johnson, 2009; Park & Wu, 2019). The presence of response styles presents a host of other issues. ERS can lead to spurious correlations between constructs as respondents may have inflated or deflated scores in unrelated constructs (Jeon & De Boeck, 2019; Park & Wu, 2019; Paulhus, 1991), which may lead to overestimation or underestimation of the construct of interest and score inflation (Park & Wu, 2019) or manifest as an issue in diagnostic surveys where a respondent’s score plays a role in a clinical diagnosis or access to services. Furthermore, response styles threaten dimensionality, construct validity, predictive validity, and the reliability of self-reported measures (Baumgartner & Steenkamp, 2001; Cheung & Rensvold, 2000; De Jong et al., 2008; van Herk et al., 2004; van Rosmalen et al., 2010). This can result in differential item functioning, a lack of measurement invariance, and negative effects on parameter recovery (Bolt & Johnson, 2009; Liu et al., 2017). Going further, response styles threaten comparisons of diverse groups and international or cross-cultural assessment. There are many documented correlates (e.g., age, individual traits, or cultural background) of different types of response styles that often vary by culture or region, thus making accurate comparisons difficult (Harzing, 2006; Khorramdel & von Davier, 2014).
Measuring Response Style
Researchers have developed several methods to measure response style and response style bias. More simplistic methods entail counting the number of extreme or midpoint responses (e.g., Harzing, 2006). However, these methods do not account for the influence of the target trait on responses (Bolt & Newton, 2011). For example, a person providing many very low scores on a set of items could be an extreme responder, tending to select the most extreme option on the low end of the scale, but could also be very low on the substantive TOI. The classical approach does not address the confounding of the substantive TOI and, in this example, ERS.
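The counting approach described above can be sketched in a few lines. This is an illustrative sketch only; the function name and the integer coding of responses (1 to the number of categories) are assumptions, not taken from the cited studies.

```python
def response_style_proportions(responses, n_categories=5):
    """Return (extreme, midpoint) proportions for one respondent.

    responses: iterable of integer codes 1..n_categories (None = missing).
    The midpoint proportion is None when the scale has no midpoint.
    """
    answered = [r for r in responses if r is not None]
    if not answered:
        raise ValueError("no answered items")
    # Extreme responses are the two endpoint categories.
    extreme = sum(r in (1, n_categories) for r in answered) / len(answered)
    if n_categories % 2 == 1:
        mid = (n_categories + 1) // 2
        midpoint = sum(r == mid for r in answered) / len(answered)
    else:
        midpoint = None  # even number of categories: no midpoint exists
    return extreme, midpoint
```

A respondent answering [1, 1, 1, 1, 2] receives an extreme proportion of .8 whether the pattern reflects ERS or a genuinely low standing on the TOI, which is the confound noted above.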
Providing a promising approach to address issues with the classical approach, Wetzel et al. (2016) use a trait-state-occasion model for ERS and ARS, building on Weijters, Geuens, and Schillewaert (2010b). They measure the stability of response styles as the amount of variance in the state response style factors that is explained by the trait response style factors. Wetzel et al. (2016) also use the latent-state model, which measures stability of response styles as the correlations between time-specific response style factors across waves. However, indicators of response style were based on sum scores for a random sample of items and do not utilize item-level responses.
Item-level approaches utilize various versions of item response models. Jin and Wang (2013) provide a thorough overview of item response theory (IRT) models for measuring ERS. They describe how the Partial Credit Model (PCM; Masters, 1982) can be extended into a latent class PCM, as done in Moors (2008) and van Rosmalen et al. (2010), identifying latent classes of extreme and non-extreme responders. Others (e.g., Bolt & Johnson, 2009; Bolt & Newton, 2011; Johnson & Bolt, 2010) used multidimensional IRT to simultaneously model one or two substantive traits of interest and one trait corresponding to ERS using a multidimensional nominal response model. This approach ignores the ordinal nature of Likert-type items. In a method that does account for ordinal items, Johnson (2003) uses random thresholds, allowing for individual respondents to have their own thresholds in a probit model for ordinal responses. A respondent’s distance between adjacent thresholds is used as an indicator of ERS propensity, with larger distances indicating use of ERS more frequently. Johnson’s (2003) approach does not model response style as a separate latent trait, one drawback of the approach.
A recent class of approaches uses multiprocess, multidimensional item response trees (IRTrees) to model hypothesized response subprocesses (e.g., Böckenholt, 2012; Böckenholt & Meiser, 2017; De Boeck & Partchev, 2012; Khorramdel & von Davier, 2014). While the modeling has been developed in the last decade, early efforts (Hart, 1923) theorized that respondents used a decision-making series of steps, or subprocesses, when responding and developed a questionnaire in order to assess this decision-based approach. These subprocess models combine processing trees of binary decision nodes with IRT parameterizations of node probabilities (Meiser et al., 2019). The decision nodes represent response subprocesses and the continuous latent variables related to those processes (i.e., target traits and response styles; Khorramdel et al., 2019). In practice, these models are used to disentangle ordinal item responses into binary pseudo-items based on the decision tree specification (Khorramdel & von Davier, 2014) and the resulting pseudo-items are used to measure response styles and target traits. Thus, IRTrees address many of the shortcomings of previous approaches in that they use item-level responses, preserve the ordinal nature of items, and treat response styles separately from the latent TOI to avoid confounding.
Consider a five-category Likert item, with response options “strongly disagree (1),”“disagree (2),”“neither agree nor disagree (3),”“agree (4),” and “strongly agree (5).” This item can be decomposed into three subprocesses (i.e., three decision stages), with each subprocess driven by a separate latent trait. The first subprocess for items on these scales represents whether a respondent has an opinion on the construct or not. That is, the respondent either selects “neither agree nor disagree (3),” the midpoint, or one of the other options, either in agreement with the item’s content or not. Midpoint responses resulted in MRS pseudo-items coded as 1 and the remaining TOI and ERS pseudo-items coded as missing because a respondent cannot simultaneously have an opinion and not have an opinion on the item’s content.
The second subprocess represents a respondent’s decision of whether they are high or low on the target trait (i.e., choosing between agreement or disagreement). A respondent choosing “agree (4)” or “strongly agree (5)” has the TOI pseudo-items coded as 1 with responses of “disagree (2)” or “strongly disagree (1)” coded as 0 to indicate endorsement (or not) of the TOI.
The third subprocess constitutes respondents’ decision to choose the extreme response option or not (e.g., choosing between “strongly agree (5)” and “agree (4)” or “strongly disagree (1)” and “disagree (2)”). Respondents selecting “strongly disagree (1)” or “strongly agree (5)” had ERS pseudo-items coded as 1 while respondents selecting “disagree (2)” or “agree (4)” had ERS pseudo-items coded as 0. Hence, a respondent selecting “agree” on an item has pseudo-item responses [010] reflecting failure to endorse the midpoint response, agreement with the target trait, and failure to endorse the extreme response option, respectively. A respondent selecting “neither agree nor disagree” would have pseudo-item responses [1-missing-missing]. Figure 1 and Table 2 present this subprocess and pseudo-item coding.
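The three-stage coding just described can be written as a small recoding function. This is a sketch for illustration; the function name and the use of None for a missing pseudo-item are assumptions.

```python
MISSING = None  # assumed marker for a pseudo-item that is not observed


def code_five_point(response):
    """Map a 5-point response (1..5) to (MRS, TOI, ERS) pseudo-items."""
    if response not in (1, 2, 3, 4, 5):
        raise ValueError("response must be an integer from 1 to 5")
    if response == 3:
        # Midpoint chosen: TOI and ERS pseudo-items are missing.
        return (1, MISSING, MISSING)
    toi = 1 if response >= 4 else 0       # agreement side vs. disagreement side
    ers = 1 if response in (1, 5) else 0  # endpoint category chosen or not
    return (0, toi, ers)
```

For example, code_five_point(4) gives (0, 1, 0), matching the [010] pattern for “agree,” and code_five_point(3) gives (1, None, None).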

Three-process IRTree.
For the three-decision IRTree (i.e., Figure 1), probabilities of each observed response option to item j are computed for respondent i by tracing the path from the MRS-driven subprocess through the second (TOI) and third (ERS) subprocesses. The respondent either endorses the middle category, “Neither Agree nor Disagree,” with the probability of endorsing the MRS pseudo-item, or does not endorse it and proceeds to the TOI and ERS subprocesses, in which case the probability of the observed response is the product of the node probabilities along that path.
The probability of endorsing each pseudo-item is parameterized using an IRT two-parameter logistic (2PL) model.
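As an illustration of how the path tracing works, the sketch below computes the five category probabilities for one respondent and one item from three 2PL node probabilities. The slope-difficulty form a(theta - b), the parameter names, and the dictionary layout are assumptions made for this example; the estimation software may use a different but equivalent parameterization.

```python
import math


def two_pl(theta, a, b):
    """2PL probability of endorsing a pseudo-item (assumed a*(theta - b) form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def five_point_probs(theta_mrs, theta_toi, theta_ers, item):
    """Category probabilities for a 5-point item under the three-process tree.

    item: dict with (a, b) tuples under the keys "mrs", "toi", and "ers".
    """
    p_m = two_pl(theta_mrs, *item["mrs"])  # choose the midpoint
    p_t = two_pl(theta_toi, *item["toi"])  # agree rather than disagree
    p_e = two_pl(theta_ers, *item["ers"])  # choose the extreme option
    return {
        1: (1 - p_m) * (1 - p_t) * p_e,        # strongly disagree
        2: (1 - p_m) * (1 - p_t) * (1 - p_e),  # disagree
        3: p_m,                                # neither agree nor disagree
        4: (1 - p_m) * p_t * (1 - p_e),        # agree
        5: (1 - p_m) * p_t * p_e,              # strongly agree
    }
```

Because each branch splits its parent probability into two complementary parts, the five category probabilities sum to 1 for any parameter values.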
Stability
Measurement of response styles in this way is important because unaccounted-for response styles are a threat to instrument validity and the validity of comparisons made across groups. However, there has been little research into response style stability from the beneficial perspective of IRTrees. Stability in individuals’ response styles can be considered across time, across instruments, and across items within the same instrument. Across time, previous research indicates response styles are stable over both short and long periods. Stability has been found over a 1-year period (ARS, ERS, DRS, and MRS: Weijters, Geuens, & Schillewaert, 2010b), a 2-year period (ARS and ERS: Bachman & O’Malley, 1984), a 4-year period (ARS: Billiet & Davidov, 2008), and an 8-year period (ARS and ERS: Wetzel et al., 2016). Across-time stability suggests that the perceived stability of personality and attitude constructs may be ascribed to response style stability. Besides the methods just referenced, Ames and Leventhal (2019) present a longitudinal IRTree for measuring changes in both TOI and response style across multiple time points. Within the same instrument, across-item response style stability implies that individuals systematically use the response scale in a similar manner across items (e.g., ARS and ERS: Weijters, Geuens, & Schillewaert, 2010a; ARS: Danner et al., 2015).
Across-scale stability implies respondents systematically use the response scale in a similar manner across scales (e.g., ERS: Aichholzer, 2013; Bolt & Newton, 2011; Wetzel et al., 2013; ARS & ERS: Weijters, Geuens, & Schillewaert, 2010a). However, many previous across-scale studies used related constructs. For example, Bolt and Newton (2011) examined two science subscales from the Program for International Student Assessment (PISA; OECD, 2006), measuring a respondent’s “Enjoyment of Science” (ENJ) and “Value of Science” (VAL), each item a forced-choice, 4-point Likert-style item. The ENJ and VAL scales had a positive correlation of .580. Park and Wu (2019) use only Rosenberg’s Self Esteem Scale to estimate ERS, ARS, and DRS. Another two studies measure ERS and MRS using the Personal Need for Structure questionnaire containing two substantive dimensions (Böckenholt & Meiser, 2017; Meiser et al., 2019). In the latter study, the estimated correlation between the two substantive dimensions was large (r = .75). Jeon and De Boeck (2019) measure ERS in three multiple-subscale assessment data samples using the Big Five personality assessment (Goldberg, 1992), the DISC personality assessment (Rosenberg & Silvert, 2013), and the Humor Styles Questionnaire (Martin et al., 2003). In Jeon and De Boeck’s analyses, the magnitudes of correlations between subscales averaged .22 for the Big Five assessment, .39 for the DISC assessment, and .25 for the Humor Styles Questionnaire. Khorramdel and von Davier’s (2014) analysis of Big Five personality traits from the International Personality Item Pool also showed a moderate relationship between subscales, with magnitudes of correlations between the subscales averaging .33. In another study, Khorramdel et al. (2019) used the PIAAC with four subscales measuring work-related tasks; the average magnitude of correlations between subscales was .64.
Examining across-scale stability using unrelated constructs and different item types will improve the field’s understanding of response style stability and of how to accurately account for response styles when they are measured simultaneously with substantive traits. Bolt and Newton (2011) found improved measurement of ERS when using simultaneous analysis of multiple scales, but the improvement in measurement of TOI latent traits was smaller. This pattern may not hold for unrelated constructs, or those with lower intertrait correlations. Baumgartner and Steenkamp (2001) recommend heterogeneous items to measure response style: either items with different content than the substantive TOI or a mix of positively and negatively worded items. The different-content approach has not been evaluated using IRTrees; the results of this research may help guide future measurement of response styles and the design of questionnaires that account for response styles such as ERS and MRS. To address the gap in the current literature around response styles, this study addresses the following research questions using an empirical data set:
Method
As detailed, previous studies of response styles used either a single scale, or scales with correlated subscales to estimate response style. This study estimates response styles using three attitudinal subscales measuring opinions on medicinal cannabis, climate change, and social media. The current study uses multiple IRTree models to determine the stability of response styles across responses to three subscales. This novel approach to the estimation of response styles will lend evidence in support of, or against, the hypothesized content independence and stability of response styles. Validating a model for disentangling response styles from substantive traits of interest will help future researchers to draw valid conclusions from surveys using Likert scales.
Instrumentation
The medicinal cannabis subscale intends to measure an individual’s perception of the appropriateness of using cannabis for medicinal purposes and contains eleven 5-point Likert items. The climate change subscale intends to measure an individual’s perception of the change in global or regional climate patterns and contains ten 5-point Likert items. For the medicinal cannabis and climate change subscales, higher scores indicate greater endorsement of medicinal cannabis and stronger belief in human impact on climate change and its negative consequences. The social media scale contains thirteen 4-point Likert items intended to measure the influence of engagement in social media on one’s perceptions of their image portrayed in the social environment and self-image. Higher scores on the social media subscale indicate less authenticity in one’s online persona. Items on both the 5-point and the 4-point Likert scales had responses ranging from “strongly disagree” to “strongly agree.” Four items on the climate change subscale and five items on the social media subscale that were negatively worded were recoded before the subsequent analysis. The relationships among study scales are presented in Table 1, with reliability coefficients on the diagonal of the matrix. Reliability of each subscale was assessed using Cronbach’s (1951) coefficient alpha. Results generally supported the reliability of each subscale, with the medicinal cannabis and climate change subscales each having coefficient alpha ≥ 0.90 (Nunnally & Bernstein, 1994) and the social media subscale having a coefficient alpha of 0.80.
Relationship Among Study Scales.
Note. Reliability coefficients are enclosed in parentheses. Midpt. represents the proportion of midpoint responses for the scale, and Ext. represents the proportion of extreme responses.
In the current study, the scales measure different constructs, and two of them are uncorrelated (medicinal cannabis and social media; r = .01). Unrelated, uncorrelated scales have yet to be examined in the study of response style stability. Table 1 presents the relationship among raw scores of the scales, as well as other scale information. If response styles are truly stable traits, then individuals with high levels of a response tendency on one scale should also show high levels of the same response tendency on another scale. That is, we would expect moderate to large positive correlations among response style traits across scales.
Sample
The data used in the study come from responses collected in fall 2018 (n = 1,199) via the online survey platform Qualtrics. Of respondents who gave their gender, 27% were male, 72% were female, and the remainder specified another gender identity. The majority of respondents completed some form of higher education, with 41% holding a master’s degree or higher, 35% holding a 2-year or 4-year degree, and only 23% reporting their level of education as some college or less. The median age of respondents was 33.
IRTrees and Binary Pseudo-Items
Responses to the medicinal cannabis and climate change scales were recoded into binary pseudo-items representing ERS, TOI, and MRS, depicted in Figure 1 and Table 2. Responses to the social media subscale, with no midpoint for the items and four response options, were coded into pseudo-items representing TOI and ERS, without the MRS subprocess. Hence, a respondent responding “strongly disagree” on a social media item would have pseudo-item responses [01] indicating disagreement with the social media TOI and selection of the extreme response option, respectively. This 4-point Likert item is presented in Figure 2.
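For the 4-point items, the same idea applies without the midpoint stage. The sketch below is illustrative; the function name and the integer coding 1 (“strongly disagree”) to 4 (“strongly agree”) are assumptions.

```python
def code_four_point(response):
    """Map a 4-point response (1..4) to (TOI, ERS) pseudo-items."""
    if response not in (1, 2, 3, 4):
        raise ValueError("response must be an integer from 1 to 4")
    toi = 1 if response >= 3 else 0       # agreement side vs. disagreement side
    ers = 1 if response in (1, 4) else 0  # endpoint category chosen or not
    return (toi, ers)
```

Here code_four_point(1) returns (0, 1), the [01] pattern described above for “strongly disagree.” With no midpoint stage, no pseudo-item is ever missing for an answered item.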
Pseudo-Item Coding Matrix for 5-Point Likert Items.
Note.

Two-process IRTree.
Modeling Approach
To answer the three research questions, a series of models was fit to the data. Model 1, the “Full” model, estimates all traits simultaneously. That is, the model has two MRS traits, one per 5-point scale; three ERS traits, one per scale; and three TOI traits, one per scale, for a total of 8 latent traits (see Figure 3). Each latent trait in the Full model is informed only by items in the specific scale. For example, while the MRS traits for climate change and medicinal cannabis are estimated simultaneously in the Full model, the MRS pseudo-items from the climate change scale only inform the climate change MRS trait. Those pseudo-items do not inform the medicinal cannabis MRS trait.

Full model.
Model 2, the “Single Response Style” model, estimates one trait for MRS and one trait for ERS to represent the case of response style stability. Specifically, there is one MRS trait, estimated across all 5-point scales; that is, all MRS pseudo-items from the climate change and medicinal cannabis scales inform the single MRS trait. There is one ERS trait, estimated across all scales, and all ERS pseudo-items inform the single ERS trait. Finally, there are three TOI traits, one per scale, for a total of 5 latent traits (see Figure 4).
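The difference between the two specifications comes down to which latent trait each block of pseudo-items loads on. The sketch below only enumerates the traits implied by each specification; the scale labels and function names are hypothetical, and it does not fit any model.

```python
# Number of response categories per scale, as described in Instrumentation.
SCALES = {"medicinal_cannabis": 5, "climate_change": 5, "social_media": 4}


def full_model_traits(scales):
    """Scale-specific MRS (odd-category scales only), TOI, and ERS traits."""
    traits = []
    for name, n_categories in scales.items():
        if n_categories % 2 == 1:
            traits.append(f"MRS_{name}")
        traits.append(f"TOI_{name}")
        traits.append(f"ERS_{name}")
    return traits


def single_rs_traits(scales):
    """Scale-specific TOI traits plus one shared MRS and one shared ERS trait."""
    traits = [f"TOI_{name}" for name in scales]
    if any(n_categories % 2 == 1 for n_categories in scales.values()):
        traits.append("MRS")
    traits.append("ERS")
    return traits
```

Applied to the three scales, the first function lists 8 traits and the second lists 5, matching the counts given for Models 1 and 2.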

Single response style model.
Models 3 to 5 fit an IRTree to each scale separately. That is, a three-process model like that in Figure 1 was fit separately to the medicinal cannabis scale (Model 3) and to the climate change scale (Model 4). Models 3 and 4 have three latent traits each (MRS, TOI, and ERS). A two-process model, like that in Figure 2, was fit to the social media scale. Model 5 has two latent traits (TOI, ERS).
Stable across-scale response styles (Research Question 1) will be evidenced by a consistent ranking of respondents’ estimated response styles across Model 1 and Model 2. Furthermore, stable response styles will favor Model 2. Stable across-item response styles (Research Question 2) will be evidenced by a consistent ranking of respondents’ estimated ERS across Model 1. The impact of model specification on TOI estimation (Research Question 3) will be examined by comparing TOI estimates across all models.
Estimation
Model estimation occurred in Mplus using Markov chain Monte Carlo (MCMC) estimation with the Gibbs sampler (Muthén & Muthén, 1998–2017). Two chains were estimated, with the first 25,000 iterations from each chain discarded as burn-in; the posterior distribution consisted of the last 25,000 iterations from each chain. Weakly informative priors were used for the item parameters:
Convergence was assessed via visualization of trace plots and evaluation of the potential scale reduction factor (PSRF) convergence criterion (Gelman & Rubin, 1992). The PSRF convergence criterion was set to 1.02. From each of the five models, the latent trait expected a posteriori (EAP) estimate was saved. Marginal DIC was used to compare the full, single response style, and individual-scale models.
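For readers unfamiliar with the PSRF, the sketch below computes its basic form for one parameter from several chains of post-burn-in draws. It omits the degrees-of-freedom correction in Gelman and Rubin (1992), and the value reported by the estimation software may be computed differently; the function name and input layout are assumptions.

```python
import math
from statistics import mean, variance


def psrf(chains):
    """Basic potential scale reduction factor for one parameter.

    chains: list of at least two equal-length lists of post-burn-in draws.
    """
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    within = mean([variance(c) for c in chains])       # W: mean within-chain variance
    between = n * variance([mean(c) for c in chains])  # B: between-chain variance
    if within == 0:
        raise ValueError("within-chain variance is zero")
    pooled = (n - 1) / n * within + between / n        # pooled variance estimate
    return math.sqrt(pooled / within)
```

Values close to 1 indicate that the chains agree; under the criterion above, a parameter would be treated as converged when its PSRF does not exceed 1.02.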
Several previous studies provided guidelines for parameter recovery of IRTrees. For example, Ames and Myers (2020) found mean bias near zero for item difficulty parameters in IRTrees at the lowest sample size of their simulation (N = 1,000). Wang and Nydick (2015) evaluated a similar noncompensatory multidimensional IRT model, suggesting sample sizes of 1,000 are required for adequate parameter recovery without missing data. As such, the sample size of 1,199 was deemed adequate.
Results
Model Comparison and Evaluation
Marginal DIC favored the “full” model (DIC = 19910.92) over the “single response style” model (DIC = 25344.16), indicating that the model with separate MRS and ERS traits per scale provided a better fit to the data than a model that estimated a single MRS and a single ERS trait across all scales. This provides the first piece of evidence that scale-specific response styles provide a better fit to the data than a single estimate of response style across scales.
Further model evaluation was performed using Posterior Predictive Model Checking (PPMC). PPMC methods are useful in evaluating absolute model fit for IRT models (see Sinharay, 2005, 2006, for an exploration and description of PPMC for IRT). PPMC is based on the simple, intuitive principle that the data generated from the posterior distribution(s) should resemble the observed data if the model provides adequate fit for the data. The posterior predictive distribution (PPD) for replicated data is a distribution of future observable data conditioned on the observed values (Rubin, 1984). That is, the PPD is the distribution of simulated, or “replicated,” data that would arise from the model and parameters estimated from observed data. Computing the replicated samples is accomplished by randomly drawing values of the parameter(s) from the posterior distribution, which are then used to simulate the data using the model of interest. This process is repeated many times (1,000 times for this study). A discrepancy statistic is computed for each of the 1,000 replicated data sets and compared to the same discrepancy measure computed from the observed data. A posterior predictive probability (PPP) value is the proportion of replicated discrepancy statistics larger than the same discrepancy statistic calculated from the observed data. PPP values near 0.5 indicate that no systematic differences exist between replicated data and observed data with respect to the discrepancy statistic. Thus, PPP values near 0.5 indicate model fit and PPP values near 0 or 1 indicate systematic differences between replicated data and observed data.
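The PPP value defined above reduces to a simple proportion once the discrepancy statistics are in hand. The following sketch assumes the replicated statistics have already been computed from simulated data sets; the function name is hypothetical.

```python
def ppp_value(observed_stat, replicated_stats):
    """Proportion of replicated discrepancy statistics larger than the observed one."""
    if not replicated_stats:
        raise ValueError("need at least one replicated statistic")
    larger = sum(rep > observed_stat for rep in replicated_stats)
    return larger / len(replicated_stats)
```

For instance, ppp_value(0.3, [0.1, 0.2, 0.4, 0.5]) is 0.5, the no-systematic-difference case, while a value near 1 means the replicated data almost always show more of the behavior than was observed.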
Leventhal (2017) describes four approaches to PPMC for IRTrees. Because the focus of this study is on person-fit of the ERS and MRS traits, Leventhal’s fourth approach, exploration of the proportion of individuals who selected extreme response options and midpoint response options as discrepancy measures, is used in this study. Specifically, 1,000 replicated data sets for the “full” model (Model 1) and the “single response style” model (Model 2) were generated. For each respondent, the proportions of extreme and midpoint responses were computed for the 1,000 replicated data sets and the observed data, resulting in four PPP values per respondent: one PPP value per model, per trait.
PPP values were averaged across all respondents with the results presented in Figures 5 and 6 for ERS and MRS, respectively. Respondents were “binned” based on their average proportion of extreme responses and midpoint responses in the observed data. For example, a respondent who provided 11 extreme responses across the 34 items of the three scales provided a proportion of .32 extreme responses. They were included in the bin .3. Figure 5 shows that for Model 1, the full model, average PPP values are near .5 for most bins of extreme responders. For respondents with a low level of extreme responses (i.e., extreme proportions of .2 and below), the single response style model (Model 2) does a poor job of modeling their extreme responses, with average PPP values above the .95 level. The high average PPP values indicate Model 2 tends to overestimate the extreme responses in the replicated data for those providing an initially low level of extreme responses. As the proportion of extreme responses increases, the average PPP values decrease for Model 2, approaching the average PPP value for Model 1 for high extreme responders.
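The binning and averaging step can be sketched as follows. The source does not state whether proportions were rounded or truncated to form bins; rounding to one decimal place is an assumption here, as are the function and argument names.

```python
from collections import defaultdict


def average_ppp_by_bin(observed_props, ppp_values):
    """Average respondents' PPP values within bins of observed proportion.

    observed_props: each respondent's observed proportion of extreme
    (or midpoint) responses; ppp_values: the matching PPP values.
    Bins are formed by rounding to one decimal (an assumption).
    """
    bins = defaultdict(list)
    for prop, ppp in zip(observed_props, ppp_values):
        bins[round(prop, 1)].append(ppp)
    return {b: sum(v) / len(v) for b, v in sorted(bins.items())}
```

Under this rule, the respondent with 11 extreme responses out of 34 items (proportion .32) falls in bin .3, as in the example above.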

Average PPP values: Proportion ERS as discrepancy value.

Average PPP values: Proportion MRS as discrepancy value.
Figure 6 shows that for Model 1, the full model, average PPP values are between .2 and .5 for most bins of midpoint responders. For respondents with a low level of midpoint responses (i.e., midpoint proportions of .1 and below), the single response style model (Model 2) does a poor job of modeling their midpoint responses, with average PPP values above the .95 level. The high average PPP values indicate Model 2 tends to overestimate the midpoint responses in the replicated data. As the proportion of midpoint responses increases, the average PPP values decrease.
Midpoint Response Style
Table 3 provides the Pearson correlations for EAP estimates of MRS for Models 1 to 4 in the upper triangle. The social media scale is a 4-point scale that does not include a midpoint. Thus, no midpoint pseudo-item is created and no MRS trait is estimated for the social media scale. In addition, EAP estimates were rank-ordered into 5 groups and Table 3 also includes Spearman rank-order correlations for Models 1 to 4 in the lower triangle. All correlations were positive and statistically significant. However, the magnitude of correlations varied considerably.
Pearson and Spearman Rank-Order Correlations for MRS Trait Among Models.
Note. MC is medicinal cannabis; CC is climate change; Full is the “full” model (Model 1); Single RS is Model 2, the “Single Response Style” model; and Alone is Model 3 or 4 in which a single scale is fit to the IRTree.
The largest Pearson correlation was between the MRS EAP estimates for climate change in the full model (Model 1) and the single-scale climate change model (Model 4; r = .983), representing a large effect size (Cohen, 1992). The smallest correlation was between the single-scale medicinal cannabis (Model 3) and single-scale climate change (Model 4; r = .331), representing a small to moderate effect size (Cohen, 1992). When comparing the MRS EAP estimates from the full model (Model 1) to the single response style model (Model 2) the correlations were large and positive (r = .691, .893 for the medicinal cannabis and climate change scales, respectively), but not as strong as the correlations from the full model (Model 1) with the single-scale models (r = .937, .983 for the medicinal cannabis and climate change scales, respectively). Taken together, evidence implies a positive relationship between MRS estimates across traits, but that the single response style model (Model 2) does not measure an individual respondent’s level of the MRS trait as well as when the trait is estimated separately by scale.
Spearman rank-order correlations are found in the lower triangle of Table 3, revealing correlations similar in magnitude to the Pearson correlations. Rankings are especially revealing in visualizing the differences among the model MRS estimates, presented in Figure 7. Ranks for the single response style model (Model 2) are given different shapes and shadings. For example, the open triangle is the top-ranked group from Model 2. For the most part, rankings are preserved across models. However, rankings were inconsistent between the full model and single response style model (see Figure 7 and Table 4) for both medicinal cannabis and climate change. For example, only 57.32% of individuals ranked in the lowest group for MRS from the medicinal cannabis scale in Model 1 were ranked in the lowest group in Model 2. Of those ranked in the second-lowest group for MRS from the medicinal cannabis scale in Model 1, only 28.75% were ranked in the second-lowest group from the single response style model (Model 2). Rankings were somewhat more consistent for climate change MRS estimates (e.g., 74.9% of individuals ranked in the lowest group for MRS from the climate change scale in Model 1 were ranked in the lowest group in Model 2). These findings illustrate that the single response style model does not adequately represent MRS traits across multiple scales, providing further evidence for Research Question 1 that response styles are not stable across unrelated scales.
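The rank-group comparison reported above can be illustrated with a short sketch. How the five groups were formed is not stated in the text; equal-sized groups with ties broken by input order are assumed here, and the function names are hypothetical.

```python
def rank_groups(estimates, n_groups=5):
    """Assign each estimate to one of n_groups equal-sized rank groups (0 = lowest)."""
    order = sorted(range(len(estimates)), key=lambda i: estimates[i])
    groups = [0] * len(estimates)
    for position, i in enumerate(order):
        groups[i] = position * n_groups // len(estimates)
    return groups


def same_group_percentage(groups_a, groups_b, group):
    """Percentage of respondents in `group` under model A who are in it under model B."""
    in_a = [gb for ga, gb in zip(groups_a, groups_b) if ga == group]
    if not in_a:
        raise ValueError("no respondents in the requested group")
    return 100.0 * sum(gb == group for gb in in_a) / len(in_a)
```

Calling same_group_percentage with the groups from two models and group 0 yields the kind of figure quoted for the lowest group, that is, the diagonal cell of a row-percentage crosstab.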

Midpoint response style relationship across models.
Crosstabs of MRS Ranks (Percentage).
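The rank-group comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the study: the function name `rank_group_crosstab`, the simulated estimates, and the choice to form five equal-sized groups from ranks are all hypothetical. Each row of the resulting table gives the percentage of respondents from one rank group in the first model who fall into each rank group in the second model.

```python
import numpy as np
import pandas as pd


def rank_group_crosstab(est_a, est_b, n_groups=5):
    """Row-percentage crosstab of rank groups for two sets of estimates.

    est_a, est_b: per-respondent estimates (e.g., EAP scores) from two models.
    Group 1 is the lowest-ranked group; group n_groups is the highest.
    """
    labels = list(range(1, n_groups + 1))
    # Rank first (ties broken by order) so that equal-sized groups can always be formed.
    grp_a = pd.qcut(pd.Series(est_a).rank(method="first"), n_groups, labels=labels)
    grp_b = pd.qcut(pd.Series(est_b).rank(method="first"), n_groups, labels=labels)
    # normalize="index" turns counts into row proportions; multiply to get percentages.
    table = pd.crosstab(
        grp_a, grp_b, rownames=["model_a"], colnames=["model_b"], normalize="index"
    ) * 100
    return table.round(2)


# Simulated (not real) estimates that are positively but imperfectly related.
rng = np.random.default_rng(0)
full = rng.normal(size=1000)
single = 0.7 * full + rng.normal(scale=0.7, size=1000)
table = rank_group_crosstab(full, single)
print(table)
```

The diagonal of the table is the share of respondents who keep the same rank group in both models; values well below 100% on the diagonal correspond to the kind of inconsistency reported in the text.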
Extreme Response Style
Table 5 provides the Pearson correlations for EAP estimates of ERS for Models 1 to 5 in the upper triangle. In addition, EAP estimates were rank-ordered into 5 groups, and Table 5 also includes Spearman rank-order correlations for Models 1 to 5 in the lower triangle. Similar to measurement of MRS, all correlations were positive and statistically significant. The largest Pearson correlation was between the ERS EAP estimates for social media in the full model (Model 1) and the single-scale social media model (Model 5; r = .966), representing a large effect size (Cohen, 1992). The smallest correlation was between the single-scale medicinal cannabis (Model 3) and single-scale social media (Model 5; r = .213) models, representing a small to moderate effect size (Cohen, 1992). When comparing the ERS EAP estimates from the full model (Model 1) to the single response style model (Model 2), the correlations were large and positive (r = .909, .721, .491 for the medicinal cannabis, climate change, and social media scales, respectively), but not as strong as the correlations from the full model (Model 1) with the single-scale models (Models 3-5; r = .939, .961, .966 for the medicinal cannabis, climate change, and social media scales, respectively). Similar to measurement of MRS, the evidence implies a positive relationship between ERS measures across traits, but also that the single response style model (Model 2) does not measure an individual respondent’s level of the ERS trait as well as when the trait is estimated separately by scale.
Pearson and Spearman Rank-Order Correlations for ERS Trait Among Models.
Note. MC is medicinal cannabis; CC is climate change; Full is the “full” model (Model 1); Single RS is Model 2, the “Single Response Style” model; and Alone represents Models 3 to 5 in which a single scale is fit to the IRTree.
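A table laid out like Table 5, with Pearson correlations above the diagonal and Spearman rank-order correlations below it, can be assembled as in the following sketch. The column names and simulated data are hypothetical, and the Spearman correlations here are computed on the raw estimates, which is an assumption about how the rank-order correlations were obtained. The `cohen_label` helper applies the conventional benchmarks from Cohen (1992) of .10, .30, and .50 for small, medium, and large correlations; the "negligible" label is added here for illustration only.

```python
import numpy as np
import pandas as pd


def combined_correlation_table(estimates: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlations in the upper triangle, Spearman in the lower."""
    pearson = estimates.corr(method="pearson").to_numpy()
    spearman = estimates.corr(method="spearman").to_numpy()
    n = pearson.shape[0]
    upper = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    combined = np.where(upper, pearson, spearman)
    np.fill_diagonal(combined, np.nan)  # the diagonal is not informative
    return pd.DataFrame(combined, index=estimates.columns, columns=estimates.columns)


def cohen_label(r: float) -> str:
    """Conventional effect size label for a correlation (Cohen, 1992 benchmarks)."""
    r = abs(r)
    if r >= 0.5:
        return "large"
    if r >= 0.3:
        return "medium"
    if r >= 0.1:
        return "small"
    return "negligible"  # label assumed for illustration


# Simulated (not real) ERS estimates for one scale under three hypothetical models.
rng = np.random.default_rng(1)
base = rng.normal(size=500)
estimates = pd.DataFrame({
    "MC_full": base,
    "MC_single_rs": 0.9 * base + rng.normal(scale=0.4, size=500),
    "MC_alone": 0.95 * base + rng.normal(scale=0.3, size=500),
})
print(combined_correlation_table(estimates).round(3))
```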
Spearman rank-order correlations are found in the lower triangle of Table 5, revealing correlations similar in magnitude to the Pearson correlations. Rankings were inconsistent between the full model and the single response style model (see Figure 8 and Table 6), especially for the social media scale estimates. For the medicinal cannabis scale, which had a large proportion of extreme responses (Table 1; 0.46), the single ERS trait adequately captures the rankings. This was not the case for the social media scale, for which the proportion of extreme responses was only 0.08. For example, only 45.19% of individuals ranked in the lowest group for social media ERS in Model 1 were ranked in the lowest group in Model 2. Of those ranked in the second-lowest group for social media ERS in Model 1, only 20.33% were ranked in the second-lowest group in the single response style model (Model 2). Rankings were more consistent for medicinal cannabis ERS estimates (e.g., 85.77% of individuals ranked in the lowest group for medicinal cannabis ERS in Model 1 were ranked in the lowest group in Model 2). These findings illustrate that the single response style model does not adequately represent ERS traits across multiple scales, providing further evidence for Research Question 1 that response styles are not stable across unrelated scales and different item types.

Extreme response style relationship across models.
Crosstabs of ERS Ranks (Percentage).
Trait of Interest
While stability of response style is the focus of the study, estimation of response style will impact TOI estimation in the multidimensional model. Figure 9 presents scatterplots of the TOI EAP estimates across all 5 models. Similar to response styles, the relationships between same-trait estimates from different models are strong and positive. However, for the medicinal cannabis trait, the choice of response style model also seems to impact estimation of the TOI. For example, the scatterplot comparing the medicinal cannabis EAP estimates from the full model (Model 1) with those from the single response style model (Model 2) showed a distinct fan shape at the lower end of the scale. The same shape is seen for the climate change scale, but not for the social media scale. In fact, the estimation of the social media scale was remarkably stable across models, as indicated by the very tight scatterplot between model estimates.

Trait of interest relationship across models.
Discussion
Using an empirical example, exploration of Research Question 1 showed that extreme and midpoint response styles were not stable across scales within a single testing administration, challenging the notion that systematic response tendencies such as ERS and MRS are stable across scales (e.g., for ERS: Aichholzer, 2013; Bolt & Newton, 2011; Wetzel et al., 2013). The single response style model is not appropriate for response style measurement across multiple, distinct scales. In support of this notion, marginal DIC favored the “full” model over the “single response style” model, indicating that the model with separate MRS and ERS traits per scale provided a better fit to the data than a model that estimated a single MRS and a single ERS trait across all scales. This was supported by the PPMC approach, which indicated that proportions of extreme and midpoint responses were not adequately captured using the “single response style” model, which tended to overestimate extreme and midpoint responses in the replicated data for respondents with a low level of observed extreme and midpoint responses. On average, these proportions tended to be captured using the “full” model.
There were positive correlations in measures of response style across models and traits. These findings agree with previous attempts to validate the use of IRTrees. Plieninger and Meiser (2014) examined responses to nine 7-point ordinal items measuring self-confidence that were recoded into dichotomous pseudo-items reflecting MRS, a content-related process (i.e., TOI), and ERS. To evaluate the validity of the response style processes, 60 heterogeneous items were used as an extraneous response style measure, which was correlated with the ERS process. Plieninger and Meiser (2014) found convergent evidence of construct and criterion validity of the multiprocess model, citing a strong correlation between the peripheral measure of ERS and ERS from the IRTree, and between the peripheral measure of MRS and MRS from the IRTree. This study extends beyond Plieninger and Meiser (2014) by examining interrelationships of multiple scales and measures of response style across scales.
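The recoding of ordinal responses into dichotomous pseudo-items can be illustrated with a short sketch. The example below assumes a 5-point item and a common three-node decomposition (midpoint, direction, extremity); the exact tree structure and coding used in this study and in the 7-point application described above are not shown in this section, so the function `recode_five_point` and its coding scheme should be read as an assumption for illustration. Pseudo-items that are not reached in the tree (direction and extremity after a midpoint response) are coded as missing.

```python
import numpy as np


def recode_five_point(responses):
    """Recode 5-point responses (1-5) into three binary pseudo-items.

    Returns (midpoint, direction, extreme) arrays:
      midpoint  = 1 if the response is 3, else 0            (MRS process)
      direction = 1 for agreement (4, 5), 0 for disagreement (1, 2)  (TOI process)
      extreme   = 1 for an endpoint (1, 5), 0 for 2 or 4    (ERS process)
    direction and extreme are NaN (missing) when the midpoint is chosen.
    Missing responses (NaN) stay missing on all three pseudo-items.
    """
    x = np.asarray(responses, dtype=float)
    missing = np.isnan(x)
    skipped = missing | (x == 3)  # later nodes are not reached after a midpoint response
    midpoint = np.where(missing, np.nan, (x == 3).astype(float))
    direction = np.where(skipped, np.nan, (x > 3).astype(float))
    extreme = np.where(skipped, np.nan, ((x == 1) | (x == 5)).astype(float))
    return midpoint, direction, extreme


midpoint, direction, extreme = recode_five_point([1, 2, 3, 4, 5])
print(midpoint)   # [0. 0. 1. 0. 0.]
print(direction)  # [ 0.  0. nan  1.  1.]
print(extreme)    # [ 1.  0. nan  0.  1.]
```

Under the same assumption, a 4-point item without a midpoint would need only the direction and extremity pseudo-items.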
To answer Research Question 2, across item types, ERS traits tended to be positively correlated (Table 5). However, the correlations were stronger between item types that had the same number of Likert response options (i.e., 5) than between item types that had a different number of response options. This agrees with Weijters, Cabooter, and Schillewaert (2010), who reported that including a midpoint (i.e., 5-point Likert items) resulted in different levels of extreme responses than items without a midpoint (i.e., 4-point Likert items). However, in this study the inclusion of a midpoint resulted in more extreme responses. The 5-point items (medicinal cannabis, climate change) contained a higher proportion of extreme responses (.46 and .25, respectively) than the 4-point items (social media; .08). The large proportion of extreme responses for medicinal cannabis is the likely reason for the strong correlation between the single response style model measure of ERS (Model 2) and the medicinal cannabis-specific measure of ERS in Model 1 (r = .909) and Model 3 (r = .854), and for the consistent rank-ordering of individuals across models in terms of their ERS estimates. In contrast, the low proportion of extreme responses for social media, a 4-point Likert scale, resulted in a much weaker correlation between the single response style model measure of ERS (Model 2) and the social media-specific measure of ERS in Model 1 (r = .491) and Model 5 (r = .397). The correlation between the Model 1 and Model 2 ERS EAP estimates for social media (r = .491) is of particular note because it is much weaker than the corresponding correlations for the other scales. Because the scales contain different content, it is unclear whether the difference in proportion of extreme responses is due to the item format or the content.
The climate change scale contained a larger proportion of midpoint responses (.22) than the medicinal cannabis scale (.10). Similar to ERS, the larger proportion of midpoint responses for climate change is the likely reason for the stronger correlation between the single response style model measure of MRS (Model 2) and the climate change-specific measure of MRS in Model 1 (r = .893) and Model 4 (r = .894). However, neither scale had a consistent rank-ordering of individuals across models, particularly for the middle three of five ranked groups. This is in contrast to Jeon and De Boeck (2019), who state that it is “reasonable to assume a general extreme responding latent variable that overarches the extreme responding latent variables across multiple sub-scales” (p. 533). They did not, however, compare across different traits, but across sub-scales representing similar traits. In agreement with that study, this study supports the use of generalized IRTree models that are flexible in the structure of the latent traits.
To aid in answering Research Question 3, Table 7 presents the correlations of TOI estimates across models. Correlations among TOI estimates tended to be highest within the full model and weakest among the single-scale models. However, the differences were negligible. Examining EAP estimates across models showed that an appropriate response style model is important for adequate TOI estimation using MCMC estimation. The scatterplot (Figure 9) revealed that measurement of the TOI varied across response style models at the lower end of the score distribution for two of the three scales (i.e., medicinal cannabis and climate change), but not for the third scale (i.e., social media). The medicinal cannabis and climate change scales both use 5-point Likert items and had high proportions of extreme responses, which could be one reason for the varying estimation of the TOI across response style models.
Correlations Among TOI Estimates.
Limitations
Even though relative fit using marginal DIC was used to compare models, absolute fit of each model was not evaluated. Mplus provides posterior predictive model checking (PPMC) using the chi-square discrepancy statistic, but this discrepancy statistic has not been evaluated for use with IRTree models. Future research should examine model-data fit evaluation using additional discrepancy measures for PPMC for IRTrees. This study used the proportion of midpoint and extreme responders as discrepancy measures, but their use could be further examined with a simulation study. In addition, the choice of weakly informative priors on item parameters requires more investigation, not only in the specific models presented here, but for IRTrees in general.
The order of the scales remained constant across respondents: first, medicinal cannabis; second, climate change; and last, social media. There is limited research into item order, scale length, and whether the cognitive load of responding to multiple surveys impacts how respondents rely on response styles in providing responses. In addition, the presence of negatively worded items warrants more consideration. Park and Wu (2019) found that individuals tended to avoid using the lower extreme category, regardless of whether the item was positively or negatively worded, but future research could examine whether scales without negatively worded items show more stability in response styles. Despite the need for more research, this study provides an important contribution to research on the accurate estimation and conceptualization of response style latent traits.
Footnotes
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
