Abstract
The Rosenberg Self-Esteem Scale was administered with a 1–4, 1–5, or 0–100 scale to 819 participants, to compare score interpretations across the different versions. A rating scale utility analysis revealed that the categories in the 101-point scale were used inconsistently; based on the analysis, adjacent categories were collapsed, resulting in a 7-point scale with psychometric properties almost identical to those of the original. Scores from the 101-point scale could be misinterpreted when compared with scores from the 4- and 5-point versions.
The use of Likert and Likert-type scales is ubiquitous in the fields of education and psychology. Because of people’s familiarity with the 0–100 scale in educational grading, some have argued that scales should be administered with response categories from 0 to 100, rather than the typical Likert scale (Pajares et al., 2001). Using the Rosenberg Self-Esteem Scale (RSES; Rosenberg, 1965), this study addresses issues in interpreting scales with a 0–100 scale, compared with a typical 4- or 5-point Likert scale. The RSES was selected because it measures a construct that people of different ages can relate to and because it has a long track record of validity studies when administered as a 4- or 5-point Likert scale. This study was designed to provide guidance to those who wish to take an existing 4- or 5-point Likert scale and administer it as a 0–100 scale.
Research on the efficacy of 100- or 101-point rating scales has produced mixed results, with different researchers focusing on different aspects of the results. Maurer and Pierce (1998) compared traditional measures of self-efficacy, most scored from 0% to 100%, with the 5-point Likert format and found similar factor structure, reliability, and predictive validity with class performance across the two versions. Toland and Usher (2016) used a mathematics self-efficacy measure to compare a Likert scale with a 0–100 scale; using the Rasch rating scale model to analyze both versions of the scale, they found that item reliability was lower and item separation worse in the 0–100 scale than in the 4-point Likert scale. Research comparing shorter 4- and 5-point scales with 100- or 101-point scales has therefore been inconclusive about which is preferable: most studies demonstrated equivalent psychometric properties, whereas others, like Toland and Usher (2016), demonstrated that the shorter scale was preferable.
The purpose of this study was to compare the interpretations of the same scale when administered as a 4- or 5-point Likert-type scale (RSES4, RSES5) or on a scale from 0 to 100 (RSES100). First, Linacre’s (1999) Rasch-based rating scale utility analysis was conducted on all three versions. The rating scale utility analysis determines how much of the latent trait each response category covers and includes a set of guidelines that takes into account how the response categories were used, whether larger response categories were associated with higher levels of the trait, and whether the category width is too narrow or too wide to be meaningful. Using these guidelines, the researcher determines whether certain adjacent response categories should be collapsed, when categories are so narrow as to not have distinct meanings.
Method
Participants
There were 819 participants included in the analyses presented here, 53.4% male and 46.6% female, ranging in age from 18 to 85 (M = 36.5, Mdn = 33, SD = 11.9) years. Participants were recruited on Amazon’s Mechanical Turk and paid $0.75. Most participants had some college (32.0%) or at most a 4-year degree (41.8%), while 16.2% had attended or completed graduate school, and 9.9% reported only high school, high school equivalent, or less. Most of the participants were from the United States (85.8%), with 8% from India and the rest from 21 other countries. A total of 87 participants were removed from the original sample of 906 because they failed a validity check (an item with “Respond Not at all true for this item”) or responded in fewer than 100 s, which was one third of the median response time of the first 100 participants. Participants were randomly assigned to the three study conditions; there were no significant differences across the three groups with respect to age, sex, and education level.
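The two exclusion rules described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the function names and the example timings are hypothetical, and only the rule itself (a failed validity check, or a response time below one third of the median of an initial batch) comes from the text.

```python
from statistics import median


def response_time_cutoff(initial_times, divisor=3):
    """Cutoff in seconds: the median of an initial batch of response times
    divided by `divisor` (one third of the median when divisor is 3)."""
    return median(initial_times) / divisor


def keep_participant(response_time, passed_validity_check, cutoff):
    """Keep a participant only if the validity item was answered as
    instructed and the response time is not below the cutoff."""
    return passed_validity_check and response_time >= cutoff


# Hypothetical timings (seconds), invented for illustration only.
example_cutoff = response_time_cutoff([240, 300, 360])
```

With the invented timings above, the median is 300 s, so the cutoff is 100 s, matching the threshold reported in the text.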
Instruments
Three versions of the RSES were administered (RSES4, RSES5, and RSES100). The RSES is a 10-item self-report scale most frequently administered with four or five response categories (RSES4 or RSES5). Previous studies reported a coefficient alpha ranging from .72 to .85 (Gray-Little et al., 1997).
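For readers unfamiliar with coefficient alpha, the standard formula can be sketched as below. The study does not describe how its reliabilities were computed, so this is an assumption based on the usual definition, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores); the function name and data layout are illustrative.

```python
from statistics import variance


def coefficient_alpha(item_scores):
    """Coefficient (Cronbach's) alpha.

    item_scores: one list per item, each holding that item's responses,
    with respondents in the same order in every list.
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Total score per respondent: sum across items.
    totals = [sum(responses) for responses in zip(*item_scores)]
    # Sum of the sample variances of the individual items.
    item_variance_sum = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))
```

Calling `coefficient_alpha` on ten lists of item responses (one per RSES item) would give the kind of reliability estimate reported for each version.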
The third version was administered with a response scale from 0 to 100 (RSES100). The 0–100 scale was presented as a scale with a sliding pointer. The respondents slid the bar that was originally located at the left-hand side of the slider to indicate their level of agreement from 0 to 100, with markers indicating every 10-point interval. When the participants moved the bar, they could see the numeric equivalent of the pointer’s location.
To ensure that criterion validity was stable across the three versions of RSES, we administered the 20-item State Self-Esteem Scale (SSE; Heatherton & Polivy, 1991). The SSE has three self-esteem subscales: appearance, social, and performance, which correlate moderately with RSES (r = .68, .58, and .57, respectively; Heatherton & Polivy, 1991).
Results and Discussion
Descriptive Statistics and Psychometric Properties
Across the three versions of the RSES, there was little difference in item statistics, total scores, coefficient α (see Table 1), or criterion validity (see Table 2). (Note: To facilitate comparisons of summary statistics, average response means and standard deviations were transformed so that all three were on a 0–100 scale. This was done by subtracting 1 and multiplying by 33 1/3 for the 1–4 scale or by 25 for the 1–5 scale.)
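The rescaling in the note can be written out as a short sketch. The function names are illustrative assumptions. One detail is made explicit here that the note leaves implicit: a standard deviation is unchanged by subtracting a constant, so only the multiplier affects it.

```python
def rescale_mean_to_100(value, n_categories):
    """Map a response or mean on a 1..n_categories scale onto 0-100.

    Subtracting 1 sends the lowest category to 0; the multiplier
    100 / (n_categories - 1) is 33 1/3 for a 1-4 scale and 25 for a 1-5 scale.
    """
    if n_categories < 2:
        raise ValueError("need at least two categories")
    return (value - 1) * 100 / (n_categories - 1)


def rescale_sd_to_100(sd, n_categories):
    """Rescale a standard deviation: the shift by 1 has no effect on
    spread, so only the multiplier is applied."""
    if n_categories < 2:
        raise ValueError("need at least two categories")
    return sd * 100 / (n_categories - 1)
```

For example, a mean of 3 on the 1–5 scale maps to 50, and the top category of either scale maps to 100.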
Table 1. Descriptive Statistics for Each Version of RSES.
Note. In item to total score correlations, total was computed without that item. RSES = Rosenberg Self-Esteem Scale.
Table 2. Correlations Between Different Versions of RSES and SSE Subscales.
Note. The correlations were negative because lower scores on the RSES indicate higher levels of self-esteem. SSE-App = State Self-Esteem appearance subscale; SSE-Soc = State Self-Esteem social subscale; SSE-Perf = State Self-Esteem performance subscale; RSES = Rosenberg Self-Esteem Scale.
Rating Scale Utility Analysis
Following Linacre’s (1999) guidelines, the rating scale utility analyses were conducted on all three RSES versions using Facets version 3.71.4 (Linacre, 2014). The RSES4 and RSES5 demonstrated regular, adequate use of response categories: increasing categories were associated with increasing levels of the latent trait, and each category was the most likely response at some point along the latent trait scale. These results indicated that no adjacent response categories needed to be collapsed. However, the RSES100 rating scale utility analysis revealed that many of the response options went unused by participants, and even some response options that were used were never the most probable, meaning that many adjacent categories were associated with the same section of the latent trait scale. These findings indicated that many adjacent categories should be collapsed because the participants did not, in fact, interpret them as representing distinct locations along the latent trait scale. Several iterations of grouping adjacent categories resulted in seven categories, rather than the original 101. The resulting version, RSES100.7, was based on the following regroupings: (0) 0–3, (1) 4–16, (2) 17–41, (3) 42–66, (4) 67–81, (5) 82–96, and (6) 97–100; it met the guidelines for adequate rating scale use. After recoding the RSES100 responses into these seven categories, the descriptive statistics and correlations with the criterion measures were similar to those of the other three versions (see Tables 1 and 2).
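The recoding from the 0–100 responses into the seven collapsed categories can be sketched as follows. The category boundaries are those reported above; the function names, and the assumption that slider responses are integers, are illustrative and not taken from the study.

```python
from bisect import bisect_right

# Lowest raw value belonging to collapsed categories 1 through 6,
# taken from the regroupings reported in the text (category 0 starts at 0).
LOWER_BOUNDS = [4, 17, 42, 67, 82, 97]


def collapse_response(raw):
    """Recode one 0-100 response into its collapsed category (0-6)."""
    if not 0 <= raw <= 100:
        raise ValueError("response must lie between 0 and 100")
    # Count how many category lower bounds are at or below the raw value.
    return bisect_right(LOWER_BOUNDS, raw)


def collapsed_total(raw_responses):
    """Total score on the collapsed scale for one respondent's item responses."""
    return sum(collapse_response(r) for r in raw_responses)
```

Applied to all 101 possible responses, the recoding assigns 4, 13, 25, 25, 15, 15, and 4 raw values to categories 0 through 6, which makes the uneven category widths easy to see.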
For a given item, if subjects with the same underlying level of self-esteem can be associated with a wide range of values, then this problem is magnified when scores are totaled over all items in the scale. The issue is that several people with the same level of self-esteem could have very different total scores. Figure 1 depicts the relationship between total scores from the RSES100.7 and the original total scores from RSES100.

Figure 1. The range of total scores on the original RSES100 corresponding with each possible total score on RSES100.7.
Danger of Misinterpreting Scores From a 0–100 Scale
In RSES100, many of the categories did not represent distinct locations along the self-esteem scale. In other words, while respondents were selecting different numbers, the location along the scale of the underlying trait was not distinct. In Linacre’s rating scale utility analysis, it is critical to determine whether each category is, at some place along the scale, the most probable category. This criterion was met only after collapsing many adjacent categories so that there were only seven categories, not the original 101. The inconsistency in the number of categories collapsed together, such as 0–3 versus 17–41 in the first and third categories, is a result of the subjects’ uneven use of the response categories. A few extreme values were used as much as, or much more than, many values in the middle, leading to the appearance of uneven categories. The rating scale utility analysis revealed that, when accounting for the location of the respondents on the self-esteem scale, there were people with the same level of self-esteem who were responding with a wide range of values to the same item.
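The "most probable category" check can be illustrated with the standard form of the Rasch (Andrich) rating scale model, in which the probability of category k is proportional to the exponential of the cumulative sum of (θ − δ − τ_j) over thresholds j = 1..k. The sketch below is an illustration under that assumption, not the Facets implementation, and it covers only this one check among the guidelines; the function names, the θ grid, and the threshold values in the usage note are hypothetical.

```python
import math


def category_probabilities(theta, item_difficulty, thresholds):
    """Category probabilities under the rating scale model for one item.

    thresholds holds tau_1..tau_m; the result has m + 1 entries
    (categories 0..m).
    """
    # Cumulative log-numerators; category 0 has an empty sum, i.e. 0.
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + (theta - item_difficulty - tau))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def modal_categories(item_difficulty, thresholds, lo=-6.0, hi=6.0, steps=1201):
    """Set of categories that are the most probable response at some
    trait level on an evenly spaced grid from lo to hi."""
    modal = set()
    for i in range(steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        probs = category_probabilities(theta, item_difficulty, thresholds)
        modal.add(probs.index(max(probs)))
    return modal
```

With ordered hypothetical thresholds such as `[-1.5, 0.0, 1.5]`, every category is modal somewhere on the grid; with disordered thresholds such as `[-1.0, 1.0, 0.0]`, category 2 is never the most probable, which is the kind of pattern that would prompt collapsing adjacent categories.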
While some could consider collapsing categories as resulting in lost information, the techniques are designed to collapse only adjacent responses that were associated with the same location along the trait scale; although these responses covered a wide range of score points, they were not associated with unique levels of the trait. In fact, the act of collapsing categories guards against the over-interpretation of the original total scores.
For example, there were eight subjects with a total score of 30 on the RSES100.7, as seen in Figure 1. These eight subjects originally had scores on RSES100 ranging from 494 to 557. When practitioners interpret scores from RSES100, they would likely conclude that scores that differed by over 60 points must belong to people who significantly differ in self-esteem. The standard error of measurement (SEM) on RSES100 was 5.5 points, so scores within two SEMs, or 11 points, of each other would not be considered significantly different. However, many score differences were well beyond 11 points, which a reasonable person would interpret as real differences when, in fact, the scores correspond to the same location on the self-esteem scale. Of the 50 different observed total scores on the RSES100.7 scale, 39 were associated with original RSES100 score ranges greater than two SEMs. The median range across the 50 RSES100.7 score totals was 47. This means that practitioners using scores from such a scale would conclude that people had very different levels of the trait of interest when, in fact, they are located at almost the same location along the underlying construct scale.
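The comparison described above, the range of original RSES100 totals within each RSES100.7 total set against two SEMs, can be sketched as follows. The SEM of 5.5 and the endpoints 494 and 557 come from the text; every other value in the example data is invented purely for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

SEM_RSES100 = 5.5  # standard error of measurement reported in the text


def raw_ranges_by_collapsed_total(pairs):
    """Range (max - min) of original totals for each collapsed total.

    pairs: iterable of (collapsed_total, original_total), one per respondent.
    """
    groups = defaultdict(list)
    for collapsed, original in pairs:
        groups[collapsed].append(original)
    return {total: max(values) - min(values) for total, values in groups.items()}


def count_ranges_exceeding(ranges, threshold=2 * SEM_RSES100):
    """Number of collapsed totals whose original-score range exceeds the
    threshold (two SEMs, i.e. 11 points, by default)."""
    return sum(1 for width in ranges.values() if width > threshold)


# Illustrative data: 494 and 557 are the endpoints reported for a collapsed
# total of 30; all remaining values are made up for this example.
EXAMPLE_PAIRS = [(30, 494), (30, 520), (30, 557), (12, 210), (12, 216)]
EXAMPLE_RANGES = raw_ranges_by_collapsed_total(EXAMPLE_PAIRS)
```

In the example data, the collapsed total of 30 spans 63 original points, well beyond the 11-point band, while the invented total of 12 spans only 6.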
Conclusion
This study demonstrated that a scale with more response categories provided the illusion that there is more information or that respondents can better express themselves, when in fact, the scores may be misleading. The study did have several limitations: a small percentage of participants were likely non-native English speakers, which may have affected how respondents interpreted some of the items, and because the number of response options was manipulated for only one scale, generalizations are not possible to all 101-point scales. Even with these limitations, it is hard to imagine a scale where the respondents could adequately distinguish among 101 scale points for a given construct. Because of this, there is the danger that scores on such a scale would be misinterpreted as being distinct when there is, in fact, no difference with respect to the underlying trait. Depending on how a proposed scale is to be used, decisions about the similarity or differences of individuals based on such a scale may be misleading, at best, or incorrect, at worst.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
