Psychological Distance Between Categories in the Likert Scale

Abstract

This study examined whether the number of options in the Likert scale influences the psychological distance between categories. The most important assumption when using the Likert scale is that the psychological distance between options is equal. To reveal the influence of the number of categories, the authors proposed a new algorithm for calculating the scale values of options by applying item response theory and the ideas of Wakita. Three types of questionnaires that were composed of the same items, but used different numbers of options to assess these items (specifically, 4-, 5-, and 7-point scales), were completed by 772 undergraduate students. The results indicated that the number of options influenced the psychological distance between options, particularly for the 7-point scale. This influence was revealed only by the authors’ algorithm; descriptive statistics and coefficients of reliability did not show that the number of options had a prominent influence. The importance of the number of options and the new algorithm are discussed.

Keywords

Likert scale, number of options, item response theory

Background

The Likert scale is the most commonly used psychometric scale among psychological measurements that require self-reporting. For this scale, it is assumed that if the psychological distance between categories is equal, the scale will provide exact measurements of the psychological trait being assessed. This assumption about the psychological distance between categories is the most important factor in the Likert scale. However, no conclusion has been reached regarding the influence of a different number of options on the Likert scale, and no previous research has examined the impact of the number of options on the psychological distance between the options.

The number of options has been a central issue for researchers in extracting information from participants since Garner (1960) reported that psychological scales require more than 20 categories to derive complete information from answers. A decade later, Green and Rao (1970) reported that six or seven categories were appropriate. In contrast, Schuts and Rucker (1975) suggested that the number of options might not affect participants’ responses. Consequently, no consensus has been reached regarding the number of options required.

Most Likert scales include four to seven categories. An odd number of options is used when researchers need a neutral anchor, such as “Neither agree nor disagree,” whereas an even number of options is used when researchers intend to elicit participants’ opinions or attitudes through answers such as “Agree” or “Disagree.”

Previous research has also investigated the appropriate number of options from the perspective of statistical reliability. Lissitz and Green (1975) and Boote (1981) suggested that a 5-point scale was reliable. Cicchetti, Showalter, and Tyrer (1985) examined interrater reliability using a Monte Carlo simulation and reported an increase in reliability when the number of categories was less than eight. Oaster (1989) indicated that a 7-point scale showed the highest test–retest reliability. Preston and Colman (2000) also revealed that a scale with two to four categories showed the lowest test–retest reliability, and a scale with seven or more categories showed the highest test–retest reliability; however, there was no relation between the number of options and criterion-related validity among scales with 2 to 11 categories. These results indicate that 7-point scales are likely to show higher reliability than scales with any other number of options. Chang (1994) compared 4- and 6-point scales for the same items and suggested that an increase in the number of categories did not always result in higher reliability. However, other studies have indicated that reliability is independent of the number of options (Bendig, 1953, 1954; Brown, Wilding, & Coulter, 1991; Komorita, 1963; Matell & Jacoby, 1971). In these previous studies, the number of options was discussed from the perspective of reliability, which estimates only the random error in the error of measurement. The main target of the present study, the psychological distance between options, is considered to be more suitable for assessing the systematic error in the error of measurement than Cronbach’s α, intraclass correlation, and test–retest reliability.

The number of options has also been examined from the perspective of how participants feel when considering the appropriate option. Preston and Colman (2000) examined the following questions with the same participants: (a) ease of rating, (b) time required to select an answer, and (c) participants’ satisfaction with their ability to express their feelings. Their results suggested that 5 to 10 categories were easy to rate. In addition, 5 categories were evaluated as being short enough to select an answer quickly and 3 or 4 categories were evaluated as being complete enough for participants to express their feelings satisfactorily. Thus, these results indicate that a maximum of 5 categories is adequate for most scales.

Although the number of options has been considered from the viewpoints of researcher orientation, statistical reliability, and participant evaluation, no previous studies focused on the assumption of the original Likert scale—that is, psychological distance between categories is equal—when evaluating the appropriate number of options.

Many psychological scales include a neutral category, such as “Neither agree nor disagree,” to allocate equal psychological distance between the neutral category and the adjacent side categories in line with the assumption that the psychological distance between categories must be equal.

Wakita (2004) described a method for estimating the widths of each category (Figure 1) and showed that the widths were affected by the item contents. The widths were defined as W₁ = C₂−C₁, W₂ = C₃−C₂, W₃ = C₄−C₃, and it was shown that the psychological distances between each category were equal when W₁:W₂:W₃ = 1:1:1. W₁:W₂:W₃ was skewed when the item contents were negative; specifically, the width of the neutral category was significantly narrower than the widths of the other categories. However, this tool is not adequate for discussing the psychological distance between options. To discuss psychological distance in detail, we must obtain scale values for the categories shown in Figure 1.
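As a small illustration of the width ratio, the following Python sketch computes W₁:W₂:W₃ from four category boundaries of a 5-point item. The boundary values and the function names are hypothetical and are not taken from Wakita (2004); expressing each width relative to the mean width is only one convenient way to compare the result with 1:1:1.

```python
def category_widths(thresholds):
    """Return the widths W_k = C_{k+1} - C_k between consecutive boundaries."""
    return [hi - lo for lo, hi in zip(thresholds, thresholds[1:])]


def width_ratio(widths):
    """Express each width relative to the mean width (1:1:1 means equal spacing)."""
    mean = sum(widths) / len(widths)
    return [w / mean for w in widths]


# Hypothetical boundaries C1..C4 of a 5-point item (illustrative values only).
equal = [-1.5, -0.5, 0.5, 1.5]
skewed = [-1.5, -0.3, 0.3, 1.5]   # narrow neutral category

equal_ratio = width_ratio(category_widths(equal))
skewed_ratio = width_ratio(category_widths(skewed))
print([round(r, 2) for r in equal_ratio])
print([round(r, 2) for r in skewed_ratio])
```

In the second case the middle width is smaller than the outer two, which is the pattern described above for a narrow neutral category.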

Figure 1.

Calculating scale value (µ)

The present study presents a new formula for obtaining scale values that correspond to each original category in order to reveal the differences between these scale values and the original categories. We aimed to examine the appropriate number of categories for Likert scales, focusing on the psychological distance between categories, and clarify how the number of options affects this distance. For the purpose of this study, 4-, 5-, and 7-point scales were used for the same personality scale.

Method

Formula for Calculating Scale Values

Item response theory (IRT) was applied to calculate the scale values of each category in this study; specifically, the generalized partial credit model (GPCM; Muraki, 1992) was used. The new formula is based on the following two assumptions:

Assumption 1: In the Likert scale, a latent continuum is assumed to exist behind the categories, and this continuum is divided into intervals, one for each category. The border between adjacent categories is assumed to lie at the midpoint between them on the rating scale continuum (Figure 1). Thus, each category covers a certain range on the rating scale continuum; however, the two end categories are open-ended intervals.

Assumption 2: In the GPCM, the point at which the response curves of two adjacent categories intersect is defined by the category parameter. These intersections are placed on the borders of the rating scale continuum described in Assumption 1.
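Assumption 2 can be checked numerically. The Python sketch below assumes the parameterization z_k = a(θ − b + d_k) with d₁ = 0, which is a common way of writing the GPCM; the text above does not state the parameterization, so this form is an assumption. The function name, the slope a, and the location b are hypothetical; the category parameters are the BF-N values for the 4-point scale in Table 5. Under this form, adjacent categories k−1 and k should have equal probability at θ = b − d_k.

```python
import math


def gpcm_probs(theta, a, b, d):
    """Category probabilities under the GPCM.

    d holds the category parameters d_1..d_m with d_1 fixed at 0.
    Assumed parameterization: z_k = a * (theta - b + d_k).
    """
    cumulative, total = [], 0.0
    for d_k in d:
        total += a * (theta - b + d_k)
        cumulative.append(total)
    top = max(cumulative)  # subtract the maximum for numerical stability
    weights = [math.exp(c - top) for c in cumulative]
    norm = sum(weights)
    return [w / norm for w in weights]


a, b = 1.2, 0.3                      # hypothetical item slope and location
d = [0.0, 1.320, 0.001, -1.321]      # d_1 = 0, then BF-N 4-point values (Table 5)
crossings = [b - d_k for d_k in d[1:]]

for k, theta in enumerate(crossings, start=1):
    p = gpcm_probs(theta, a, b, d)
    # At each crossing, categories k and k + 1 are equally likely.
    print(f"theta={theta:+.3f}  P{k}={p[k - 1]:.4f}  P{k + 1}={p[k]:.4f}")
```

The crossing points do not depend on the slope a, which is why the category parameters alone can serve as borders on the rating scale continuum.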

Scale Value

If the scale values are normally distributed according to category parameters, the expectations of each interval are defined as scale values. For example, the expectation of the interval between −∞ and the first category parameter (C₁) is defined as the scale value of the first category (µ₁), and the expectation of the interval between the first category parameter (C₁) and the second category parameter (C₂) is defined as the scale value of the second category (µ₂). Therefore, in the case of

f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^{2}}{2}},

the scale value (µ_P) of the Pth category, which is the expectation over the interval [C_{P−1}, C_P], is obtained by

\mu_{P} = \int_{C_{P-1}}^{C_{P}} x \times \frac{f(x)}{\int_{C_{P-1}}^{C_{P}} f(x)\,dx}\,dx = \frac{f(C_{P-1}) - f(C_{P})}{\int_{C_{P-1}}^{C_{P}} f(x)\,dx}.

For the two end categories, C₀ is −∞ with f(C₀) = 0, and, when the number of categories is m, C_m for the mth category is +∞ with f(C_m) = 0. The resulting µ_P is defined as the scale value of the Pth category.
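The formula can be evaluated directly. The following Python sketch uses only the standard library; the function names are hypothetical. It takes the BF-N category parameters of the 4-point scale from Table 5 and treats the borders C_P as the negated category parameters. That sign convention is an assumption inferred from the reported values rather than stated in the text; with it, the output should agree with the scale values in Table 5 to about three decimals.

```python
import math


def pdf(x):
    """Standard normal density f(x); evaluates to 0 at +/- infinity."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def scale_values(thresholds):
    """Mean of a standard normal restricted to each interval [C_{P-1}, C_P]."""
    cuts = [-math.inf] + list(thresholds) + [math.inf]
    return [
        (pdf(lo) - pdf(hi)) / (cdf(hi) - cdf(lo))
        for lo, hi in zip(cuts, cuts[1:])
    ]


category_params = [1.320, 0.001, -1.321]      # BF-N, 4-point scale (Table 5)
thresholds = [-d for d in category_params]    # assumed sign convention
mu = scale_values(thresholds)
print([round(m, 3) for m in mu])
```

Because the numerator and the denominator change sign together, the same expression still returns a value between the two borders when adjacent category parameters are not ordered.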

Management of Number of Options

The Big Five Scale (Wada, 1996), a major personality scale that is commonly used with different numbers of options, was modified into three types of questionnaires with different numbers of options. From its subscales, 11 neuroticism items (BF-N) and 12 extraversion items (BF-extraversion normal [EN] and BF-extraversion reversed [ER]) were selected. BF-N comprises items that ask about socially negative attitudes, BF-EN asks about socially positive attitudes, and BF-ER asks about socially negative attitudes toward extraversion. For these items, 4-, 5-, and 7-point categories were used as follows: (a) a 4-point scale was adopted based on its frequency of use and participants’ satisfaction of expressing their feelings (Preston & Colman, 2000), (b) a 5-point scale was set up based on its frequency of use and ease of selecting an answer (Preston & Colman, 2000), and (c) a 7-point scale was set up based on the higher reliability of this number of options shown by Cicchetti, Showalter, and Tyrer (1985), Oaster (1989), and Preston and Colman (2000). These numbers of categories are commonly used in psychological and clinical research. The expressions of ratings for each scale are described in Table 1. The order of the items and other parts of the questionnaire were not changed.

Table 1.

Expressions of Ratings for Each Scale

Number of categories	Anchors
4	Disagree
	Slightly disagree
	Slightly agree
	Agree
5	Disagree
	Slightly disagree
	Neither agree nor disagree
	Slightly agree
	Agree
7	Strongly disagree
	Almost disagree
	Do not really disagree
	Neither agree nor disagree
	Do not really agree
	Slightly agree
	Strongly agree

Participants and Study Period

Participants comprised 772 undergraduate students. The questionnaire was completed anonymously, and a response to the questionnaire was considered to represent informed consent to participate in the study. Questionnaires were administered in the autumn semester of 2002.

Procedure

Questionnaires were randomly administered during a lecture (4-point scale, n = 258; 5-point scale, n = 254; 7-point scale, n = 260) with each participant answering one questionnaire (4-, 5-, or 7-point scale).

Results

Analysis

To compare the characteristics of each number of categories, the following three points were examined: (a) the mean and standard deviation (SD) of each subscale score, (b) the estimates of the coefficient of reliability (Cronbach’s α coefficient), and (c) the estimates of the scale value based on IRT. Subsequently, the relation between the conventional scale values and the estimated scale value (converted scale score) in (c) was examined. The estimates of scale values by IRT were obtained from the category parameters of the GPCM (Muraki, 1992), which were estimated with PARSCALE 4.1 (Muraki & Bock, 2003).

Descriptive Statistics (Mean and Standard Deviation)

Each option was assigned a consecutive integer item value (1 point, 2 points, and so on) from the first category to the last category, and the mean scores were used as the corresponding subscale scores. To compare these values, the scale scores of the 4-point scale and 7-point scale were adjusted to the same range as the 5-point scale (adjusted scale scores).¹ The results showed that the mean and SD of each subscale score were not significantly different except for the 7-point scale (Table 2). In the 7-point scale, the adjusted scale score was slightly lower than in the other two scales, and the SD was also slightly smaller.

Table 2.

Mean and Standard Deviation of Each Subscale Score

		BF-N			BF-EN			BF-ER
	Number of categories	N	M	SD	N	M	SD	N	M	SD
Conventional scale score	4	257	2.747	0.601	256	2.660	0.647	256	2.074	0.603
	5	252	3.517	0.799	254	3.317	0.803	254	2.552	0.791
	7	257	4.722	1.037	255	4.501	0.998	260	3.366	1.020
Adjusted scale score	4	257	3.434	0.752	256	3.325	0.809	256	2.593	0.754
	5	252	3.517	0.799	254	3.317	0.803	254	2.552	0.791
	7	257	3.373	0.741	255	3.215	0.713	260	2.404	0.729

Note: BF-N = Big Five Scale—neuroticism items; BF-EN = Big Five Scale—extraversion normal items; BF-ER = Big Five Scale—extraversion reversed items.
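The note defining the adjustment is not reproduced here. The adjusted values in Table 2 are, however, consistent with multiplying each conventional mean and SD by 5 divided by the number of categories. The Python sketch below checks this inferred rule for the BF-N means; the function name is hypothetical, and the rule itself is an assumption drawn from the table rather than a stated definition.

```python
def adjust_to_five_point(value, n_categories):
    """Rescale a mean or SD from an n-point scale by the factor 5 / n (inferred rule)."""
    return value * 5.0 / n_categories


# BF-N rows of Table 2: (number of categories, conventional mean, adjusted mean)
bfn_means = [(4, 2.747, 3.434), (5, 3.517, 3.517), (7, 4.722, 3.373)]

for n, conventional, reported in bfn_means:
    adjusted = adjust_to_five_point(conventional, n)
    print(n, round(adjusted, 3), reported)
```

Because the table entries are rounded to three decimals, agreement can only be expected to within about 0.001.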

Reliability

The estimates of reliability were obtained by using Cronbach’s α coefficient (Table 3). No subscale showed an obvious difference in α based on the number of categories.

Table 3.

Estimates of Reliability (Cronbach’s α Coefficient)

Number of categories	BF-N	BF-EN	BF-ER
4	.882	.865	.795
5	.889	.859	.795
7	.900	.858	.805

Note: BF-N = Big Five Scale—neuroticism items; BF-EN = Big Five Scale—extraversion normal items; BF-ER = Big Five Scale—extraversion reversed items.
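For reference, Cronbach’s α for k items is k/(k − 1) multiplied by one minus the ratio of the summed item variances to the variance of the total score. The Python sketch below applies this standard formula to a small set of hypothetical responses; the data and the function name are illustrative only and are unrelated to the values in Table 3.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha; responses holds one list of item scores per participant."""
    k = len(responses[0])
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)


# Hypothetical answers of five participants to three 5-point items.
data = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 4, 4],
    [3, 3, 4],
]
alpha = cronbach_alpha(data)
print(round(alpha, 3))
```

Since α depends only on variances and covariances of the integer item scores, it carries no information about whether the distances between categories are equal.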

Estimates of Scale Values for Each Category

The eigenvalues of the correlation matrix were examined to confirm the unidimensionality of each scale, which the IRT model assumes for the latent trait, and unidimensionality was confirmed in all subscales. The first and second eigenvalues and their ratios are shown in Table 4.

Table 4.

Eigenvalues of Each Subscale

Number of categories	Eigenvalue	BF-N	BF-EN	BF-ER
4	First (λ₁)	5.189	3.594	3.007
	Second (λ₂)	1.070	0.772	0.829
	λ₁/λ₂	4.849	4.656	3.628
5	First (λ₁)	5.362	3.562	3.013
	Second (λ₂)	1.008	0.779	0.870
	λ₁/λ₂	5.319	4.576	3.464
7	First (λ₁)	5.619	3.545	3.072
	Second (λ₂)	1.105	0.886	0.757
	λ₁/λ₂	5.087	4.000	4.059

Note: BF-N = Big Five Scale—neuroticism items; BF-EN = Big Five Scale—extraversion normal items; BF-ER = Big Five Scale—extraversion reversed items.

Then, the scale values (µ_P) of each category were calculated from the category parameters estimated by IRT. Only the BF-EN subscale in the 7-point scale was estimated from five items because no participants selected “Strongly disagree” for the second item. In addition, the resulting scale values were converted to the range of the conventional item values, that is, from 1 to 4, 1 to 5, or 1 to 7 points depending on the number of categories (converted item value). For instance, the converted scale values ranged from 1 to 4 points when the scale had four categories.² The category parameters in the GPCM, the scale values, and the converted scale values are shown in Table 5, and the converted scale values are plotted in Figures 2 to 4. In the BF-ER subscale of the 7-point scale, the fifth and sixth category parameters were not ordered.

Table 5.

Category Parameters in Item Response Theory and Scale Value of Each Category

		Number of Categories
		4			5			7
		Category parameter	Scale value (µ)	Converted scale values	Category parameter	Scale value (µ)	Converted scale values	Category parameter	Scale value (µ)	Converted scale values
BF-N	1	1.320	−1.787	1.000	1.176	−1.668	1.000	1.767	−2.169	1.000
	2	0.001	−0.571	2.020	0.354	−0.723	2.078	0.947	−1.284	2.169
	3	−1.321	0.570	2.978	−0.150	−0.100	2.790	0.509	−0.716	2.920
	4		1.788	4.000	−1.379	0.674	3.673	0.045	−0.272	3.507
	5					1.836	5.000	−1.271	0.530	4.568
	6							−1.996	1.565	5.936
	7								2.370	7.000
BF-EN	1	1.492	−1.932	1.000	1.641	−2.059	1.000	1.894	−2.280	1.000
	2	−0.066	−0.582	2.063	0.589	−1.018	1.986	1.467	−1.655	1.796
	3	−1.426	0.640	3.026	−0.470	−0.054	2.899	0.492	−0.905	2.753
	4		1.876	4.000	−1.761	0.973	3.873	−0.250	−0.116	3.761
	5					2.163	5.000	−1.546	0.782	4.905
	6							−2.057	1.763	6.157
	7								2.424	7.000
BF-ER	1	1.413	−1.865	1.000	1.419	−1.870	1.000	2.102	−2.464	1.000
	2	0.163	−0.692	1.910	0.698	−1.014	1.892	1.240	−1.573	2.192
	3	−1.576	0.549	2.872	−0.582	−0.051	2.896	0.628	−0.905	3.086
	4		2.004	4.000	−1.535	0.982	3.972	−0.605	−0.010	4.285
	5					1.968	5.000	−1.772	1.063	5.722
	6							−1.593	1.678	6.545
	7								2.018	7.000

Note: BF-N = Big Five Scale—neuroticism items; BF-EN = Big Five Scale—extraversion normal items; BF-ER = Big Five Scale—extraversion reversed items.
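The note defining the conversion is not reproduced here. The converted values in Table 5 are, however, consistent with a linear rescaling that maps the first scale value to 1 and the last to the number of categories. The Python sketch below applies this inferred rule to the BF-N scale values of the 4-point scale; the function name is hypothetical, and the rule is an assumption drawn from the table.

```python
def convert_scale_values(mu):
    """Linearly map scale values so that the first becomes 1 and the last len(mu)."""
    m = len(mu)
    span = mu[-1] - mu[0]
    return [1.0 + (m - 1) * (value - mu[0]) / span for value in mu]


mu_bfn_4 = [-1.787, -0.571, 0.570, 1.788]     # BF-N, 4-point scale (Table 5)
converted = convert_scale_values(mu_bfn_4)
print([round(c, 3) for c in converted])
```

Under equal psychological distance, the converted values would coincide with the conventional item values 1, 2, 3, and 4, so the gap between the two is a direct measure of the deviation discussed in the text.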

Figure 2.

Converted item values of the 4-category scale

Figure 3.

Converted item values of the 5-category scale

Figure 4.

Converted item values of the 7-category scale

In the 4-point scale shown in Figure 2, all converted scale values were close to the conventional item values. In the 5-point scale shown in Figure 3, most converted scale values were also close to the conventional item values, except for the fourth category of BF-N. In contrast, in the 7-point scale shown in Figure 4, half of the converted scale values deviated from the conventional item values. For instance, the fourth and fifth categories of BF-N were smaller than their conventional item values, and the fifth and sixth categories of BF-ER were disproportionately close to 7. Consequently, only the IRT-based evaluation, as shown in the figures of the converted scale values, revealed that the psychological distance between categories was affected by the number of options.

Comparison of Conventional Scale Scores and the Converted Scale Scores

The conventional scale scores were calculated by summing the item scores that assigned an integer value to each option, and the converted scale scores were calculated from the converted scale values. The absolute difference between the two scores was then examined with descriptive statistics (Table 6). The results indicated that the 4- and 5-point scales showed only slight mean differences of less than 0.15, whereas the 7-point scale showed larger differences in the BF-N and the BF-ER. The coefficient of correlation between the conventional and the converted BF-N scores was lowest in the 7-point scale (i.e., 0.993).

Table 6.

Difference Between Conventional and Converted Item Values^a

		BF-N		BF-EN		BF-ER
	Number of Categories	M	SD	M	SD	M	SD
Absolute difference between conventional and converted item values	4	0.007	0.005	0.032	0.017	0.067	0.033
		(0.000-0.022)		(0.000-0.063)		(0.000-0.128)
	5	0.142	0.080	0.074	0.033	0.064	0.028
		(0.000-0.316)		(0.000-0.127)		(0.000-0.108)
	7	0.224	0.119	0.130	0.074	0.260	0.134
		(0.000-0.465)		(0.000-0.246)		(0.000-0.649)

Note: BF-N = Big Five Scale—neuroticism items; BF-EN = Big Five Scale—extraversion normal items; BF-ER = Big Five Scale—extraversion reversed items.

^a Values in parentheses represent ranges.

Finally, the correlations between the items, which influence factor analysis and structural equation modeling, were compared for BF-N, which showed the largest difference in psychological distance between the 4- and 7-point scales. The maximum absolute difference in the correlation between items was 0.181 (between Items 7 and 10), and the minimum was 0.002 (between Items 4 and 8).

Discussion

Method for Evaluating Psychological Distance

To clarify how the number of categories influences psychological distance in the Likert scale, 4-, 5-, and 7-point versions of the same psychological scale, with the same instructions, were compared. Moreover, this study proposed a new method for measuring the scale values to examine the distance between categories.

Our new IRT method, which is based on the method reported by Wakita (2004), made it possible to discuss the number of categories in the Likert scale in terms of the psychological distance between options, whereas previous studies discussed it from the perspective of estimates of the reliability coefficient.

Descriptive Statistics (Mean and Standard Deviation)

The descriptive statistics suggest that in the 7-point scale, the participants tended to select somewhat negative answers, such as “Disagree,” and that they avoided selecting both ends of categories. These tendencies might imply that an increase in the number of options biases respondents against answers containing the strongest expressions.

Reliability

The coefficient of reliability was independent of the number of categories in this study, a finding that is consistent with previous studies showing that the appropriate number of categories cannot be determined based on the estimates of the coefficient of reliability (Bendig, 1953, 1954; Brown et al., 1991; Komorita, 1963; Matell & Jacoby, 1971).

Estimates of Scale Values for Category (Converted Scale Values Obtained by IRT)

A comparison of the numbers of categories indicated that the psychological distance deviated more as the number of categories increased in the BF-N and BF-ER subscales. In the 5-point scale, deviation in the psychological distance was seen, especially in the BF-N subscale. In the 7-point scale, deviation was seen in all the three subscales; however, the psychological distance deviated more in the BF-N and BF-ER subscales than in the BF-EN subscale.

In this study, the number of categories did not influence the descriptive statistics and the estimates of the reliability coefficient, but it did influence the item values. Consequently, the psychological distance estimated from the converted item values obtained by IRT deviated more in the 7-point scale than in the 4- and 5-point scales. In addition, the 7-point scale did not function well because of the reversal of the category parameters shown in Table 5. Furthermore, this deviation was greater when items asked about socially negative personality traits, as in the BF-N and the BF-ER. In short, these results imply that an attempt to set a neutral category such as “Neither agree nor disagree” between positive and negative categories did not accomplish the intended purpose. These results suggest that it was not necessary to adopt the 7-point scale, which requires more time, and that the psychological distance was sensitive to items with socially negative contents. The latter suggestion supports the following two perspectives based on statistical evidence. First, it is recommended that the words of a rating scale be carefully considered when asking participants to rate contents as reversed items. Second, self-reported questionnaires using the Likert scale are absolutely affected by the bias of social desirability.

Our study not only identified weak points in Likert scales but also suggested a practical method for developing new questionnaires and modifying established items. The new method presented here demonstrated the inequality in the psychological distance of the Likert scale. When developing new scales, our IRT method makes it possible to check the equality of the psychological distance between options and thereby to select suitable expressions for anchors and an appropriate number of options. For example, scale values could show whether increasing the number of positive ratings, as in a scale such as “Disagree,” “Slightly agree,” “Somewhat agree,” “Moderately agree,” and “Strongly agree,” would correct the deviation of the responses when the items might be influenced by social desirability. Such manipulation has not yet been used but is necessary to support the important original assumption of the Likert scale that the psychological distance between categories is equal.

Limitations and Future Direction

This study aimed to examine whether the number of options had an effect on the psychological distance in the Likert scale by applying IRT to consider the appropriate number of options. The results of the IRT analysis indicated that the number of options had an effect on the response, especially in the 7-point scale. However, this study assessed only one major psychological (personality) scale using the anchors shown in Table 1. In addition, participants were all undergraduate students. Surveys using other scales and in other populations are needed before the results can be generalized.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Notes

References

Bendig, A. W. (1953). The reliability of self-ratings as a function of the amount of verbal anchoring and the number of categories on the scale. Journal of Applied Psychology, 37, 38-41.

Bendig, A. W. (1954). Reliability and the number of rating scale categories. Journal of Applied Psychology, 38, 38-40.

Boote, A. S. (1981). Reliability testing of psychographic scales: Five-point or seven-point? Anchored or labeled? Journal of Advertising Research, 21, 53-60.

Brown, Wilding, R. E., & Coulter, R. L. (1991). Customer evaluation of retail salespeople using the SOCO scale: A replication extension and application. Journal of the Academy of Marketing Science, 9, 347-351.

Chang (1994). A psychometric evaluation of four-point and six-point Likert-type scales in relation to reliability and validity. Applied Psychological Measurement, 18, 205-215.

Cicchetti, D. V., Showalter, & Tyrer, P. J. (1985). The effect of number of rating scale categories on levels of inter-rater reliability: A Monte-Carlo investigation. Applied Psychological Measurement, 9, 31-36.

Garner, W. R. (1960). Rating scales, discriminability and information transmission. Psychological Review, 67, 343-352.

Green, P. E., & Rao, V. R. (1970). Rating scales and information recovery: How many scales and response categories to use? Journal of Marketing, 34, 33-39.

Komorita, S. S. (1963). Attitude content, intensity, and the neutral point on a Likert scale. Journal of Social Psychology, 61, 327-334.

Lissitz, R. W., & Green, S. B. (1975). Effect of the number of scale points on reliability: A Monte-Carlo approach. Journal of Applied Psychology, 60, 10-13.

Matell, M. S., & Jacoby (1971). Is there an optimal number of alternatives for Likert scale items? Study 1: Reliability and validity. Educational and Psychological Measurement, 31, 657-674.

Muraki (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176.

Muraki, & Bock, R. D. (2003). PARSCALE: Parameter Scaling of Rating Data [Computer program]. Chicago, IL: Scientific Software.

Oaster, T. R. F. (1989). Number of alternatives per choice point and stability of Likert-type scales. Perceptual and Motor Skills, 68, 549-550.

Preston, C. C., & Colman, A. M. (2000). Optimal number of response categories in rating scales: Reliability, validity, discriminating power, and respondent preferences. Acta Psychologica, 104, 1-15.

Schuts, H. G., & Rucker, M. H. (1975). A comparison of variables configurations across scale lengths: An empirical study. Educational and Psychological Measurement, 35, 319-324.

Wada (1996). Construction of the Big Five Scales of personality trait terms and concurrent validity with NPI. Japanese Journal of Psychology, 67, 61-67.

Wakita (2004). The distance between categories in rating-scale method: Applying item response model to the assessment process. Japanese Journal of Psychology, 75, 331-338.