Abstract
This study used a novel research approach to investigate the effects of unlabeled response scales on response distributions. Instead of responding to standard questionnaire items, respondents were asked to report given judgments on either semantic-differential (SD) or agree-disagree (AD) response scales, thereby showing the extent to which respondents agree upon where to place given judgments. Results from a survey-based study (N = 418) show that respondents to a large extent disagree about where to place judgments on the response scale; the level of agreement for different judgment intensities ranged from 42% to 82%, and the level of agreement was lower for AD than for SD response scales. The low levels of agreement contribute to non-substantive variance in the data, which increases the risk of attenuated or inflated correlations between constructs. Moreover, simulations of actual response distributions suggest that unlabeled response scales may lead to a strong bias in the form of underestimated shares of positive answers. Implications of using unlabeled response scales for research and marketing research practice are discussed, and it is recommended that response categories on SD and AD items should always be labeled since this will reduce non-substantive variance and bias in the data.
Introduction
Questionnaire item response categories have attracted considerable research attention. There are studies of response scale length (e.g., Dawes, 2008), verbal compared to numerical response category labels (e.g., Bendixen & Yurova, 2012), verbal compared to visual labels (e.g., Gummer et al., 2020), and verbal compared to no labels (e.g., Krosnick & Berent, 1993). Previous research has found that labeling each response category (full labeling) improves the validity (e.g., Krosnick & Berent, 1993), reliability (e.g., Menold et al., 2014; Peters & McCormick, 1966), and statistical power (e.g., Eutsler & Lang, 2015) of questionnaire items compared to unlabeled response categories. However, some studies have failed to find positive effects of full labeling on validity (e.g., Andrews, 1984) and reliability (e.g., Churchill & Peter, 1984). Thus, results in previous research are mixed, which points to a need for further research to better understand the effects of labeling or not labeling response categories in questionnaire-based research. Moreover, previous research has been based on data collections in which respondents answered questionnaires as they would normally do and has focused on aggregate measures of validity and reliability, which limits opportunities to fully understand response behavior for unlabeled items.
This study aims to provide a deeper understanding of how the absence of response scale labels affects respondents’ questionnaire responses. The focus is on how respondents report a given answer by selecting a response category on an unlabeled response scale. Using a novel research approach, in which respondents were instructed to report a certain judgment, the study reduces the effects of response category labels on the other steps in the question-answering process (e.g., interpretation of the question, see Tourangeau & Rasinski, 1988). Thus, the study investigates whether respondents agree upon which category to select on an unlabeled response scale for certain answers and whether the level of agreement is influenced by the type of response scale and factors in the judgment being reported such as the polarity (positive vs negative) or its intensity. The study included two of the most common response scale types in market research and academic studies: Semantic-differential (SD) and agree-disagree (AD). The study makes a theoretical contribution by demonstrating the effects on response behavior of unlabeled response scales and a practical market research contribution by showing how research results such as response distributions are influenced by using unlabeled response scales.
Theoretical background
Unlabeled response scales
Respondents go through a four-step process when answering questions in questionnaires: (1) interpretation of the question, (2) retrieval of relevant information, (3) rendering of a judgment, and (4) reporting of an answer (Tourangeau, 1984, 1987; Tourangeau & Rasinski, 1988). Questionnaire answers are the outcome of the difficulty of the task at each step and the ability and motivation of respondents to fulfill these tasks (Krosnick, 1991). Response labels are important mainly for Step 1, the interpretation of questions (e.g., Mantonakis et al., 2017; Schwarz et al., 1991), and for Step 4, the reporting of answers (e.g., Moors et al., 2014; Weijters et al., 2010), although they are also likely to affect Steps 2 and 3 (Tourangeau, 1987).
The focus in the present context is on the reporting of answers. For most questionnaire items, this step requires respondents to select one response category from several fixed response alternatives. This involves assigning an unobserved judgmental value (rendered in Step 3), which until this point exists only in the mind of the respondent, to one of the response categories (Givon & Shapira, 1984). Normally, respondents should select the response category that is perceived to be closest to the unobserved judgmental value. Thus, if the response categories are labeled, the respondent would choose the category whose label corresponds most closely to the judgmental value. For example, if the respondent believes a brand is “quite good,” she or he would select the response category labeled “quite good,” if such a category exists, or the category perceived to be closest to “quite good” (e.g., “very good”). However, if the individual response categories are unlabeled, as is the case for response scales with endpoint labels only, respondents have to infer the intensity of the response categories from contextual factors such as the number of response categories and endpoint labels. This increases the risk of non-substantive variance in the data collected with unlabeled response scales: if respondents make different inferences about the response categories that correspond to certain judgments, their scores on the question will reflect not only the “true score” on the question but also differences in their response category inferences (Weijters et al., forthcoming).
In line with this, previous research has found that full verbal labels, compared to endpoint labels, tend to increase the reliability and validity of measurement items (Krosnick, 1999). For example, Krosnick and Berent (1993) found that full verbal labels significantly improved test–retest reliability and criterion-related validity for questions relating to political opinions, and Peters and McCormick (1966) found that internal reliability (coefficient alpha) increased when job–task-related questions were given with full verbal labels compared to numerical labels (see also Eutsler & Lang, 2015; Menold, 2020; Menold et al., 2014). However, some studies have failed to find positive effects on the validity and reliability of full labeling. For example, a large-scale study involving six surveys with a total of 106 measures of 26 constructs found that fully labeled response scales caused weak negative effects on construct validity, and random and correlated error variance (Andrews, 1984). Similarly, a meta-analysis of 101 studies in leading marketing journals comprising 154 measures found that reliability was higher for unlabeled response scales compared to labeled (Churchill & Peter, 1984). Thus, findings are mixed concerning the effects of labels on the validity and reliability of measures, which suggests that certain factors are moderating the effects of labels on research outcomes.
To date, research on response scale labels has relied on “normal” data collections in which respondents have answered several (usually standard) questions. This methodology makes it impossible to separate the effects of labels on the different steps in the four-step answering process, since only the outcome of the entire four-step process is recorded. This study reduces or removes the effects of Steps 1, 2, and 3 by asking respondents to report a given judgment, thereby focusing on Step 4 of the answering process (although the instructions will be subject to interpretation similar to that normally found in Step 1 of the process). Moreover, although some studies have investigated factors that moderate the amount of non-substantive variance (e.g., Krosnick & Berent, 1993, found positive effects of full labels only for respondents with a low level of education), there appears to be no study that has examined whether the properties of the judgment (e.g., intensity or polarity) have an effect on the amount of non-substantive variance or contribute to bias in the data.
This study investigated the following four research questions related to unlabeled response scales:
RQ1. Do respondents infer the same response category for a given judgmental value on an unlabeled response scale?
RQ2. Are there differences in the level of agreement for SD and AD response scales?
RQ3. Does the intensity (“extremely,” “quite,” “slightly”) of the judgment moderate the inference of response categories?
RQ4. Does the polarity (negative, positive) of the judgment moderate the inference of response categories?
Semantic-differential and agree-disagree items
SD and AD items are frequently used in academic research in marketing (Bergkvist & Langner, 2017) and they are also frequently used in market research. Their frequent use makes them suitable candidates for investigating the effects of response scale labeling as the results apply to a wide range of research situations. Moreover, the use of AD items is controversial in academic research and the validity of AD items has been subject to debate (see the brief overview in Dolnicar, 2020; see also Rossiter, 2002). A central concern in this validity debate is how the response scales are interpreted by respondents, which points to the need for a better understanding of this interpretation.
SD items were developed to measure the meaning of objects and came into use in the early 1940s (Stagner & Osgood, 1941). SD items are bipolar with opposite adjectives (e.g., “good-bad,” “beautiful-ugly”) at the endpoints of a 7-point response scale. Early forms of SD items were preceded by instructions which defined the meaning of the mid-point response category as “neutral” and the other categories, on both sides of the mid-point, as “slightly,” “quite,” and “extremely” (Osgood et al., 1957). Toward the end of the 1960s, it had become standard practice to include these labels above each response category (Heise, 1969). However, in the 1970s, the labels for the response categories became less frequently used, and in current academic research it is common practice to use SD items without labels for the response categories (see Figure 1, Panels A and B, for examples of SD items with and without response category labels).

Figure 1. Different response formats for semantic-differential and agree-disagree items.
Questionnaire items with AD response scales are popular among academics and they have a long history in academic research (see Likert et al., 1934, for an early example). AD items are also popular with marketing practitioners and used in a broad range of customer and market surveys. The current use of AD items includes fully labeled response scales (Panel C in Figure 1), unlabeled response scales (Panel D in Figure 1), and partially labeled response scales in which some but not all of the response categories are labeled.
AD items are unipolar as they capture the presence or absence of a trait or agreement or disagreement with a stated position (Moors et al., 2014). However, AD items are frequently used to measure attributes that are inherently bipolar such as “good” or “clean.” These attributes make up one-half of what linguists call antonymous adjectives, that is, pairs of adjectives that are polar opposites and refer to a common underlying quality (see, for example, Ljung, 1974; Rotstein & Winter, 2004). For example, “good” and “bad” are opposites that refer to the same underlying quality in an object and if a person is asked to evaluate whether an object is “good” their evaluation tends to include “bad,” if the latter applies to the object in question. This means that AD items, unlike SD items, include only half of the evaluative scale respondents use when evaluating the attributes underlying antonymous adjectives if respondents interpret the AD item literally. For example, a literal interpretation of the item “the hotel room was clean” entails that selecting “disagree” means that the room was “not clean” (although it is far from clear what different degrees of disagree mean). However, some respondents may interpret AD items in such a way that selecting “disagree” in response to an item stating that an object has a certain trait (e.g., “good,” “clean”) means that the object has the polar opposite trait (e.g., “bad,” “dirty”) because trait-related cognitions elicit both adjectives in the pair of antonymous adjectives (i.e., the AD response scale becomes a de facto SD response scale), as implied by linguists’ theories of antonyms (e.g., Ljung, 1974). This means that a “disagree” response could mean either that the trait stated in the item is absent or that its polar opposite is present, which makes interpretation of the results challenging and could introduce non-substantive variance in data collected using AD items (see also discussion in Rossiter, 2002).
The extent to which respondents interpret AD response scales literally or as de facto SD scales has received limited, if any, empirical research attention and appears not to be known. Therefore, this study investigated the following research question related to AD response scales:
RQ5. Do respondents treat AD response scales as de facto SD response scales when asked to evaluate inherently bipolar attributes?
Also, the study investigated the following overall research question:
RQ6. Does the absence of response category labels create biases in the data collected?
Method
This study focuses exclusively on the fourth step, answer reporting, of the questionnaire response process through an experimental research design in which participants were instructed to report a specific judgment. Thus, the focus is on how respondents map their answers onto one of the available response options. The study was based on a 2 (response scale: SD, AD; between-subjects) by 2 (judgment polarity: negative, positive; within-subject) by 3 (judgment intensity: “slightly,” “quite,” “extremely”; within-subject) experimental design. The dependent variable was the response category respondents selected to report the given judgments.
The data were collected during the third wave of an unrelated longitudinal study using a sample of the UK adult (18+ years) population in an online panel (Made in Surveys; https://en.misgroup.io/). The total sample size was 418, with 209 respondents in each of the SD and AD groups. The mean age of respondents in this study was 46.0 years (SD = 14.96), and 51.9% were women. In comparison, the mean age of the UK adult (18+ years) population in 2018 was 48.9 years (SD = 21.70), and 50.6% were women (based on estimates from data available from Office for National Statistics, 2020b). The age difference was at least partly the result of an absence of people over the age of 79 years in the sample. The average annual income in the sample was GBP 37,228 (SD = 24,065; median = 31,688), which is slightly higher than the UK average of GBP 35,900 per year in 2019 (Office for National Statistics, 2020a). Thus, overall, the sample was quite similar to the UK population in terms of age, gender, and income.
The main study was unrelated to this study and its questions were not expected to influence responses to the questions pertaining to this study. The main study, which preceded this study in the questionnaire, included 40 closed-ended items measuring the meaning of various words, attitudes toward different types of companies and brands (e.g., small businesses, global corporations), and lifestyles. The median completion time for the full questionnaire was 7 min and 35 s.
There were two versions of the questionnaire: one with a 7-point SD response scale and one with a 7-point AD response scale. There were six questions in each questionnaire, which were preceded by identical instructions in both questionnaires. The instructions told respondents to imagine that they were answering a survey and that they should select the response alternative they would select if they wanted to indicate a certain judgment on the following question. For example, one of the instructions asked respondents to evaluate a hotel as “extremely dirty”:
Imagine that you are answering a survey and that you have been asked to evaluate a hotel called Hotel X. Select the response alternative that you would choose if you wanted to indicate that the hotel was “extremely dirty” on the following question:
In the SD version of the questionnaire the instruction was followed by a typical SD item:
Below you will find a pair of adjectives. Indicate how well one or the other adjective in each pair describes your opinion of Hotel X.
Clean ○ ○ ○ ○ ○ ○ ○ Dirty
In the AD version the instruction was followed by a typical AD item:
Indicate the extent to which you agree or disagree with the following statement: Hotel X is clean.
Agree ○ ○ ○ ○ ○ ○ ○ Disagree
The decision to use response scales without endpoint qualifiers such as “extremely,” “very,” or “strongly” was motivated by the fact that endpoints without qualifiers are commonly used in research (Bergkvist & Langner, 2017) and that it is not clear how respondents interpret these qualifiers (see discussion in Bergkvist & Langner, 2020).
There were six different objects in the questionnaires, three positive judgments, three negative judgments, and three degrees of judgment intensity (“extremely,” “quite,” “slightly”; Table 1). The judgment intensities were chosen as they correspond to the original labels on the SD response scales (Osgood et al., 1957) and research has demonstrated that they are perceived as equidistant in terms of intensity (Cliff, 1959). Respondents were randomly assigned to the SD or AD questionnaires and the question order was randomized.
Table 1. Objects and judgments in the questionnaire questions.
AD: agree-disagree; SD: semantic-differential.
The study did not include any attention check to control for lacking participant attention (Paas et al., 2018), which may have contributed to some noise in the data. However, analysis of the data showed that there was only a limited number of respondents who straight-lined their answers; a total of 12 respondents selected the same response category for all questions (seven in the SD item sample and five in the AD item sample). The analyses reported in the “Results” section were run both with and without the respondents who straight-lined and there were no substantive differences in the results (cf. Gummer et al., forthcoming). The “Results” section reports only the results without the respondents who straight-lined (the net sample sizes were 202 and 204, respectively, in the SD and AD item samples).
The verbal judgments should correspond to one specific response category on the two 7-point response scales if respondents consider the mid-point as neutral and distinguish the degree of intensity between the judgment categories. For example, the positive “quite” judgment should correspond to the second response category on the positive end of the response scale. Using the traditional SD coding of the response scale from +3 to −3, the response categories corresponding to the judgments are +3 (“extremely” positive), +2 (“quite” positive), +1 (“slightly” positive), −1 (“slightly” negative), −2 (“quite” negative), and −3 (“extremely” negative). In the analysis, responses were coded as “correct” if the respondent had selected the response category that corresponded to the judgment, for both the SD and AD response scales.
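The coding rule described above can be illustrated with a minimal Python sketch. It is not the code used in the study; the function and variable names are hypothetical, and the sketch simply assumes that positive judgments map to +1 to +3 and negative judgments to −1 to −3, as stated in the text.

```python
# Illustrative only: names are hypothetical, not taken from the study.
# Steps from the neutral mid-point (0) for each judgment intensity,
# following the +3 to -3 coding described in the text.
INTENSITY_STEPS = {"slightly": 1, "quite": 2, "extremely": 3}


def expected_category(polarity: str, intensity: str) -> int:
    """Return the scale position (+3..-3) that matches a given judgment."""
    if polarity not in ("positive", "negative"):
        raise ValueError(f"unknown polarity: {polarity!r}")
    sign = 1 if polarity == "positive" else -1
    return sign * INTENSITY_STEPS[intensity]


def is_correct(polarity: str, intensity: str, selected: int) -> int:
    """Code a response as 1 ("correct") or 0 ("wrong")."""
    return int(selected == expected_category(polarity, intensity))
```

For example, a respondent who places the positive “quite” judgment on +1 would be coded 0, because the expected category for that judgment is +2.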
The first four research questions were addressed using binary logistic regression and comparisons of response distributions and proportions. The data file for the logistic regression was created by treating the six judgments made by each respondent as separate observations and coding each judgment (i.e., row in the data file) for the type of response scale (SD, AD), “quite” judgment (other, “quite”), “slightly” judgment (other, “slightly”), and polarity of judgment (positive, negative). The dependent variable in the binary logistic regression model was the type of response (0 = “wrong,” 1 = “correct”) and the independent variables were AD (0 = SD; 1 = AD), “slightly” judgment (0 = other; 1 = “slightly”), “quite” judgment (0 = other; 1 = “quite”), negative polarity of judgment (0 = positive; 1 = negative), and the two-way interactions. The binary logistic regression was run using the robust procedure in Stata to compensate for heteroscedasticity in the data. The comparison of response distributions relied on tests of difference in proportions (i.e., one-sample z-test; two-sample z-test).
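The structure of the data file described above can be sketched as follows. This is a hedged illustration, not the authors' Stata code: the field names are hypothetical, and the set of two-way interactions shown is an assumption (the “slightly” × “quite” product is omitted because it is always zero).

```python
from itertools import product

# Illustrative only: field names and the interaction set are assumptions.
POLARITIES = ("positive", "negative")
INTENSITIES = ("extremely", "quite", "slightly")


def design_row(scale, polarity, intensity, correct):
    """Build one observation (one judgment by one respondent)."""
    ad = int(scale == "AD")                # 0 = SD, 1 = AD
    slightly = int(intensity == "slightly")  # 0 = other, 1 = "slightly"
    quite = int(intensity == "quite")        # 0 = other, 1 = "quite"
    negative = int(polarity == "negative")   # 0 = positive, 1 = negative
    return {
        "correct": correct,                # dependent variable (0/1)
        "ad": ad,
        "slightly": slightly,
        "quite": quite,
        "negative": negative,
        # Two-way interactions as products of the dummy variables.
        "ad_x_slightly": ad * slightly,
        "ad_x_quite": ad * quite,
        "ad_x_negative": ad * negative,
        "slightly_x_negative": slightly * negative,
        "quite_x_negative": quite * negative,
    }


def to_long(respondents):
    """Turn each respondent's six judgments into six separate rows."""
    rows = []
    for resp in respondents:
        for polarity, intensity in product(POLARITIES, INTENSITIES):
            correct = resp["correct"][(polarity, intensity)]
            rows.append(design_row(resp["scale"], polarity, intensity, correct))
    return rows
```

With six rows per respondent, the 406 respondents retained after excluding straight-liners yield 6 × 406 = 2436 observations, which matches the N reported for the regression model.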
Results
Binary logistic regression
The binary logistic regression model was overall significant (Wald χ2 = 268.59; p < .001) with a pseudo R2 of .093. The B coefficients showed that the AD response scale and the “slightly” and “quite” judgments each had a significant negative effect on the likelihood of selecting the “correct” response category, while the interaction between the AD response scale and “quite” had a significant positive effect (Table 2). Thus, the level of agreement was lower for the AD than the SD response scale (RQ2) and for the “slightly” and “quite” judgments, compared to the “extremely” judgment (RQ3). There was no support for a main effect of polarity on the level of agreement (RQ4). Moreover, the interaction effect showed that agreement increased for the combination of the AD response scale and a “quite” judgment.
Table 2. B coefficients and average marginal effects (dy/dx) in the binary logistic regression model with response as dependent variable (N = 2436).
AD: agree-disagree.
Agreement about the “correct” response category
Each judgment intensity and polarity corresponded to a response category (e.g., “extreme” positive = “+3”; “quite” negative = “–2”), which made it possible to calculate the proportion of answers in the “correct” response category for each judgment. Overall, there was limited agreement among respondents about what response category to select for the different judgments (Table 3). The proportions of responses were the highest in the “correct” category for all judgments, except for “quite” positive for which there were more responses in the “+1” category than in the “correct” “+2” category. The proportion of responses in the most frequently selected categories ranged between about 40% and 80% for the SD response scale and between about 30% and 70% for the AD scale, thus demonstrating that respondents had quite divergent views on what response category corresponds to a given judgment (RQ1).
Table 3. Share of responses in the “correct” response category for SD and AD response scales.
AD: agree-disagree; SD: semantic-differential.
The highest share of responses for “quite” SD (49.3%) was in the (“incorrect”) “+1” response category.
The highest share of responses for “quite” AD (41.1%) was in the (“incorrect”) “+1” response category.
The SD response scale had consistently higher levels of agreement than the AD scale, except for the two “quite” judgments which were more or less equal. The share of responses in the “correct” category was significantly higher for the SD response scale than the AD scale for the two “extremely” and “slightly” response categories (two-sample z-tests; p < .05). Thus, there was a lower level of agreement for the AD response scale than the SD response scale (RQ2).
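A comparison of two proportions of this kind can be sketched with a two-sample z-test. The sketch below is illustrative only: it assumes a pooled standard error and a two-sided p-value, which the text does not specify, and the function name is hypothetical.

```python
import math


def two_sample_z(successes_a, n_a, successes_b, n_b):
    """Two-sample z-test for a difference in proportions.

    Assumption: pooled standard error and two-sided p-value; the
    article does not state which variant was used.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

A call such as `two_sample_z(160, 202, 140, 204)` would compare a share of “correct” responses in the SD sample with one in the AD sample; the group sizes are the net sample sizes reported in the “Method” section, whereas the counts of 160 and 140 are hypothetical and are not taken from Table 3.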
Not surprisingly, the level of agreement was the highest for the two “extremely” categories, with most responses in the “+3” and “–3” response categories (i.e., the “correct” categories). The level of agreement for the two “slightly” judgments (range between 50% and 70%) was significantly lower in both conditions than for the “extremely” judgments (two-sample z-tests; p < .05), except for the positive judgment on the AD response scale. The level of agreement for the two “quite” judgments (range between 30% and 45%) was significantly lower than for the corresponding “extremely” and “slightly” judgments (two-sample z-tests; p < .05), except for the “quite” negative judgment on the AD response scale compared to the corresponding “slightly” judgment. Thus, the intensity of the judgment had a marked effect on the level of agreement, with substantial differences between the three intensity levels, although the relationship was U-shaped rather than linear (RQ3). The higher agreement for the “extremely” judgments was expected, as the extreme response categories would be the natural choice for these, while it is harder to explain the differences between the “quite” and “slightly” judgments.
The level of agreement was not consistently higher or lower for positive judgments compared to negative judgments (RQ4).
These results are, as expected, in line with the results in the binary logistic regression and they demonstrate the differences in the levels of agreement between the two response scale types and for the different intensity levels.
Response distributions
The response distributions across the entire response scales provide a fuller picture of where respondents placed the given judgments (see Table 4 for the SD scale and Table 5 for the AD scale).
Table 4. Response distributions across the different judgment intensities for SD response scales.
The “correct” response category is indicated by bold font.
Table 5. Response distributions across the different judgment intensities for AD response scales.
The “correct” response category is indicated by bold font.
The two “extremely” judgments had the non-“correct” responses scattered across the response scales without any clear tendencies, except for the positive “extremely” judgment on the AD response scale which had a significantly higher share of responses (15.2%) in the response category adjacent to the “correct” category compared to the other non-adjacent categories (one-sample z-test; p < .05).
The highest proportion of responses for the two positive “quite” judgments were in the “wrong” response category (“+1” instead of in “+2”), although a substantial share of the responses, 35% and 29%, were in the “correct” response category (“+2”). There was also a large share of responses in the extreme response categories adjacent to the “correct” categories, particularly for the AD response scale. The two negative “quite” judgments had the highest proportion of responses in the “correct” response category (“–2”), although both had a substantial share of responses in the “–1” and “–3” categories.
For the two “slightly” judgments the non-“correct” responses were clustered on the two adjacent response categories (i.e., “+2,” “–2” and “0”), with between 14% and 16% of responses in the more extreme categories (i.e., “+2” and “–2”) and between 6% and 10% in the less extreme category (i.e., “0”). However, it was only the proportions in the more extreme categories that were significantly higher than the proportions in the remaining categories (one-sample z-tests; p < .05).
If respondents treated AD response scales as de facto SD response scales (RQ5), this should be evident in the response distributions for the negative judgments, since a literal interpretation of the AD response scale makes it unclear where to place negative judgments, while an SD interpretation would distinguish degrees of negative judgments. The response distributions for the negative judgments suggest that a majority of respondents interpreted the response scale as an SD scale: The response distributions on the AD response scale are similar to those on the SD response scale, with a majority of responses in the “correct” response category, although the shares were significantly lower for the “slightly” and “extremely” judgments on the AD scale (two-sample z-tests; p < .05). However, there were also significantly (two-sample z-tests; p < .05) higher shares on “–3” for the negative “slightly” and “quite” judgments on the AD response scale compared to the SD scale. There was also a tendency toward higher shares on “+3” for the positive “slightly” and “quite” judgments on the AD response scale compared to the SD scale, and a similar tendency with higher shares on “0” on the AD than the SD scale for both negative and positive “slightly” and “quite” judgments, although most of these differences did not reach statistical significance.
Effects on actual response distributions
The response distributions for the given judgments were used to simulate what the effects of unlabeled response scales would be on real questionnaire data collection. The simulations assumed 1100 respondents answering a question (e.g., brand attitude measured on a “good-bad” SD response scale or “X is a good brand” on an AD response scale). The distribution of “true” judgments (i.e., the hypothetical set of judgments respondents were assumed to have wanted to map onto the response scale) was assumed to be heavily skewed toward positive responses, with a majority of responses on “quite” and “slightly” positive, and a relatively small share of responses on the negative half of the response scale (i.e., a distribution similar to what a well-liked brand would obtain in a survey). The simulation had to assume zero neutral responses as neutral was not included among the given judgments in the study. The “actual” response distributions were estimated by multiplying the number of “true” judgments with the response frequencies for the corresponding judgment in the study (Tables 4 and 5).
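The simulation logic described above, in which the number of “true” judgments is multiplied by the observed placement frequencies, can be sketched as follows. This is an illustrative sketch rather than the authors' procedure: the function names are hypothetical, and any placement shares passed in would have to come from Tables 4 and 5, which are not reproduced here.

```python
# Illustrative only: names are hypothetical, not taken from the study.
CATEGORIES = (3, 2, 1, 0, -1, -2, -3)


def simulate_observed(true_counts, placement):
    """Distribute "true" judgments across the response categories.

    true_counts: {judgment category: number of respondents holding it}
    placement:   {judgment category: {selected category: share}}, where
                 the shares for each judgment sum to 1.
    """
    observed = {c: 0.0 for c in CATEGORIES}
    for judgment, n in true_counts.items():
        shares = placement[judgment]
        if abs(sum(shares.values()) - 1.0) > 1e-9:
            raise ValueError(f"shares for judgment {judgment} do not sum to 1")
        for category, share in shares.items():
            observed[category] += n * share
    return observed


def top_two_share(counts):
    """Share of responses in the "top two boxes" (+3 and +2)."""
    return (counts[3] + counts[2]) / sum(counts.values())
```

If every judgment were placed entirely in its own category, the observed counts would equal the “true” counts. Any share placed in another category moves responses between categories, which is how systematic misplacement of “quite” positive judgments on “+1” can lower the observed top-two-box share.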
The simulated results show that there would be large effects on the actual response distributions (Tables 6 and 7). The share of positive judgments would be underestimated by between 10 and 13 percentage points and the share in the “top two boxes” would be underestimated by about 16 percentage points. The results also show that the share of negative judgments would be overestimated by three to four percentage points. Thus, the simulations suggest that research results would be biased since the non-substantive variance in the mapping of judgments does not cancel out between the different judgment levels (RQ6).
Table 6. Simulated response distribution on SD response scale.
Table 7. Simulated response distribution on AD response scale.
Discussion
The results of this study show that unlabeled response scales introduce a significant amount of non-substantive variance in the data collected, particularly for non-extreme judgments such as “quite” and “slightly,” as demonstrated by the considerable variation in which response category respondents select for the same judgment intensity. Also, the results show that AD response scales have lower levels of agreement than SD response scales and that a majority of respondents treated AD response scales as de facto SD response scales.
Unlabeled response scales are common in academic research, particularly for SD items. The results of this study suggest that the amount of non-substantive variance in the data from these studies is unnecessarily high. As a consequence correlations between constructs may be attenuated, if the non-substantive variance is random measurement error, or inflated, if it is correlated measurement error (Andrews, 1984). There is also a risk that unlabeled response scales introduce bias in the data, that is, a tendency for scores to be consistently higher or lower than the “true” score (Andrews, 1984), as suggested by the results of the simulations of actual response distributions. Thus, the use of unlabeled response scales could lead to both false negative and positive results when estimating the relationship between two constructs, as a result of increased non-substantive variance, and erroneous classification of objects (e.g., in manipulation checks), as a result of bias.
The risks for marketing research of unlabeled response categories are the same as for academic research. However, the consequences, particularly of bias, could be more serious. The simulated response distributions suggest that the share of positive respondents could be substantially underestimated. For any company, it is a major difference whether 40% or 55% of respondents’ answers are placed in the “top two boxes” on, say, a brand liking question.
The recommendation based on the present results is that all response categories should always be labeled because this is likely to reduce non-substantive variance and bias in the data (assuming that fully labeled response scales, as suggested by previous research, reduce disagreement over where to place judgments; see, for example, Krosnick & Berent, 1993; Peters & McCormick, 1966). The results also suggest that it is better to use SD response scales for inherently bipolar attributes (e.g., “good-bad”) than to use AD response scales, as respondents appear to diverge in their interpretation of the disagree end of the response scale.
A potential limitation of this study is that it relied on instructions to respondents to imagine that they were answering a questionnaire and to select response categories for given verbally presented judgments. This is an artificial approach that is different from how respondents normally select response categories when answering questionnaires, and it relies on respondents’ willingness and ability to follow the instructions. The extent to which this might have harmed the validity of the responses, if at all, is not clear. However, the methodology in this study is not fundamentally different from other forms of questionnaire-based data collection, which also rely on respondents’ willingness and ability to follow instructions. Moreover, the limited number of straight-line responses (reported in the “Method” section) suggests that most respondents did pay attention to the instructions and questions. The research approach also assumes that respondents’ unobserved judgments are verbal and that they use similar judgment category labels. The extent to which this is true is not known, and future research should investigate what type of judgment labels people use when thinking about evaluative judgments (e.g., verbal or numerical) and how they are denominated.
It cannot be taken for granted that the online panel sample used in this study is representative and some caution is warranted in generalizing the results to the entire adult UK population. However, the sample was similar to the UK population in terms of age, gender, and income, and it seems likely that the results would apply to large parts of the population.
Finally, although previous research has demonstrated that labeled response scales outperform unlabeled response scales (e.g., Moors et al., 2014), it is not safe to assume that labeled response scales will yield perfect results. Future research should include a control group with fully labeled response scales to provide a direct comparison with unlabeled response scales.
