Abstract
Many studies have shown that vague or ambiguous questions are often interpreted idiosyncratically by respondents and thus can increase measurement error. This article provides some evidence that the cognitive effort required to comprehend survey questions affects data quality in a similar way. A web survey experiment revealed that respondents receiving less comprehensible questions provided lower-quality responses (as indicated by breakoff rates, number of nonsubstantive responses, number of neutral responses, and over-time consistency) than respondents receiving control questions that were easier to comprehend. Moreover, interaction effects of question comprehensibility with respondents’ verbal skills and their motivation to answer surveys were found. These findings indicate that survey designers should minimize the cognitive effort required to comprehend their questions, and the article suggests specific ways to do so.
Introduction
Asking survey questions that are easily and consistently understood by all respondents is a prerequisite for obtaining reliable and valid data. This notion has been so prominent in the questionnaire design literature that it can almost be conceived as being axiomatic (Fowler 1992). Ideally, respondents find it easy to understand the meaning of a question and interpret it in the way the researcher intended. To achieve these goals, survey designers need to formulate questions that (1) are unambiguous and (2) require little processing effort. Earlier research focused almost exclusively on the first of these two aspects and examined the effects of vague or ambiguous question wordings on interpretation variability. For example, it has been shown that vague relative terms, such as often or substantially, are interpreted quite differently by respondents depending on the content of the question and respondents’ gender, age, education, and race (Bradburn and Miles 1979; Schaeffer 1991). Similar effects have been found for ambiguous or abstract terms, such as exercise, welfare, or most people, which can mean different things to different people (Fowler 1992; Smith 1987; Sturgis and Smith 2010). Hence, vagueness and ambiguity can lower response quality and increase measurement error in the survey data.
The effort required to comprehend a survey question (i.e., its comprehensibility) may affect responses similarly. If questions are difficult to understand (because of their linguistic complexity, for instance), respondents may not be willing or able to invest the additional effort required to overcome these difficulties, and thus may not provide meaningful answers. Instead, they may choose to take it easy and apply satisficing response strategies (Krosnick 1991). For example, if respondents experience difficulties determining the meaning of an attitudinal question, they may decide that they have no strong opinions about this issue and then provide a nonsubstantive (e.g., “don’t know”) or neutral (“neither/nor”) response. Similarly, if confronted with an ambiguous question, respondents may not simply interpret it idiosyncratically, which would be easy for them to do but result in high interpretation variability. Instead, if they perceive the ambiguity and the difficulty involved in answering the question meaningfully, they may decide that resolving this ambiguity is too wearisome and then may offer a nonsubstantive response. In both cases, the difficulty with understanding the meaning of a question would lead to inaccurate answers. Consequently, designing questions to minimize the cognitive effort required to process them is an important strategy for reducing comprehension difficulties and thus response error. The ways in which the cognitive effort required to comprehend a survey question affects response quality have received comparatively little attention to date (see Krosnick 1991 for a theoretical discussion of this issue).
Survey methodologists have only recently begun to identify specific question characteristics that reduce question comprehensibility (e.g., Graesser et al. 2006; Lessler and Forsyth 1996; Saris and Gallhofer 2007; Tourangeau et al. 2000) and to examine their effects on respondent burden experimentally (Lenzner et al. 2010; Lenzner et al. 2011). For example, extending earlier research by Graesser et al. (2006), Lenzner et al. (2010) identified seven psycholinguistic text features that make questions difficult to comprehend: low-frequency words, vague or imprecise relative terms, vague or ambiguous noun phrases, complex syntax, complex logical structures, low syntactic redundancy, and bridging inferences. Their web survey experiment revealed that questions containing these text features produced longer response times compared to similar questions that did not include such text features. These findings were supported by a recent eye-tracking study that showed that respondents fixated longer on questions containing one of these text features and required more fixations to process, reread, and interpret these questions (Lenzner et al. 2011).
Even though these earlier studies provided clear empirical evidence that such text features reduce question comprehensibility, there was limited evidence that this results in poorer data quality. This article looks more closely at the ways in which question comprehensibility, or more specifically, the effort required to comprehend survey questions, affects response quality. It reports the findings of an experiment that examined whether questions including the seven text features affect response quality indicators and whether these effects are moderated by respondent characteristics such as verbal intelligence and motivation.
Theoretical Background
Answering a survey question requires respondents to carry out numerous cognitive tasks (e.g., question comprehension, information retrieval, judgment, response formatting, and editing; see Tourangeau et al. 2000), and hence calls for investing considerable cognitive effort. According to satisficing theory (Krosnick 1991), survey respondents may not always be willing or able to meet these demands and to perform these processes thoroughly and accurately (i.e., to “optimize”). Instead, they may try to shortcut these processes and apply response strategies that simplify the survey endeavor (i.e., to “satisfice”). For example, satisficing response strategies include saying “don’t know” instead of reporting an opinion, selecting the first answer option that seems reasonable, or selecting the midpoint response option. All of these strategies are problematic in surveys, as they increase measurement error and produce lower-quality data.
According to Krosnick (1991), the probability that respondents apply satisficing response strategies is a function of three factors: question difficulty, respondent ability, and respondent motivation. The more difficult a survey question is to understand and to answer, and the lower the respondents’ cognitive abilities and motivation, the more likely they are to satisfice in a survey. Given that cognitive ability and motivation are respondent characteristics that are difficult if not impossible for survey designers to manipulate, designers have only one way to minimize the occurrence of satisficing behavior, namely by reducing question difficulty. One way to reduce question difficulty is by formulating survey questions that are easy for respondents to comprehend.
Ample evidence from psycholinguistics (e.g., Duffy et al. 1988; Haviland and Clark 1974; Horning 1979; Inhoff and Rayner 1986; Kimball 1973; Kintsch and Keenan 1973; Mosier 1941) indicates that survey designers can increase the comprehensibility of their questions by avoiding several problematic text features. These text features are briefly discussed in the following subsections. A more detailed account of the features can be found in Lenzner et al. (2010).
Low-frequency words: People are slower at accessing the meaning of low-frequency words and must work harder to comprehend sentences in which they occur, compared to higher-frequency words (e.g., Inhoff and Rayner 1986). Consider the following example: (Q1a) During the last 4 weeks, how often did you suffer from somatic pain? (Q1b) During the last 4 weeks, how often did you suffer from physical pain?
Vague or imprecise relative terms: Vague or imprecise relative terms are predicates whose meanings are relative rather than absolute, such as often, rarely, or substantially (e.g., Mosier 1941). These terms can be interpreted in various ways, making it potentially difficult for respondents to extract the meaning intended by the survey designer. For example, compare the vague wording in (Q2a) to the more concrete wording in (Q2b): (Q2a) Have you recently seen a doctor? If yes, please provide the number of visits you paid to the doctor. (Q2b) Have you seen a doctor during the last 4 weeks? If yes, please provide the number of visits you paid to the doctor.
Vague or ambiguous noun phrases: Noun phrases with unclear (e.g., cultural events) or ambiguous (e.g., bank) referents are difficult to comprehend because respondents may not immediately know what the noun phrase refers to or which sense of the word is relevant in the question. Consequently, to facilitate comprehension, both ambiguous and abstract words should be avoided in survey questions and replaced by unambiguous and more concrete words. For example: (Q3a) In your free time, how often do you attend cultural events? (Q3b) In your free time, how often do you go to the theater?
Complex syntax: Complex syntactic structures (e.g., left-embedded syntax, propositionally dense sentences) quickly overload the processing capabilities of readers and require rereadings of unclear parts of the question. For example, consider the left-embedded syntax in (Q4a) in contrast to the right-embedded syntax in (Q4b): (Q4a) How likely is it that if a law was considered by parliament that you considered to be unjust or harmful, you, acting alone or together with others, would try to do something against it? (Q4b) How likely is it that you, acting alone or together with others, would try to do something against a law that was considered by parliament and that you believed to be unjust or harmful?
Complex logical structures: Questions with complex logical structures (e.g., with numerous logical operators such as or) require respondents to remember a large amount of information while simultaneously processing other, new information. Thus, they quickly overload respondents’ working memory capacity. For example: (Q5a) There are many ways people or organizations can protest against a government action or a government plan they strongly or at least somewhat oppose. In this regard, do you think the following should be allowed? Organizing public meetings to protest against the government. (Q5b) There are many ways people or organizations can protest against a government action they strongly oppose. In this regard, do you think the following should be allowed? Organizing public meetings to protest against the government.
Low syntactic redundancy: Low syntactic redundancy reduces the predictability of the grammatical structure of a question (Horning 1979) and thus makes it harder for readers to comprehend the course of action. Syntactic redundancy can be increased, for instance, by avoiding nominalizations (i.e., verbs that have been transformed into nouns). Compare the nominalization in (Q6a) with the verb phrase in (Q6b): (Q6a) Do you agree or disagree with the following statement? Trade unions are important for the job security of employees. (Q6b) Do you agree or disagree with the following statement? Trade unions are important to secure the jobs of employees.
Bridging inferences: Drawing bridging inferences is a time-consuming process (see, e.g., Myers et al. 2000) that is required if the actual survey question is preceded by an introductory sentence and if information from both sources has to be connected. The following example illustrates this: (Q7a) All systems of justice make mistakes. What do you think is worse, to convict an innocent person or to let a guilty person go free? (Q7b) All systems of justice make wrong verdicts. What do you think is worse, to convict an innocent person or to let a guilty person go free?
While respondents answering (Q7a) need to infer that by “making mistakes” the questionnaire designer refers to making wrong judgments, the wording in (Q7b) makes clear that the question focuses on this one particular instance of judicial error.
Method
Design and Hypotheses
To examine the effects of these psycholinguistic text features on response quality indicators, I conducted an experiment in which respondents were asked to complete two web surveys during March and April 2010. In the first survey, respondents were randomly assigned to one of two questionnaire versions: a questionnaire including the seven problematic text features that reduce question comprehensibility (text feature condition) or a questionnaire that did not include any questions with such features (control condition).
Dependent variables in this first survey were breakoff rates, number of nonsubstantive responses (“Don’t knows” or skipped questions), and number of neutral responses (midpoint responses) as response quality indicators (see Galesic 2006; Knäuper et al. 1997; Velez and Ashworth 2007). Assuming that lower question comprehensibility reduces response quality, I expected to find more breakoffs, more nonsubstantive responses, and more neutral responses in the text feature condition than in the control condition (Hypothesis 1a). Moreover, according to satisficing theory (Krosnick 1991), I hypothesized that these effects would be greater among respondents low in verbal intelligence (i.e., cognitive ability) and/or motivation (Hypothesis 1b).
Respondents who completed the first survey were reinvited to participate in a second web survey 2 weeks after the initial invitation. This second survey asked exactly the same questions as the first one, making it possible to assess the reliability of the responses in both conditions. By comparing the answers given in the first web survey with the answers given in the second web survey, an index of over-time consistency could be calculated (Krosnick et al. 2002; Poe et al. 1988). Higher over-time consistency is an indicator of higher reliability and thus superior response quality. Assuming that responses to the text feature questions are inaccurate, I hypothesized that the over-time consistency would be lower in the text feature condition than in the control condition (Hypothesis 2a) and that this effect would be more pronounced among respondents low in verbal intelligence and/or motivation (Hypothesis 2b).
Respondents
Respondents were recruited from the German nonprobability online panel Sozioland (Respondi AG). Members of this panel have signed up online to receive invitations for surveys on all kinds of topics covering society, media, health, and politics. For participation in this survey, panelists did not receive any incentives. Of the 7,581 panel members who were invited, 1,195 participated in the first web survey. Some respondents were excluded from the data set because they either finished the survey after breakoff (n = 12), dropped out of the study before answering any experimental question (n = 152), reported having been interrupted or distracted during answering (n = 133), clicked through the survey without answering (“lurkers,” n = 1), cheated on the Vocabulary and Overclaiming Test (VOC-T, claiming to know the meaning of two or three fake words or skipping the test, n = 8; see the next section for a description of the test), or did not complete the survey (n = 64; only considered in the analysis of breakoff rates), leaving 825 respondents in the analysis and resulting in a response rate of 10.9% (AAPOR RR1). Of these, 52% were female and 48% were male; 58.1% had received 12 or more years of schooling, 33.2% had received 10 years, and 8.7% had received 9 or fewer years of schooling. Respondents were between 16 and 77 years of age, with a mean age of 42 (SD = 13.3). Following the random assignment, the two groups consisted of 407 respondents in the text feature condition and 418 respondents in the control condition.
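The sample accounting above can be retraced with a few lines of arithmetic. The following Python sketch uses only the counts reported in this section; the variable names are illustrative, and the breakoff base (completers plus breakoffs) is an assumption about how the breakoff analysis was set up rather than something the text states.

```python
# Counts as reported in the Respondents section.
invited = 7581
participated = 1195

# Exclusions, keyed by the reason given in the text.
exclusions = {
    "finished after breakoff": 12,
    "dropped out before any experimental question": 152,
    "interrupted or distracted": 133,
    "lurker": 1,
    "cheated on or skipped the VOC-T": 8,
    "did not complete (kept for the breakoff analysis only)": 64,
}

analysis_sample = participated - sum(exclusions.values())

# Assumption: the breakoff analysis uses completers plus breakoffs.
breakoff_base = analysis_sample + exclusions[
    "did not complete (kept for the breakoff analysis only)"
]

response_rate = 100 * analysis_sample / invited

print(analysis_sample)          # 825
print(breakoff_base)            # 889
print(round(response_rate, 1))  # 10.9
```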
These 825 respondents were reinvited to answer the second online questionnaire. In total, 515 (62.4%) respondents completed this second survey, allowing for the calculation of over-time consistency estimates for 248 respondents in the text feature condition and for 267 respondents in the control condition. Respondents in the second survey were between 16 and 77 years of age, with a mean age of 44 (SD = 12.5), and 51.3% were female. A total of 56.8% of the participants had received 12 or more years of schooling; 33.8% had received 10 years, and 9.4% had received 9 or fewer years of schooling. In both surveys, respondents in the two conditions did not differ with regard to gender, age, and educational attainment.
Instruments
The questionnaires in both surveys included 60 (attitudinal, factual, or behavioral) questions on various topics, covering the environment, health, leisure, role of government, national identity, and social inequality (10 questions for each topic). With the exception of one question designed by the author, the questions were taken from the International Social Survey Program, the German General Social Survey, and the German Socio-Economic Panel. To examine the effects of the seven psycholinguistic text features on the response quality indicators, 28 (4 questions per text feature) of the 60 questions were experimentally manipulated so that they contained a problematic text feature in one condition (text feature version) but not in the other (control version). The experimental questions were constructed according to the rewriting rules described in Lenzner et al. (2010). The remaining 32 questions were used as filler items and were asked in the original wording.
To measure respondents’ verbal intelligence, I administered an adapted version of the German vocabulary test (WST; Schmidt and Metzler 1992). In the original version, the WST comprises 42 word sequences, each containing one real word (the target word) and five meaningless words. Participants are instructed to indicate which word in each sequence is the real word. For this study, the WST was modified so that it could be efficiently administered in a web survey. The modified version (WSTmod) included 15 target words of variable word difficulty. For every word, respondents indicated on a 2-point scale (yes/no) whether they knew the meaning of the word and could “explain it to someone else.” Verbal intelligence test scores were obtained by summing up the number of positive responses to the 15 words and hence could range from 0 to 15 (M = 11.43, SD = 2.13).
A potential problem of the WSTmod is that its yes/no answer format is prone to socially desirable responding. However, respondents’ WSTmod scores were not correlated (r = .05, p > .05) with respondents’ scores on a social desirability index (composed of items adapted from Paulhus 1991 and Stöber 1999). Hence, the WSTmod was an acceptable measure of respondents’ verbal intelligence. To assess the convergent validity of the WSTmod, I correlated respondents’ scores on the WSTmod with their scores on a second vocabulary test (VOC-T; Ziegler et al. In press). Respondents’ scores on both measures were highly correlated (r = .64, p < .001). The verbal intelligence scores were moderately correlated with education (r = .37, p < .001); however, earlier research has shown that education is not a good proxy measure for cognitive ability among web survey respondents (see Peytchev 2009). Hence, I found it important to include this more direct measure of respondents’ verbal intelligence in the questionnaire.
As indicators of respondents’ motivation to answer survey questions, I measured their need for cognition (NFC; Cacioppo and Petty 1982) and need to evaluate (NTE; Jarvis and Petty 1996). While NFC is a measure of how much people enjoy thinking and performing effortful mental exercises, NTE is a measure of how opinionated people are and how willingly they engage in evaluation. People who are low in NFC and/or NTE are presumably more prone to satisficing in surveys than those high in these traits (see Krosnick 1991; Toepoel et al. 2009). NFC and NTE are usually measured with 36 and 16 items, respectively. For reasons of efficiency, however, I selected 5 items of the German NFC scale (Bless et al. 1994) and 6 items of the German NTE scale (Collani 2009) on the basis of their factor loadings, discrimination power, and face validity. The raw scores of both scales were combined to calculate an average index of respondent motivation (MOT, Cronbach’s α = .75). The original German and an English version of all instruments used in this study are available from the author on request.
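For readers unfamiliar with the reliability coefficient reported for the MOT index, the following Python sketch shows the standard formula for Cronbach’s α, k/(k − 1) × (1 − Σ item variances / variance of the sum score). The function name and the toy data are hypothetical and are not taken from the study; the article does not describe the software used for this calculation.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item-score lists.

    items: one inner list per item, all of equal length
           (one entry per respondent).
    """
    k = len(items)
    # Sum score of each respondent across all items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Toy data (hypothetical): 3 items answered by 5 respondents.
toy_items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 4],
]
alpha = cronbach_alpha(toy_items)
print(round(alpha, 2))
```

The coefficient approaches 1 when respondents who score high on one item also score high on the others, which is the sense in which it indicates the internal consistency of a combined index.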
Procedure
In total, the first survey consisted of 122 items, with approximately 60% of the items presented on a separate screen. Grids were used for administering the two vocabulary tests (15 items per screen), the NFC and NTE scales (5 and 6 items per screen, respectively), and the social desirability items (5 and 6 items per screen, respectively). All of the experimental questions were presented on separate screens and all items were closed ended, requiring respondents to mark their answers by clicking on a radio button.
First, respondents completed the WSTmod and the VOC-T vocabulary tests, each consisting of 15 words of different word frequency. Then they answered four background questions on gender, age, education, and native language, followed by the NFC and NTE items, as well as three questions on political interest, international environmental laws, and social benefits. Subsequently, respondents were randomly assigned to either the text feature or the control condition. In both conditions, respondents answered a total of 60 questions in randomly ordered blocks of thematically related questions. Of these 60 questions, 28 were experimentally manipulated so that they contained a text feature in the text feature condition but not in the control condition. Finally, respondents answered 11 social desirability items (adapted from Paulhus 1991; Stöber 1999) and three questions on web survey administration and evaluation (problems with Internet connection, interruption or distraction during answering, importance of surveys for society).
On average, respondents in the text feature and control condition completed the first web survey in 19.8 minutes (SD = 9.3) and 18.8 minutes (SD = 7.4), respectively. To answer the second survey, which consisted of the 28 experimental and 32 filler items only, respondents required 12.8 minutes (SD = 9.8) in the text feature condition and 12.0 minutes (SD = 7.5) in the control condition on average.
Results
I first looked at differences in the response quality indicators across the two experimental conditions (Hypotheses 1a and 2a). Except for breakoffs, these analyses were followed by regression analyses and—if appropriate—simple slopes analyses to examine whether and to what extent effects of question comprehensibility were moderated by verbal intelligence and/or motivation (Hypotheses 1b and 2b). The descriptive statistics of the response quality indicators and the predictor variables, as well as the intercorrelations between all variables in both data sets, are shown in Table 1.
Means, Standard Deviations, and Intercorrelations for Response Quality Indicators and Predictor Variables
Note: All coefficients are Pearson correlations. a: 0 = Control questions, 1 = Text feature questions.
*p < .05.
**p < .01.
***p < .001.
Breakoffs
A total of 64 respondents (7.2%) dropped out of the first survey before completing it. As expected, more respondents broke off in the text feature condition (n = 38) than in the control condition (n = 26). However, this difference was not statistically significant (χ2 = 2.4, df = 1, p > .05).
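As a rough check on this comparison, the Pearson chi-square statistic for a 2 × 2 table can be computed directly. The Python sketch below assumes that the table cells are the 38 and 26 breakoffs set against the 407 and 418 completers in the two conditions, and that no continuity correction was applied; neither detail is stated explicitly in the text. The function name is illustrative.

```python
import math


def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table
    [[a, b], [c, d]]. Returns (statistic, p-value) for df = 1."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # For df = 1 the upper tail of the chi-square distribution
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rows: text feature, control. Columns: broke off, completed.
chi2, p_value = pearson_chi2_2x2(38, 407, 26, 418)
print(round(chi2, 1), round(p_value, 2))  # 2.4 0.12
```

Under these assumptions the statistic matches the reported value of 2.4, and the p-value lies above the .05 threshold.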
Nonsubstantive Responses
The tendency to provide nonsubstantive responses was estimated by calculating the number of “Don’t knows” (DKs) and missing answers across the 28 experimental questions. I also analyzed the responses to the 32 filler questions (which were identical in the two questionnaire versions) and found no significant differences between both conditions with regard to the dependent variables (nonsubstantive responses, neutral responses, over-time consistency). These findings suggest that both groups were equivalent in their response behavior to unproblematic questions.
On average, respondents in the text feature condition gave significantly more nonsubstantive responses to the experimental questions (6.2% of the answers) than respondents in the control condition (4.9%), χ2 = 16.1, df = 1, p < .001. In a second step, I fitted two regression models. Since the dependent variable took the form of a count (number of nonsubstantive responses) and the data included a large number of zero counts (i.e., 327 of the 825 cases did not provide any nonsubstantive response), zero-inflated Poisson regression models were estimated (see Federico and Schneider 2007). To confirm that this decision was appropriate, I conducted Vuong Tests (Long 1997) for all zero-inflated regression models that were performed in the analyses.
These tests indicated that the zero-inflated models were more appropriate than the ordinary Poisson regression models (nonsubstantive responses, model 1: z = 5.72, p < .0001, model 2: z = 5.46, p < .0001; neutral responses, model 1: z = 1.80, p < .05, model 2: z = 2.69, p < .01). In each of the reported regressions, the inflation model contained the same set of predictor variables as the count model. The models included question comprehensibility, verbal intelligence (WSTmod), motivation (MOT), and the two- and three-way interactions of these variables. The question comprehensibility variable was dummy coded (0 = control condition, 1 = text feature condition) and the continuous predictor variables WSTmod and MOT were centered prior to analysis (see Whisman and McClelland 2005). Robust standard errors were used in the analyses to adjust for heterogeneity in the models.
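To make the structure of a zero-inflated Poisson model concrete, the following Python sketch writes out its probability function: a logistic inflation part determines the probability of an excess zero, and a Poisson part generates the remaining counts. The parameter values are hypothetical and chosen only for illustration; this is not the estimation routine used in the study.

```python
import math


def zip_pmf(k, lam, pi):
    """Probability of count k under a zero-inflated Poisson model.

    lam: mean of the Poisson (count) component
    pi:  probability of belonging to the 'always zero' group
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # Zeros arise from the inflation part or from the Poisson part.
        return pi + (1 - pi) * poisson
    return (1 - pi) * poisson


def inflation_prob(linear_predictor):
    """Logistic link used for the inflation part."""
    return 1 / (1 + math.exp(-linear_predictor))


# Hypothetical parameter values.
lam = 1.5
pi = inflation_prob(-0.5)

p_zero_zip = zip_pmf(0, lam, pi)
p_zero_poisson = math.exp(-lam)
print(round(p_zero_zip, 3), round(p_zero_poisson, 3))
```

The zero-inflated model assigns more probability to a count of zero than an ordinary Poisson model with the same mean parameter, which is why it suits data with many respondents who gave no nonsubstantive response at all.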
Table 2 summarizes the results of the regression models. In model 1, only question comprehensibility, verbal intelligence, and motivation were included to examine the main effects of these variables on nonsubstantive responses. Statistically significant effects were found for all three variables (comprehensibility: b = .22, p < .05, verbal intelligence: b = –.08, p < .01, motivation: b = –.22, p < .01), indicating that lower levels of comprehensibility, verbal intelligence, and motivation increased the number of nonsubstantive answers. Model 2 also included the two- and three-way interactions of the three individual variables to examine whether the impact of question comprehensibility on providing nonsubstantive answers was moderated by respondents’ verbal intelligence and/or motivation. In this model, the two-way interaction between comprehensibility and verbal intelligence was significant (b = –.12, p < .05), indicating that the effect of question comprehensibility on nonsubstantive responses depended upon the particular level of respondent’s verbal intelligence.
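Because the count part of a Poisson-type model is log-linear, exponentiating a coefficient gives the multiplicative change in the expected count. The short Python sketch below applies this to the model 1 coefficients quoted above; the dictionary labels are illustrative, and the conversion is offered as a reading aid rather than as part of the original analysis.

```python
import math

# Model 1 count-model coefficients as reported in the text.
coefficients = {
    "comprehensibility (text feature vs. control)": 0.22,
    "verbal intelligence (per WSTmod point)": -0.08,
    "motivation (per scale point)": -0.22,
}

# exp(b) is the ratio of expected counts for a one-unit increase.
rate_ratios = {name: math.exp(b) for name, b in coefficients.items()}

for name, ratio in rate_ratios.items():
    print(f"{name}: x{ratio:.2f}")
```

Read this way, the text feature condition corresponds to roughly 25% more expected nonsubstantive responses in the count part of the model, holding the other predictors constant.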
Regression Analyses Summary for Variables Predicting Nonsubstantive Responses
Source: Web Survey 1.
Note: Entries are zero-inflated Poisson regression coefficients and robust SEs. The functional form for the inflation models was the logistic; estimates for these models are not shown. The question comprehensibility variable was dummy coded (0 = control condition, 1 = text feature condition).
*p < .05.
**p < .01.
***p < .001.
It is important to note that the coefficients of the individual predictors in moderator regression models do not estimate main effects (as in model 1) but conditional effects that hold only when all other individual variables have a value of 0 (which represents the mean of the continuous variables that have been centered and the control condition of the categorical variable). Similarly, the two-way interactions are interpreted at a value of 0 (i.e., the mean) for the third variable. Hence, these coefficients should not be interpreted as “main effects” (Whisman and McClelland 2005).
To examine this interaction in more detail, I conducted simple slopes analyses (Aiken and West 1991). These can be employed to determine whether the question comprehensibility effects are larger for respondents low in verbal intelligence (i.e., one standard deviation below the mean) than for respondents high in verbal intelligence (i.e., one standard deviation above the mean). The analyses revealed a significant relationship between question comprehensibility and the propensity to provide nonsubstantive responses for respondents at low levels of verbal intelligence (b = .49, p < .001), but not for respondents at high levels of verbal intelligence (b = .01, p > .05). Hence, the effect of question comprehensibility on providing nonsubstantive responses was more pronounced among respondents with limited verbal skills.
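The logic of a simple slopes analysis can be sketched in a few lines of Python. With a centered moderator, the effect of the condition dummy at a given moderator value is the conditional coefficient plus the interaction coefficient times that value. In the sketch below, the interaction coefficient (–.12) and the WSTmod standard deviation (2.13) come from the text, whereas the conditional coefficient of the condition dummy is a hypothetical placeholder, since it is not quoted in the text.

```python
def simple_slope(b_condition, b_interaction, moderator_value):
    """Effect of the condition dummy at a given value of a centered
    moderator: b_condition + b_interaction * moderator_value."""
    return b_condition + b_interaction * moderator_value


# b_interaction and sd_wst are taken from the text.
# b_condition is a HYPOTHETICAL placeholder, not a reported estimate.
b_condition = 0.25
b_interaction = -0.12
sd_wst = 2.13

low = simple_slope(b_condition, b_interaction, -sd_wst)   # 1 SD below mean
high = simple_slope(b_condition, b_interaction, +sd_wst)  # 1 SD above mean
print(round(low, 2), round(high, 2))
```

With these inputs the two slopes come out close to, but not identical with, the reported values of .49 and .01, because the inputs are rounded and one of them is assumed. The sketch also omits the standard errors needed for the significance tests.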
In contrast to my expectations, comprehensibility did not interact with respondent motivation, suggesting that less comprehensible questions increased the number of nonsubstantive responses for respondents high and low in motivation alike. Also in contrast to my expectations, I found no significant three-way interaction; hence, neither of the two-way interactions was moderated by a third variable.
Neutral Responses
The propensity to give neutral responses was estimated by calculating the number of “neither/nor” responses given to those eight experimental questions that offered a middle category. As hypothesized, respondents answering text feature questions provided more neutral responses (15.5% of the answers) than respondents answering control questions (12.4%), χ2 = 13.5, df = 1, p < .001. Again, I fitted zero-inflated Poisson regression models to examine this effect in more detail (see Table 3). The regression models included the same set of variables as the regression models reported above.
Regression Analyses Summary for Variables Predicting Neutral Responses
Source: Web Survey 1.
Note: Entries are zero-inflated Poisson regression coefficients and robust SEs. The functional form for the inflation models was the logistic; estimates for these models are not shown. The question comprehensibility variable was dummy coded (0 = control condition, 1 = text feature condition).
+ p < .10.
*p < .05.
**p < .01.
Again, model 1 looked at the main effects of the three key independent variables and revealed a marginally significant effect of question comprehensibility (b = .17, p < .10) and a significant effect of motivation (b = –.15, p < .01) on the number of neutral responses. Model 2, which included the two- and three-way interactions of the variables, showed a significant interaction between comprehensibility and motivation (b = –.19, p < .05), qualifying the main effects and suggesting that the effect of comprehensibility on neutral responses depended on respondents’ level of motivation. Simple slopes analyses revealed a significant simple slope for respondents with low levels of motivation (b = .27, p < .05), but not for highly motivated respondents (b = −.04, p > .05). Hence, low question comprehensibility only increased the number of neutral responses for respondents low in motivation.
Surprisingly, the models revealed no effects of verbal intelligence, suggesting that this variable did not affect the likelihood of selecting neutral responses. Moreover, the three-way interaction was again not significant, so the significant two-way interaction of comprehensibility with motivation was not moderated by respondents’ verbal intelligence.
Over-Time Consistency
To examine the consistency of respondents’ answers to the same questions across the two web surveys, I calculated the gross error rate (i.e., the simple response variance) for 26 of the 28 experimental questions (see Poe et al. 1988). Two questions were excluded from these analyses because they asked about behaviors during a specific time period (e.g., “during the last 4 weeks”), and thus were not comparable across the two surveys. To calculate the gross error rate, I computed a new variable for each question, coded 0 for respondents who gave the same answers and 1 for those who gave different answers in the two surveys (see Krosnick et al. 2002). The average gross error rate across all 26 questions was significantly higher in the text feature condition (35.0%) than in the control condition (32.9%), indicating that the text feature questions reduced the reliability of responses (χ2 = 6.8, df = 1, p < .01). Given that the dependent variable (i.e., number of inconsistent responses) did not contain any zero counts, I fitted Poisson regression models to look for any interaction effects (Table 4).
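The gross error rate for a single respondent can be illustrated with a short Python sketch. Here each item is scored 1 when the two answers differ, so that the mean of the scores equals the share of inconsistent answers. The answers are hypothetical, the function name is illustrative, and the handling of missing or “don’t know” answers is left out because the text does not describe it.

```python
def gross_error_rate(wave1, wave2):
    """Share and number of items answered differently across two waves.

    wave1, wave2: equal-length lists of answers to the same items.
    """
    differs = [int(a != b) for a, b in zip(wave1, wave2)]
    return sum(differs) / len(differs), sum(differs)


# Hypothetical answers of one respondent to 5 items on a 5-point scale.
first_survey = [1, 3, 4, 2, 5]
second_survey = [1, 2, 4, 2, 4]

rate, n_inconsistent = gross_error_rate(first_survey, second_survey)
print(rate, n_inconsistent)  # 0.4 2
```

The count of inconsistent answers per respondent is the kind of dependent variable that a Poisson regression can model, while the rate averaged over respondents gives the condition-level percentages.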
Table 4. Regression Analyses Summary for Variables Predicting Over-Time Consistency of Responses
Sources: Web Survey 1 and 2.
Note: Entries are Poisson regression coefficients and robust SEs. The question comprehensibility variable was dummy coded (0 = control condition, 1 = text feature condition).
+ p < .10.
*p < .05.
**p < .01.
***p < .001.
Again, model 1 included only the main effects and revealed significant effects of question comprehensibility (b = .06, p < .05) and verbal intelligence (b = −.02, p < .01) on over-time consistency. The reliability of responses was not affected by respondents’ level of motivation (b = −.02, p > .05). Moreover, model 2 revealed no significant two- or three-way interaction predicting the consistency of responses, so the relation between question comprehensibility and over-time consistency was moderated neither by verbal intelligence nor by motivation.
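To make the size of these coefficients easier to read, they can be exponentiated into rate ratios. The sketch below assumes the Poisson models use the usual log link, so that exp(b) is the multiplicative change in the expected number of inconsistent responses per one-unit increase in the predictor; the scaling of the verbal intelligence measure is not restated here, so its "unit" is left generic.

```python
import math


def rate_ratio(b):
    """Multiplicative change in the expected count for a one-unit increase."""
    return math.exp(b)


# Main-effect coefficients reported for model 1.
coefficients = {
    "question comprehensibility (text feature vs. control)": 0.06,
    "verbal intelligence (per unit)": -0.02,
}

for name, b in coefficients.items():
    ratio = rate_ratio(b)
    change = (ratio - 1) * 100
    print(f"{name}: rate ratio = {ratio:.3f} ({change:+.1f}%)")
```

Read this way, the text feature condition corresponds to roughly 6% more inconsistent responses than the control condition, and each additional unit of verbal intelligence to roughly 2% fewer.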
Discussion and Conclusion
This study has found substantial effects of low survey question comprehensibility (operationalized by seven text features that undermine comprehension) on response quality indicators: Respondents receiving less comprehensible questions were (nonsignificantly) more likely to drop out of the survey, and they provided significantly more nonsubstantive (DKs and missings), more neutral (i.e., midpoint), and less reliable responses than respondents answering comprehensible questions. Moreover, some of these effects were conditional on respondents’ verbal skills (nonsubstantive responses), while others were conditional on respondents’ motivation (neutral responses). Taken together, these findings indicate that survey data quality is reduced if questions are difficult to understand and exceed the processing effort that respondents are willing or able to invest.
With regard to satisficing theory, the study did not find any three-way interaction of question comprehensibility (i.e., task difficulty), verbal intelligence (i.e., cognitive ability), and motivation of the kind in which, for example, the question comprehensibility effects would be strongest among respondents low in both verbal intelligence and motivation. Thus, survey satisficing may not generally be the result of a three-way interaction of these variables. Instead, the significant two-way interactions suggest that respondents employ specific response strategies depending on their level of verbal intelligence, on the one hand, and on their level of motivation, on the other hand.
When confronted with less comprehensible questions, respondents with limited verbal skills (irrespective of their level of motivation) tended to provide nonsubstantive responses, whereas those with low motivation (irrespective of their verbal abilities) tended to provide neutral responses. It is conceivable that respondents with limited verbal skills prematurely decide that they do not have the necessary information to answer the questions if they already have trouble understanding what these are about. Thus, these respondents may not even try to interpret the questions correctly but may satisfice instead by selecting a nonsubstantive response. On the other hand, respondents with low motivation to answer the questions may prematurely decide that they do not have or do not want to generate an opinion about the issue in question if understanding the question is burdensome. Hence, these respondents may satisfice by selecting a neutral response even though they might have been able to report an opinion. These issues call for future experimental studies that explore in more detail the underlying mechanisms that evoke these specific response strategies.
There are two limitations to this study. First, respondents were drawn from a nonprobability online panel, which may restrict the generalizability of the results. In addition, the low response rate (10.9%) together with the low breakoff rate (7.2%) suggests that only a small proportion of panel members, presumably the highly motivated ones, participated in the survey. These respondents may have been less influenced by the less comprehensible questions than less-motivated respondents, who may have exhibited even more satisficing behavior. Second, better-educated respondents were overrepresented in the sample. However, assuming that more-educated respondents are better and more competent readers, the question comprehensibility effects could have been even stronger if the sample had included a larger number of less-educated respondents.
The effects of question comprehensibility on response quality revealed in this study have some practical implications for the formulation of survey questions. Whenever possible, survey designers should try to minimize the cognitive effort required to comprehend a survey question by eliminating problematic text features such as low-frequency words, vague noun phrases, and complex syntactic structures. Specifying the text features and their relation to question comprehensibility may help practitioners systematically check and improve the comprehensibility of their questions. Manuals describing these features in detail may supplement the existing guidelines for asking questions and lend further precision to these rules.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
