Abstract
Survey researchers often ask a series of attitudinal questions with a common question stem and response options, known as battery questions. Interviewers have substantial latitude in deciding how to administer these items, including whether to reread the common question stem on items after the first one or to probe respondents’ answers. Despite the ubiquity of these items, there is virtually no research on whether respondent and interviewer behaviors on battery questions differ over items in a battery or whether interview behaviors are associated with answers to these questions. This article uses a nationally representative telephone survey with audio-recorded interviews and randomized placement of items within four different batteries to examine interviewer and respondent behaviors and respondent answers in battery questions. Using cross-classified random-effects models, the authors find strong evidence that there is more interviewer–respondent interaction on items asked earlier in the battery. In addition, interviewer and respondent behaviors are associated with both substantive and nonsubstantive answers provided to battery items, especially if the interviewer decided to reread or probe with the response options. These results suggest that survey designers should follow recommendations to randomize battery items and consider the importance of standardization of question administration when designing battery questions.
1. Introduction
Battery items are commonly used in interviewer-administered surveys, such as the General Social Survey (Smith et al. 2017) and the American National Election Studies (American National Election Studies 2017), and are fundamental to the creation of most scales, such as the Conflict Tactics Scales (Straus 1979). In a battery, a common question stem is presented before a list of items with a shared set of response options, often mentioning that additional items are forthcoming (Dillman, Smyth, and Christian 2014; Saris and Gallhofer 2007; Siminski 2008). The order of items within the battery may be randomized or fixed. The respondent may see this common stem and response categories on a show card in face-to-face surveys, but such a visual aid is not possible in telephone surveys.
Standardized interviewing guidelines instruct interviewers to deliver each survey question as written in an attempt to avoid off-script conversation and, in turn, to minimize interviewer influences on responses. Conversational interviewing, on the other hand, allows interviewers flexibility to assist respondents, with the goal of increasing understanding (e.g., Schober and Conrad 1997). Although the foundational texts on standardized interviewing (e.g., Fowler and Mangione 1990) describe how to administer individual items, they are silent about the special situation of batteries. In survey practice, interviewers are required to read the battery’s question stem and response options with the initial item or two, and then are allowed discretion to repeat the stem and/or response options as they think necessary for later items (Dykema et al. forthcoming; Fowler 1995). Interviewers may also probe in a standardized way by repeating the question or the response categories rather than stating the question stem or response categories initially (Fowler and Mangione 1990). As such, the survey industry administers battery items as a compromise between strictly standardized interviewing (Fowler and Mangione 1990) and conversational interviewing (Beatty 1995; Schaeffer 1991; Suchman and Jordan 1990). In particular, battery items deviate from the prototypical standardized interviewing procedures (Cannell, Miller, and Oksenberg 1981; Fowler and Mangione 1990) in that interviewers are not clearly instructed in the questionnaire itself about when they should reread the introduction or the response options to the respondent (Dykema et al. forthcoming) and from prototypical conversational interviewing in that their follow-up options are scripted in a standardized interviewing fashion. This deviation is important because survey researchers may assume that the question stem for a battery is read as scripted only on the first item in a battery in a fully standardized manner.
Yet survey practitioners train interviewers to administer batteries differently from how the questionnaire suggests, with optional rereading of the question stem and response options throughout the battery, similar to conversational interviewing. That is, the commonly accepted and ostensibly standardized procedure for administering battery items in today’s surveys is not clearly standardized or conversational, but somewhere in between.
Battery items provide an interesting case study for a type of item that requires being attuned to conversational cues (e.g., Suchman and Jordan 1990). For later items in the battery, repeating the question stem and response options on each item may be perceived as unnecessarily repetitive. In this case, interviewers may be perceived as breaking a fundamental rule of cooperative communication—that one should not give more information than is required (Grice 1975)—and the application of strict standardization by reading the full question stem on each item may undermine measurement quality. However, the conversational interviewing literature largely has been silent about batteries, primarily focusing instead on factual and autobiographical reports (Belli, Shay, and Stafford 2001; Conrad and Schober 2000; West, Conrad, Kreuter, and Mittereder 2018).
In addition, battery items put heavy demands on respondents’ working memory, especially in telephone surveys, where information is presented orally. In telephone interviews, respondents have to remember information from the introductory question stem and response options while processing subsequent items (Tourangeau, Rips, and Rasinski 2000). Respondents mentally insert each new item into the recalled introductory stem to have a complete question and then generate a response mapped to a recalled response option. There are no visual aids in a telephone survey, unlike self-administered surveys or show cards in face-to-face surveys. As a result, respondents may request help in remembering the question stem or the response categories. Thus, in telephone surveys especially, giving interviewers the discretion to reread the question stem and response options as needed allows them to react to any confusion or forgetfulness stemming from this high memory demand and burdensome question format.
Thus, survey practitioners face a quandary: how to design and administer battery items in telephone surveys to minimize unnecessary repetition, help respondents remember necessary information, and reduce the potential for interviewer effects on responses. To date, there is little research on how interviewers and respondents administer and answer these questions and how this administration may be associated with respondents’ answers. In this article, we empirically examine how interviewers and respondents administer and answer four different sets of battery items in a nationally representative telephone survey.
2. Background
2.1. Interview Behaviors and Item Location
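It is unknown when interviewers use the discretion they are given in administering batteries or when respondents request help in answering these items. Thus, this is our first research question: Do interviewer and respondent behaviors in battery questions differ by the location of items in the battery?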
Standardized survey interviewers train respondents on how to answer survey questions and otherwise be a “good” respondent (Fowler and Mangione 1990). In battery questions, interviewers may need to read or repeat the question stem and/or response options on items early in the battery to help train respondents for later items. After a few repetitions, reading this information may no longer be needed as information that has been repeated is easier to recall (Peterson 1966). If interviewers do not read the optional information, respondents may have to ask for assistance (Dijkstra and Ongena 2006). In general, respondents need to remember and integrate multiple pieces of information, and they will have less practice with this on early items. On the other hand, interviewer rereadings and respondent requests for clarification may occur more frequently on later items as respondents get farther from having heard the question stem and response options (i.e., memory decay) and are more fatigued. Thus, (Hypothesis 1a) if respondent learning/training is occurring, we expect that there will be more conversational turns and higher rates of interviewers reading the introductory question stem and/or response options, probing by rereading the question stem and response categories, and respondents asking for the question to be repeated on items presented earlier in a battery than on items presented later in a battery. We also expect that the decline will be steeper for earlier items than for later items. Conversely, (Hypothesis 1b) if forgetting is occurring, we expect more of these behaviors on later items.
2.2. Item Location and Respondent Answers
Ideally, the location of the item in a battery is independent of the answers that are given to the item. Thus, this is our second research question:
Generally, respondents are thought to take cognitive shortcuts (i.e., satisfice) when they experience increased burden, working-memory challenges, and/or a series of items that are very similar to one another, all of which commonly occur in batteries (Alwin and Beattie 2016; Krosnick 1991; Krosnick and Alwin 1987), although empirically assessing satisficing is challenging (Alwin and Beattie 2016). There may be higher rates of “don’t know” (DK) or refusal (REF) responses on early items in the battery because respondents are doing extra cognitive work to learn the response options. Alternatively, because of fatigue, respondents may be more likely to acquiesce (i.e., say yes or agree with a statement regardless of the content of the item; Schuman and Presser 1981) or otherwise answer in ways that shortcut the response process (such as providing DK/REF answers) to reduce their burden on later items. Empirically, previous work has shown inconsistent or null findings on item content and location (Krosnick and Presser 2010).
Thus, (Hypothesis 2a) if respondent learning is occurring, we expect higher rates of DK/REF on early items in a battery but no clear pattern for item endorsement. (Hypothesis 2b) If satisficing and respondent fatigue are occurring, we expect higher rates of DK/REF and answers indicating acquiescence for later items in a battery. Furthermore, questionnaire design texts (e.g., Sudman, Bradburn, and Schwarz 1996) often suggest randomizing the order of items to avoid systematic context effects; (Hypothesis 2c) if this recommendation holds, we do not expect to see systematic differences in the levels of DK/REF or endorsement of particular items depending on their location.
2.3. Interview Behaviors and Respondent Answers in Battery Items
Understanding whether interviewer and respondent behaviors are associated with respondents’ answers is fundamentally important for understanding the implications of the quasistandardized battery question administration strategy that the survey industry has adopted. As such, this is our third research question:
Battery items’ unique features of high respondent working-memory demands and interviewer discretion in administering them interact to create a potentially problematic measurement situation. Battery items consistently have been shown to have lower reliability and validity than other items in a questionnaire (Alwin 2007, chap. 8; Alwin 2010; Alwin and Beattie 2016; Dykema, Schaeffer, and Garbarski 2016; Saris and Gallhofer 2007). For instance, Alwin (2007) compares questions in battery items that share a question stem and response options, questions in a topically related series that do not share question stems and response options, and questions that are not topically related to the surrounding items, and finds that items in a battery have lower reliability than those that are in a topically related series, which in turn have lower reliability than stand-alone questions. We know surprisingly little about why this is the case.
Battery questions are unique in that the variation in initial question reading is likely to be more constrained (e.g., reading the scripted question stem or the response options) than in other types of questions, and they may be used to head off potential respondent problems or confusion about the question. For instance, Dykema, Schaeffer, Garbarski, et al. (2016) found that respondents are less likely to exhibit problematic response behaviors when interviewers read optional parenthetical information in question stems versus not reading that information, a decision process similar to that for battery items. In addition, prior research has shown that question-reading behaviors can have downstream effects on respondent behaviors and answers and on interviewer behaviors, and they may influence rapport between the respondent and interviewer (e.g., Fowler and Mangione 1990; Garbarski, Schaeffer, and Dykema 2016; Holbrook et al. 2015, 2016). Thus, (Hypothesis 3a) we hypothesize that including the question stem and response options in the item administration reminds the respondent about the response task, decreasing the rate of final DK/REF responses.
To remedy respondent problems, interviewers often turn to probing behaviors. For instance, behaviors reflecting respondent problems with comprehension and mapping have been found to positively predict interviewer probing (Holbrook et al. 2016), and probing is associated with higher interviewer variance (Fowler and Mangione 1990; Mangione, Fowler, and Louis 1992) and response inaccuracies (e.g., Belli and Lepkowski 1996). Standard practice when interviewers receive a DK or, sometimes, REF response is to probe for a substantive answer; the additional conversational turns are therefore a consequence of the DK response rather than a cause of it. As a result, (Hypothesis 3b) we expect that the number of conversational turns and interviewer probing will be associated with higher rates of DK/REF answers. Finally, (Hypothesis 3c) we expect respondent requests for help to be associated with lower rates of final DK/REF answers because respondents are likely to ask for help if they want to provide a substantive answer but are having difficulty doing so.
For substantive answers, it is harder to anticipate how the interviewer–respondent interaction will affect responses, but we recognize that it is critically important to understand whether these interviewer and respondent decisions are associated with substantive responses. Ideally, the initial reading of the question stem or the response options is not associated with substantive answers. To the extent that it is, reading the response options, especially when they are not scripted, may either make it easier to provide a less thoughtful response (reading the response options cues the respondents that a quick response that maps into one of those categories is desired) or may make it easier to provide a more thoughtful response (the time spent reading the response options can be spent thinking about the content of the question). For probing and respondent assistance, people who feel positively (or negatively) toward a particular attitude object may be more inclined to ask for assistance or answer in a way that requires probing. Alternatively, off-script interaction between interviewers and respondents may contribute to the attitude formation itself, although it is difficult to anticipate the direction in which this occurs. Furthermore, the interaction with the interviewer may motivate the respondents to do the hard work of answering survey questions and reduce acquiescence. For these reasons, we (Hypothesis 3d) evaluate whether measures of interviewer–respondent interaction are associated with survey responses but do not have concrete directional hypotheses.
3. Data
The data come from the U.S.–Japan Newspaper Opinion Poll, a dual-frame random-digit-dial telephone survey of U.S. adults in landline and cell phone households conducted during November 2013 (American Association for Public Opinion Research Response Rate 1 = 7.4%, N = 1,005). Gallup conducted the 13-minute-long general public opinion poll, which included questions about confidence in American institutions, attitudes toward various Asian countries, and demographics.
We examine four battery questions ranging in length from 6 to 14 items (Figure 1). Each respondent received the items in each battery in a different randomized order. Because of this important design feature, the item location within each battery can be disentangled from the content of the items themselves. The first battery (B1), the first question of the survey, had 14 items asking about confidence in institutions in American society. The next two batteries, asking about countries that will become a military threat (B2, 13 items) and issues about China (B3, 8 items), were positioned in the middle of the survey. The fourth battery, asking about North Korea (B4, 6 items), was administered at the end of the substantive questions before the demographic questions. All of the items within each battery have the same set of dichotomous substantive response options and standard (unread) DK and REF response options, although the substantive response options vary across the batteries (B1: do, do not; B2: yes, no; B3: concerned, not concerned; B4: yes, no).

Figure 1. The wording in four survey battery questions.
3.1. Item Location
The location of the items within the battery is a key independent variable. For each respondent in each battery, the first item that was asked received a 1, the second item asked received a 2, and so forth, resulting in a variable that captures the item’s randomly assigned cardinal position in the battery (Alwin 2007).
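As a minimal sketch of how such a position variable could be built, the following assumes a hypothetical input layout: a list of (respondent, battery, item) records in the order the items were actually asked. The function and field names are illustrative and are not taken from the study's data files.

```python
from collections import defaultdict


def assign_item_positions(administrations):
    """Map (respondent_id, battery, item_id) to its 1-based position.

    `administrations` is a list of (respondent_id, battery, item_id) tuples
    listed in the order the items were asked (hypothetical layout).
    """
    counters = defaultdict(int)  # items asked so far per respondent and battery
    positions = {}
    for respondent_id, battery, item_id in administrations:
        counters[(respondent_id, battery)] += 1
        positions[(respondent_id, battery, item_id)] = counters[(respondent_id, battery)]
    return positions


# Two respondents who received the same items in different random orders.
asked = [
    ("r1", "B4", "item_c"),
    ("r1", "B4", "item_a"),
    ("r2", "B4", "item_a"),
    ("r1", "B4", "item_b"),
    ("r2", "B4", "item_c"),
]
positions = assign_item_positions(asked)
```

Because the order is randomized per respondent, the same item (here `item_a`) takes different position values for different respondents, which is what lets location be separated from content.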
3.2. Interviewer and Respondent Behaviors
We examine six interviewer and respondent behaviors using behavior coding, a systematic coding of the interviewer and respondent behaviors in a survey interview (Fowler and Cannell 1996; Ongena and Dijkstra 2006). Due to the cost of transcribing and coding interviews, we selected a random subset of 467 audio recordings of the 1,005 interviews. Ten of these interviews were partial interview recordings, and 19 were conducted in Spanish, leaving 438 English-language interviews completed by 31 interviewers who had conducted at least 10 interviews, which facilitated model estimation (van Breukelen and Moerbeek 2013; Vassallo, Durrant, and Smith 2017). Each recorded interview was transcribed, and the transcripts were behavior coded by a team of trained undergraduates using the Sequence Viewer program (Dijkstra 1999).
We examine behaviors for each item answered across the four batteries. Although 438 interviews were processed, partial recordings, backups (respondents answering questions that had been asked previously), or interviewers not asking all of the battery items yielded slightly smaller analytic sample sizes in B1 (433 respondents), B2 (431 respondents), and B4 (432 respondents).
We code each conversational turn within a question-asking sequence on the actor (interviewer or respondent), his or her initial action (e.g., probing), and an assessment of the initial action (e.g., probing using response options). This initial behavior coding scheme did not adequately capture whether interviewers chose to read the question stem or response options in their asking of each item after the first (where it was required). Thus, a second set of independent coders evaluated whether the initial reading of each item within each battery included the question stem, the response options, or no additional information. Table 1 shows a behavior coding example from the first battery.
Table 1. Behavior Coding Example
Two master coders independently coded 10 percent of the transcripts to assess reliability of the behavior codes. Kappas were 1.00 for the actor, 0.90 for the initial action for the respondents, and 0.95 for the initial action for the interviewers. For the specific sets of behaviors that we examine in this article, defined by combinations of initial actions and assessments, kappa values were 0.98 for whether the question stem or response options were included in the initial question reading, 0.76 for probing by reading the question, 0.69 for probing by reading the response options, and 0.47 for the respondent requesting the question or response options be repeated, all above the recommended threshold of kappa = 0.40 (Bilgen and Belli 2010).
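For readers less familiar with the statistic, the sketch below computes an unweighted two-coder Cohen's kappa, which compares observed agreement with the agreement expected by chance. It assumes that the reported kappas are of this standard form; the code labels and data are invented for illustration.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Unweighted Cohen's kappa for two coders rating the same units."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("Both coders must rate the same non-empty set of units.")
    n = len(codes_a)
    # Proportion of units on which the two coders assigned the same code.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each coder's marginal code frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical code throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes from two coders for ten conversational turns.
coder_a = ["probe", "probe", "none", "none", "probe", "none", "none", "none", "probe", "none"]
coder_b = ["probe", "none", "none", "none", "probe", "none", "none", "probe", "probe", "none"]
kappa = cohens_kappa(coder_a, coder_b)
```

In this invented example the coders agree on 8 of 10 turns, chance agreement is 0.52, and kappa is about 0.58.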
First, we count the total number of conversational turns for each question–answer sequence for each administration of each battery item (Table 2). For paradigmatic sequences, we expect two or three conversational turns—a question administered by the interviewer, a respondent answer, and potential interviewer recognition of that answer (Schaeffer and Maynard 1996). The average number of turns (and range) for items in each battery is as follows: B1, 2.96 (2–31); B2, 2.54 (2–27); B3, 2.76 (2–23); and B4, 2.84 (2–35).
Table 2. Mean Number of Conversational Turns and Interviewer Behavior Rates by Battery
Note: For introductory question reading, the question stem was required for the first item in the battery but not the second and later items. B = battery.
Next we create two dichotomous variables indicating if the question stem (= 1) or response options (= 1) were included versus not included (= 0) when the item was initially asked. Between 0.89 percent and 3.25 percent of initial item readings contained the question stem, and between 1.68 percent and 9.31 percent included the response options. The question stem was read for the first item in each battery for 100 percent of the respondents (although between 20 and 40 percent of the administrations were not read exactly as worded). The response options were a scripted part of the question stem for B1 and B4, and thus were read in all of the first question readings; B2 and B3 had response options of yes and no, which interviewers chose to include when reading the first item in the battery for 53.13 percent of the cases in B2 and 55.79 percent of the cases in B3 (an unscripted reading of the initial question stem for the first item).
We use two dichotomous indicators for whether the interviewer ever probed by repeating the question stem (1 = interviewer probed using question stem, 0 = no probe using question stem) or response options (1 = interviewer probed using response options, 0 = no probe using response options) on any conversational turn for each item in the battery. The specific behavior codes used are shown under the Probing heading of Table 2. Probing using the question stem occurred on between 3.30 percent and 6.48 percent of items; probing using the response options occurred on between 5.07 percent and 12.36 percent of items. Virtually every instance of probing using the response options repeated both of the dichotomous response options (less than 0.37 percent of items were probed using only one response option).
The final variable is a dichotomous indicator for whether respondents ever asked for the question or response options to be repeated or stated “What?” (= 1) versus not engaging in these behaviors (= 0). Respondents requested the question or response options be repeated on 2.84 percent to 6.91 percent of the items.
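The item-level behavior measures described in this section can be derived from turn-level codes roughly as follows. This is a sketch under stated assumptions: the code labels (`probe_question_stem`, `request_repeat`, and so on) are hypothetical stand-ins, not the study's actual behavior codes.

```python
# Hypothetical respondent codes that count as a request for repetition.
REQUEST_CODES = {"request_repeat", "what"}


def summarize_sequence(turns):
    """Collapse one question-answer sequence into item-level measures.

    `turns` is a list of dicts with an 'actor' ('interviewer' or 'respondent')
    and a 'code' (hypothetical label) for each conversational turn.
    """
    codes = [(turn["actor"], turn["code"]) for turn in turns]
    return {
        "n_turns": len(turns),
        # "Ever" indicators: 1 if the behavior occurs on any turn for the item.
        "probe_stem": int(("interviewer", "probe_question_stem") in codes),
        "probe_options": int(("interviewer", "probe_response_options") in codes),
        "resp_request": int(
            any(actor == "respondent" and code in REQUEST_CODES for actor, code in codes)
        ),
    }


# A paradigmatic two-turn sequence and a longer one with a request and a probe.
simple = summarize_sequence([
    {"actor": "interviewer", "code": "ask_item"},
    {"actor": "respondent", "code": "answer"},
])
longer = summarize_sequence([
    {"actor": "interviewer", "code": "ask_item"},
    {"actor": "respondent", "code": "request_repeat"},
    {"actor": "interviewer", "code": "probe_question_stem"},
    {"actor": "respondent", "code": "answer"},
    {"actor": "interviewer", "code": "acknowledge"},
])
```

Coding each behavior as "ever occurred" keeps one row per item administration, which matches the unit of analysis in the models.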
3.3. Respondent Answers
For respondent answers, we first create an indicator variable for whether the respondent provided a nonsubstantive answer of DK or REF (= 1) versus any substantive answer (= 0) for each item in the battery. Overall, item missing rates were quite low (less than 2.5 percent of all answers in any battery were DK/REF). Then, for each battery item with a substantive answer, we create an indicator of providing a final endorsement answer (do, yes, concerned = 1) versus a nonendorsement (do not, no, not concerned = 0) answer; nonsubstantive answers are set to missing. Overall endorsement rates varied substantially across batteries.
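The two outcome variables can be illustrated with a short sketch. The answer strings follow the response options listed for the four batteries; the function name and the use of `None` for "set to missing" are assumptions made for the example.

```python
ENDORSE = {"do", "yes", "concerned"}
NON_ENDORSE = {"do not", "no", "not concerned"}
NONSUBSTANTIVE = {"DK", "REF"}


def code_answer(final_answer):
    """Return (dkref, endorse) for one final answer to a battery item.

    dkref   : 1 for DK/REF, 0 for any substantive answer
    endorse : 1 for endorsement, 0 for nonendorsement, None (missing) for DK/REF
    """
    if final_answer in NONSUBSTANTIVE:
        return 1, None
    if final_answer in ENDORSE:
        return 0, 1
    if final_answer in NON_ENDORSE:
        return 0, 0
    raise ValueError(f"Unrecognized answer: {final_answer!r}")


coded = [code_answer(answer) for answer in ["DK", "concerned", "do not", "yes"]]
```

Keeping the endorsement indicator missing for DK/REF answers means the endorsement models are estimated only among respondents who gave a substantive response.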
3.4. Control Variables
Other factors may also influence interviewer and respondent behaviors or respondent answers. First, we statistically adjust for a set of measures of the complexity of each item in the battery. Items are typically similar to one another in length within a battery, an important predictor of reliability and validity in battery items (e.g., Alwin 2007, 2010; Alwin and Beattie 2016). To account for this, we include the number of words in each item. Because the items are randomized, and thus respondents differ in the “first item” where the question stem is always read, our measure of question length includes the full question stem for each item. The average number of words ranges from 20 to 30 (see Table A1 in the online appendix). We also include the Flesch-Kincaid reading grade level, measured using Microsoft Word (Olson and Smyth 2015; but see Lenzner 2012, 2014). The average item’s Flesch-Kincaid reading level ranges from 6.29 (B3), indicating a reading level around sixth grade, to 12.48 (B4). Three linguistic measures obtained using the online Question Understanding Aid (QUAID) tool (Dykema et al. forthcoming; Graesser et al. 2006) include whether the item contains an unfamiliar technical term (21.5–100 percent of items), a vague or imprecise relative term (including vague quantifiers; approximately 15 percent of items in three of the four batteries; no items in the fourth battery), or a vague or ambiguous noun phrase (including nouns with multiple meanings; approximately 20 percent of items in two batteries and none in the other two). None of the items were identified as having complex syntax as defined by QUAID, and QUAID identified only one item as “working memory overload”; these question characteristics are omitted.
Second, respondents with lower working-memory capacity and cognitive ability, including older adults and those with lower education levels, tend to show more interactional problems and lower data quality (Knäuper et al. 1997; Krosnick 1991; Narayan and Krosnick 1996). Thus, we control for age and education. Regarding age, respondents were asked, “What is your age?” and answers were coded into categories of less than 40 (21.4 percent), 40 to 65 (48.0 percent), and greater than 65 (30.6 percent). For education, respondents were asked, “What is the highest level of education you have completed?” Answers were categorized as less than high school graduate (5.3 percent), high school graduate (21.0 percent), some college (21.7 percent), or college degree or higher (52.1 percent). The missing data rate on both age and education was 1.6 percent; we imputed the modal category for missing values on these variables.
Third, interviewer experience, both overall and accumulated over the course of a field period, is associated with shorter survey length (Olson and Peytchev 2007). In addition, interviewers with more experience elicit more acquiescent responses than those with less experience (Olson and Bilgen 2011). Thus, we control for two types of interviewer experience. The first experience variable is length of employment at Gallup, measured as less than one year at the job (= 0; 48.9 percent) versus one year or more at the job (= 1; 51.1 percent). The second is experience on this particular survey, measured by a count variable capturing the order in which each interviewer’s surveys were completed (i.e., 1 = first interview, 2 = second interview; mean = 15.8; range = 10–25).
3.5. Analysis Methods: Cross-classified Random-effects Models
Behaviors and answers are recorded for each battery item from each respondent, and respondents are nested within interviewers. Thus, four-level cross-classified random-effects models allow us to take into account the nesting of behaviors or responses to items in a battery within interviewers, questions, and respondents. For analyses of the behaviors and DK/REF responses, these models are estimated by combining all four batteries together, yielding a total of 17,761 items (16,027 items when the first item is excluded) across the four batteries and 438 respondents.
To evaluate Research Question 1, we use cross-classified random-effects logistic regression models to predict introductory question reading, probing, and respondent requests for assistance on any item within a battery, and cross-classified random-effects linear regression models to predict the total number of conversational turns. To examine Research Questions 2 and 3, we predict nonsubstantive answers and substantive answers using slightly different approaches. First, we examine whether respondents gave a final nonsubstantive answer (DK/REF = 1) versus any substantive answer (= 0) through a cross-classified random-effects logistic regression model, including all of the batteries in the same model. Among respondents who offered a substantive response, we predict whether they provided a final endorsement (do, yes, concerned; coded 1) versus a nonendorsement (do not, no, not concerned; coded 0) response using cross-classified multilevel logistic regression models. For this set of analyses, because each battery has a different set of response options, we estimate models for the four batteries separately. We estimate these models using Stata 15.0 and the mixed command with restricted maximum likelihood (number of conversational turns) and meqrlogit command with a QR decomposition to estimate the variance components (interviewer and respondent behaviors; survey responses). Variance tests overall and for individual variance components were conducted using both likelihood ratio tests and a mixture of chi-square distributions, as shown in Table 3 (Rabe-Hesketh and Skrondal 2012a, 2012b). Stata code is available from the authors on request.
Table 3. Variance Parameter Estimates for Interviewer and Respondent Behaviors, Base Models with No Covariates and Full Models
Note: Variance tests overall and for individual variance components conducted using a likelihood ratio test and a mixture of chi-square distributions. Variance parameters for the number of conversational turns estimated from a linear model; variance parameters for the interviewer and respondent behaviors estimated from a logistic regression model. The variance partition coefficient is the proportion of the total variance associated with each level in the model from the base (empty) model. The full model contains respondent, interviewer, and question characteristics.
*p < .05. **p < .01. ***p < .001. ****p < .0001.
For our base model for number of conversational turns, for example, we predict
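One conventional way to write an empty cross-classified model of this kind is sketched below. The indices follow the text (item j1, respondent j2, interviewer k); the symbols for the random effects and their variances are labels assumed here for illustration and may differ from the authors' notation.

```latex
y_{(j_1 j_2) k} = \beta_0 + u_{j_1} + u_{j_2} + v_k + e_{(j_1 j_2) k},
\qquad
u_{j_1} \sim N(0, \sigma^2_{u_1}), \quad
u_{j_2} \sim N(0, \sigma^2_{u_2}), \quad
v_k \sim N(0, \sigma^2_{v}), \quad
e_{(j_1 j_2) k} \sim N(0, \sigma^2_{e})
```

Here \(u_{j_1}\) is the item (question) random effect, \(u_{j_2}\) the respondent random effect, \(v_k\) the interviewer random effect, and \(e_{(j_1 j_2) k}\) the residual for a single item administration.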
In our base models, we are interested in the variance partition coefficients (VPCs), the proportion of variance for each level out of the total of the variance components (Goldstein, Browne, and Rasbash 2002), and the intraclass correlation coefficients (ICCs), the level of homogeneity of behaviors within each item, respondent, and interviewer. For example, the VPC for interviewers is calculated as follows:
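A standard form of this coefficient, using assumed symbols for the interviewer, item, respondent, and residual variances (these labels are chosen here and are not necessarily the authors'), is:

```latex
\mathrm{VPC}_{\text{interviewer}}
  = \frac{\sigma^2_{v}}{\sigma^2_{v} + \sigma^2_{u_1} + \sigma^2_{u_2} + \sigma^2_{e}}
```

where \(\sigma^2_{v}\) is the interviewer variance, \(\sigma^2_{u_1}\) the item variance, \(\sigma^2_{u_2}\) the respondent variance, and \(\sigma^2_{e}\) the residual variance. For logistic models, a common convention is to fix the residual variance on the latent scale at \(\pi^2/3 \approx 3.29\).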
By assuming the error variance
We start by examining base (empty) models for each of the behaviors. First, we examine whether the question stem was included (= 1) versus not included (= 0) when the item was initially asked, and the same for the response options. Interviewers vary widely in introductory question reading behaviors of including the question stem (VPC = 27.0 percent, p < .0001) and response options (VPC = 43.7 percent, p < .0001) for Items 2 and later in the battery, reflecting interviewer discretion on these items. Interviewers vary to a lesser degree in the total number of conversational turns (p < .0001), the use of probes with response options (p = .0002), and respondent requests for a repeat of the question (p = .002) (VPCs range from 0.4 percent to 2.2 percent), and interviewers do not vary significantly in probes with the question stem (p = .28). Respondents contribute a significant portion of variability in all of the behaviors (VPCs range from 10.0 percent to 15.7 percent, p < .0001), as do the individual items (VPCs range from 1.8 percent to 8.6 percent, p < .0001). These findings suggest that a multilevel framework is appropriate; although the interviewer variance term is not significantly different from zero in the base model for probing using the question stem, we include interviewer, question, and respondent random effects in all analyses for consistency.
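To make the VPC arithmetic concrete, the sketch below divides each estimated variance component by the total variance. It assumes the common latent-variable convention of fixing the logistic residual variance at π²/3; the variance values are made up and are not estimates from this study.

```python
import math

# Conventional residual variance for a logistic model on the latent scale.
LOGISTIC_RESIDUAL_VARIANCE = math.pi ** 2 / 3


def variance_partition(components, residual=LOGISTIC_RESIDUAL_VARIANCE):
    """Return each level's share of the total variance.

    `components` maps a level name to its estimated variance. For a linear
    model, pass the estimated residual variance as `residual` instead.
    """
    total = sum(components.values()) + residual
    return {level: variance / total for level, variance in components.items()}


# Illustrative (invented) variance components for a logistic base model.
example = {"interviewer": 1.0, "respondent": 0.5, "item": 0.25}
vpc = variance_partition(example)
```

With these invented values, the interviewer share is 1.0 / (1.75 + π²/3), or roughly 20 percent of the total variance.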
The key independent variable for Research Questions 1 and 2 is item location. We include a log-transformed variable for item location to examine whether there was a decline or increase in any of the outcomes over the course of the battery items (the log transformation showed better model fit than a linear term). We also include indicators for the four different batteries to account for differences in battery content, an interaction term between battery and item location to permit differences across the batteries in the rate of decline or increase, the question complexity measures, respondent age, respondent education, and both measures of interviewer experience. For instance, using the above notation, the model predicting the total number of conversational turns for item j1, respondent j2, and interviewer k, is
Appropriate adaptations are made to the model to predict
For Research Question 3, we expand the models from Research Question 2 to include the interviewer and respondent behavior variables as predictors of final DK/REF and endorsement responses. The complete model for DK/REF answers is
Because the total number of turns and the individual behaviors are correlated (e.g., probing occurs only on an additional conversational turn), we first estimate models with just the individual interviewer and respondent behaviors (Model 1) and then add the total number of conversational turns to the model (Model 2). Because the initial question-stem asking behavior is constant across respondents for the first item in every battery, and inclusion of the response options in the initial asking is constant across respondents for two batteries, we estimate models separately for responses to all items and for the second and later items in the battery (Models 3 and 4). The four batteries are examined separately for substantive responses.
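As a reading aid, the cross-classified model for the number of conversational turns described in this section can be written in a generic form. The sketch below is an assumption-laden illustration: the symbols (β, γ, u, v, e), the choice of reference battery, and the vector x of question, respondent, and interviewer covariates are ours and may differ from the authors' notation.

```latex
% Sketch only: symbols are assumptions, not necessarily the authors' notation.
% j_1 indexes items, j_2 respondents, k interviewers; b runs over non-reference batteries.
\begin{aligned}
y_{(j_1 j_2)k} &= \beta_0
  + \beta_1 \log(\text{location}_{(j_1 j_2)k})
  + \sum_{b} \beta_{2b}\,\text{Battery}_b
  + \sum_{b} \beta_{3b}\,\text{Battery}_b \times \log(\text{location}_{(j_1 j_2)k}) \\
 &\quad + \boldsymbol{\gamma}^{\top}\mathbf{x}_{(j_1 j_2)k}
  + u_{j_1} + u_{j_2} + v_k + e_{(j_1 j_2)k}, \\
u_{j_1} &\sim N(0, \sigma^2_{\text{item}}),\quad
u_{j_2} \sim N(0, \sigma^2_{\text{resp}}),\quad
v_k \sim N(0, \sigma^2_{\text{int}}),\quad
e_{(j_1 j_2)k} \sim N(0, \sigma^2_{e}).
\end{aligned}
```

For the binary behaviors and answers, the left-hand side would be replaced by the log-odds of the outcome and the residual term dropped, as in a standard logistic mixed model.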
4. Findings
4.1. Research Question 1: Do Interviewer and Respondent Behaviors in Battery Questions Differ by the Location of Items in the Battery?
Figure 2 shows the average number of conversational turns for each item in the battery by item location. Although the mean number of turns differs across the batteries, in all four batteries, the pattern is identical: items administered earlier in the battery have more conversational turns than items administered later in the battery (Table 4; log[location], p < .0001; battery*log[location], p < .0001). The trend is especially strong for B1, the first and longest battery administered and the first question in the questionnaire, and it is somewhat attenuated for the other batteries. This pattern is independent of the content of the items, due to the randomization of items across locations. Thus, the learning/training hypothesis (Hypothesis 1a) is supported: interviewers and respondents need more conversation on early items in the battery than on later items.

Figure 2. The mean number of conversational turns by item location.
Table 4. Test Statistics for Item Location Predicting Interview Behaviors and DK/REF Responses
Note: Tests reported are from multilevel models accounting for clustering of responses within items, interviewers, and respondents, and account for question characteristics, respondent characteristics, and interviewer experience. DK/REF = “don’t know”/refusal.
Now we turn to individual interviewer and respondent behaviors. We present the full models predicting the initial question reading behavior in Table 5. (Other behaviors are presented in the online appendix.) Once again, our hypotheses about differences in interviewer and respondent behaviors across item location in the batteries are confirmed. In all four batteries, there is a significant decline (p < .0001) in all of the behaviors across item location in the battery (see again Table 4). Figure 3 shows that there are higher rates of including the question stem in the initial reading, reading the response options, probing with the question stem, and probing with the response options for items asked early in the battery. Respondent requests for repeats of the question follow the same pattern (Figure 4). This decline in all of the behaviors is also consistent with the learning hypothesis (Hypothesis 1a). In addition, interviewer probing and respondent requests to repeat questions or response options increase the number of conversational turns, helping explain why the early items have longer interactions.
Table 5. Cross-classified Logistic Regression Coefficients and Standard Errors Predicting Interviewer Introductory Question Reading
Note: AIC = Akaike information criterion; BIC = Bayesian information criterion.

Figure 3. Percentage of interviewer behaviors by item location.

Figure 4. Percentage of respondent requests for repeat by item location.
Looking again at Table 3, we see the variance components for the full models and the percentage reduction in variance from the base model. Between 43.5 percent and 100 percent of the variance due to questions was explained by the covariates included in the models. The covariates included in these models failed to explain much of the interviewer-level and respondent-level variation, and even increased either the interviewer-level or respondent-level variation for five of the six behaviors.
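The percentage reduction in a variance component from the base model to the full model can be computed as sketched below. The helper `variance_explained` and the numbers passed to it are hypothetical illustrations, not estimates from Table 3. A negative result corresponds to the case noted above in which a variance component increases once covariates are added.

```python
def variance_explained(base_variance, full_variance):
    """Proportional change in a variance component from the base to the full model.

    Positive values mean the covariates reduced the component; negative values
    mean the component grew after covariates were added.
    """
    if base_variance <= 0:
        raise ValueError("base variance must be positive")
    return (base_variance - full_variance) / base_variance


# Hypothetical variance estimates, not taken from Table 3.
print(f"{variance_explained(0.40, 0.10):.1%}")  # component shrinks in the full model
print(f"{variance_explained(0.50, 0.55):.1%}")  # component grows in the full model
```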
4.2. Research Question 2: Do Answers Provided by Respondents in Battery Questions Differ across the Location of Items in the Battery?
4.2.1. DK/REF answers
As shown at the top of Table 6, in the base model, there is significant question and respondent-level variance for final DK/REF answers (p < .0001), but the interviewer-level variance in DK/REF answers is not different from zero (likelihood ratio, χ² = 0.00, p = 1.00).
Table 6. Variance Parameter Estimates for DK/REF and Substantive Answers, Base Models with No Covariates and Full Models
Note: All items included in models. Variance tests conducted using both a likelihood ratio test and a mixture of chi-square distributions. Base models are estimated without any covariates. Full models correspond to Model 2. DK/REF = “don’t know”/refusal; B = battery; VPC = variance partition coefficient, or the proportion of the total variance associated with each level in the model.
*p < .05. **p < .01. ***p < .001. ****p < .0001.
Figure 5 shows the percentage of DK/REF responses at each item location by battery. For all items in all locations, the DK/REF rate does not exceed 4.2 percent, making a DK/REF answer a rare event (virtually all of these nonsubstantive responses are recorded as DK responses). There is a significant negative association between log(location) and DK/REF answers for B1, B2, and B3 (log[location] coefficient = −0.466, p < .0001), but a positive association for B4 (B4*log[location] coefficient = 0.801, p < .01; see again Table 4). That is, in the first three batteries, items presented later in the battery have lower rates of DK/REF responses, consistent with a learning hypothesis (Hypothesis 2a). In the last battery, B4, DK rates increase for items asked later in the battery, perhaps because this battery was difficult and came later in the questionnaire, triggering fatigue (Hypothesis 2b).
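To see how the main effect and the interaction combine, the following sketch adds the two quoted coefficients to obtain a net slope for B4 and converts slopes into odds ratios relative to the first item location. It assumes a natural-log transformation of location and ignores any other battery-by-location terms, which are not reported in the text. The function name `odds_ratio_vs_first` is hypothetical.

```python
import math

# Coefficients quoted in the text (log-odds scale).
LOG_LOCATION = -0.466        # log(location) main effect
B4_BY_LOG_LOCATION = 0.801   # B4 * log(location) interaction


def odds_ratio_vs_first(slope, location):
    """Odds of DK/REF at `location` relative to location 1, all else equal."""
    return math.exp(slope * math.log(location))


# Net slope for B4: main effect plus the B4 interaction (assumption of this sketch).
slope_b4 = LOG_LOCATION + B4_BY_LOG_LOCATION

for loc in (2, 5):
    print(loc,
          round(odds_ratio_vs_first(LOG_LOCATION, loc), 3),
          round(odds_ratio_vs_first(slope_b4, loc), 3))
```

Under these assumptions the net B4 slope is positive, which matches the direction of the increase in DK rates described for that battery.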

Figure 5. Percentage of “don’t know”/refusal responses at each item location based on four battery questions.
4.2.2. Substantive answers
Across the four batteries, there is virtually no interviewer-level variance in the substantive answers provided (VPC < 0.01 for all batteries; p < .05 for B1 only; Table 6, base models). However, there are significant item-level and respondent-level effects on responses (p < .0001).
The association between item location and providing an endorsement response varies across batteries (Figure 6). Endorsement (saying “do”) increases across items in B1 (coefficient = 0.095, p = .025) but decreases across items in B2 (coefficient = −0.134, p = .01) and B4 (coefficient = −0.198, p = .031), with yes responses indicating endorsement in B2 and B4. The increased rates of do responses in B1 may indicate increased acquiescence for later items (Hypothesis 2b), but we see less acquiescence on later items in the two batteries asking for a yes response. In sum, there is evidence that the order in which items are presented affects DK/REF and substantive responses, independent of the content of those items. The common recommendation for questionnaire designers to randomly rotate items spreads the “location error” across items, rather than concentrating it in a particular item.

Figure 6. Percentage of substantive responses by item location, based on four battery questions.
4.3. Research Question 3: Are the Answers Provided by Respondents in Battery Items Associated with Interviewer and Respondent Behaviors on Those Items?
4.3.1. DK/REF answers
Now we examine whether respondents provide different answers for different types of interviewer and respondent behaviors. Table 7 shows model coefficients from four models predicting final DK/REF answers across the four batteries.
Table 7. Cross-classified Logistic Regression Model Coefficients and Standard Errors Predicting Response of DK/REF (= 1) versus Substantive Response (= 0) with Interviewer and Respondent Behaviors across the Four Batteries
Note: DK/REF = “don’t know”/refusal; B = battery; AIC = Akaike information criterion; BIC = Bayesian information criterion.
*p < .05. **p < .01. ***p < .001. ****p < .0001.
First, initial question asking behavior, including either the question stem or the response options, is not associated with DK/REF answers, counter to Hypothesis 3a. Next, probing by repeating the question stem, probing by repeating the response options, and larger numbers of conversational turns are associated with higher rates of DK/REF reports, confirming Hypothesis 3b. For example, each additional conversational turn is associated with a 31 percent increase (Model 2: e^0.270 = 1.310, p < .0001) in the odds of providing DK/REF reports. In addition, probing using the question stem or the response options during the question is associated with increases of 138 percent and 498 percent, respectively, in the odds of DK/REF responses (Model 2: probe question stem, e^0.868 = 2.382, p < .0001; probe response options, e^1.789 = 5.983, p < .0001). As noted earlier, these effects may be somewhat circular: interviewers are trained to probe when an initial DK/REF is offered by the respondent, so probing adds conversational turns and itself occurs because of the initial DK/REF answer. When we include the total number of conversational turns, the association between probing by restating the question stem and DK/REF answers disappears, but the association with probing using the response options stays. Thus, even with additional probing by restating the response options and more conversation, the interaction still resolves in a final DK/REF response.
Finally, as hypothesized (Hypothesis 3c), when respondents ask for help with the response options or question stem, they are significantly (p < .01) less likely to provide a DK/REF answer than a substantive answer. For example, the odds of a DK/REF answer are about 53 percent smaller (Model 2: e^−0.745 = 0.475, p = .005) on items in which the respondent requests assistance than on those in which they do not. Thus, respondents who request assistance are trying to answer with a substantive response rather than a nonsubstantive response.
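The odds ratios and percentage changes quoted above follow directly from exponentiating the logistic regression coefficients. The short sketch below reproduces that arithmetic for the Model 2 coefficients cited in the text. The function names are hypothetical, and the dictionary labels are shorthand for the predictors named above.

```python
import math


def odds_ratio(coefficient):
    """Convert a logit coefficient to an odds ratio."""
    return math.exp(coefficient)


def percent_change_in_odds(coefficient):
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return (odds_ratio(coefficient) - 1.0) * 100.0


# Model 2 coefficients quoted in the text.
quoted = {
    "conversational turn": 0.270,
    "probe question stem": 0.868,
    "probe response options": 1.789,
    "respondent requests help": -0.745,
}
for label, b in quoted.items():
    print(f"{label}: OR = {odds_ratio(b):.3f}, "
          f"change = {percent_change_in_odds(b):+.1f}%")
```

A negative coefficient yields an odds ratio below 1, which is why the request-for-help coefficient is reported as a percentage decrease in the odds.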
The included set of covariates explained 44.2 percent of the question-level variation in DK/REF final answers and 23.7 percent of the respondent-level variation (see again Table 6).
4.3.2. Substantive answers
We now look at whether the behaviors during the interview affect the substantive answers (Hypothesis 3d). Because the four batteries are examined separately, the sample size is reduced in each model, especially for less frequent behaviors. However, understanding the risk for behaviors on these battery items to affect survey answers is important. Thus, for this analysis, we report p < .10 as statistically significant, and report the odds ratios to interpret the magnitude of the effect. Results are shown in Table 8.
Table 8. Coefficients from Cross-classified Logistic Regression Models Predicting Substantive Responses of “Do” (B1), “Yes” (B2), “Concerned” (B3), and “Yes” (B4), Excluding DK/REF Responses, with Interviewer and Respondent Behaviors
Note: B = battery; DK/REF = “don’t know”/refusal.
p < .10. *p < .05. **p < .01. ***p < .001. ****p < .0001.
First, simply restating the introductory text and/or the response options affected the endorsement rate in three of the four batteries (p < .07). In B1, the odds of endorsement increased by 49 percent when the interviewer restated the response options (Model 4: e^0.401 = 1.493, p = .06). In B2, the odds that respondents endorsed an item after the second item increased by 160 percent (Model 4: e^0.957 = 2.604, p = .048) when the interviewer included the question stem and by 85 percent when the interviewer included the response options (Model 4: e^0.615 = 1.850, p = .069). In B4, the odds of endorsement decreased by 37.5 percent (Model 4: e^−0.470 = 0.625, p = .057) when the interviewer included the response options.
Probing by repeating the response options is related to substantive answers, but substantive answers are not associated with either probing by repeating the question stem or respondent requests for the question to be repeated. In B2, when respondents are probed with the response options, the odds that they will answer that a given country will become a military threat more than double (Model 2: e^0.972 = 2.643, p < .0001). In B4, when respondents are probed using the response options, the odds that the respondent will indicate that the U.S. and Japanese governments should work together to give priority to resolving a particular policy item about North Korea are about 50 percent lower (Model 2: e^−0.698 = 0.498, p = .008). Both of these batteries have yes/no response options, although other characteristics of the batteries are quite different. Thus, probing by providing the response options is associated with differences in substantive responses, but requests by respondents for the question to be repeated seem to come equally from the positive and negative sides of the attitudinal domains.
The number of conversational turns is significantly associated with substantive responses in only one battery. In B3, more conversational turns are associated with lower rates of reports of being concerned about China (Model 2: coefficient = −0.099, p = .029).
Finally, we examine the proportion of the variance components at each level that were explained by the included covariates. Although there was no interviewer-level variance for B2, B3, and B4, 12.3 percent of the (small) interviewer-level variance in substantive responses was explained for B1. In B1 and B2, just over one third of the question-level variance was explained by the included covariates; this increased to over 80 percent of the question-level variance for B3 and B4. Finally, between 1.5 percent and 7.0 percent of the respondent-level variance was explained across the four batteries.
5. Discussion
This article examined the association between item location, interviewer and respondent behaviors, and answers provided to battery questions in a telephone survey. Because the items were randomized within batteries, we can disentangle the location of a battery item from its content. In general, the results for item location support a learning hypothesis. All of the interviewer–respondent behaviors occurred at higher rates on items earlier in the battery than on items later in the battery. In addition, respondents had higher rates of DK/REF answers to items presented early in the battery and lower rates of endorsement (less acquiescence), with one exception (one battery showed higher rates of endorsement for later items).
Importantly, interview behaviors are associated with respondent answers in battery questions. Items on which there is more interaction overall have higher rates of DK/REF responses but similar or slightly lower endorsement rates (on one battery). Additional conversational turns can be driven by inadequate responses (such as DK/REF) as standardized interviewers are trained to probe these answers. Yet even with this higher rate of interaction, the answers still resolve to a DK/REF rather than a substantive answer. Thus, it appears that respondents are not forgetting the question stem or response options to these questions but genuinely do not know (or do not want to provide) an answer. Interviewer reading of the question text or the response options when presenting the items was significantly associated with endorsement rates in three of the four batteries. For example, in the first battery about confidence in American institutions, confidence in federal government agencies increased from 38.2 percent when the introductory question stem was not included to 53.3 percent when it was included. Items on which interviewers probed using the response options had both higher DK/REF rates and different substantive responses (albeit with opposite directions across batteries). For example, when asked whether India will become a military threat to the United States, 12.2 percent of respondents who were not probed with the response options indicated that it would, compared to 56.3 percent who were probed using the response options. It is not clear why the interaction is associated with respondent answers, but it is clear that respondents with increased interaction with the interviewer provided different answers. Respondents who asked the interviewers to repeat the question stem or the response options were less likely to provide a DK/REF answer than those who did not ask the interviewers to repeat this information, but this action was not associated with substantive answers. This suggests that requests for help with the question indicate that the respondent is trying to provide a substantive report rather than having problems.
Although there is variation in the associations between the behaviors and the substantive responses, the takeaway is that both DK/REF and substantive responses were significantly associated with at least one of the interviewer behaviors—whether question reading, probing, or simply more conversational turns. We encourage future research on this topic to further unpack this result.
So, what does this mean for data quality? Battery items as they are currently administered allow a degree of discretion over reading the question stem and/or response options not seen in other types of questions. Reports to battery questions that respond to simple behaviors by the interviewer or are associated with behaviors by the respondent are likely to be less reliable and/or valid (Krosnick and Abelson 1992) than responses that are not affected by these behaviors. As noted earlier, battery items have lower reliability and validity than other types of items; some of this lack of reliability or validity could arise because of these types of interviewer behaviors. Unfortunately, we cannot directly test this hypothesis; we do not have reinterviews with the same respondents, so we cannot directly evaluate reliability of attitudes over time with these data (for examples, see Alwin 2007; Hout and Hastings 2016). We might anticipate seeing more changes in answers over time to battery questions where the interviewer probed respondent answers using the response options, suggesting weaker or less crystallized attitudes (Smith 1985). Alternatively, increased interaction through reading or repeating the question stem may have led to a more complete understanding of the question; thus reports to these items may be more “accurate” and thus more reliable. Future research on reliability of responses to battery questions over time would benefit from incorporating indicators of these behaviors.
Although this study was the first of its kind to examine the location of items in a battery and interviewer and respondent behaviors, we examine the behaviors largely in isolation, and not as part of a sequence of behaviors that unfold during the conversation for each item. The majority of items have a paradigmatic question-answer sequence (68.2 percent), with a sizable group having a paradigmatic question-answer-feedback sequence (15.1 percent). Across all of the items and across all of the batteries, 16.7 percent of the items have more than three conversational turns, indicating deviations from a paradigmatic interviewer–respondent interaction sequence. This interaction could be used to navigate the complexities of the battery items, but it could also be used to build rapport (Garbarski et al. 2016). A sequential examination of how interviewer discretion affects respondents’ behaviors is a useful step for future research.
Because the data come from an actual production survey, this article provides insights into how these items are actually administered. Although the items within each battery were randomized, the batteries themselves were not randomly presented within the survey. Thus, we cannot evaluate whether differences across the batteries occur because of the placement of the batteries or their content. Future experiments should rotate both items within batteries and batteries within the survey. In addition, only interviewers with 10 or more interviews were selected for behavior coding. This assists with model estimation, but it limits the inference to these interviewers. Furthermore, although the location of the items within the batteries is randomly assigned, the interviewer and respondent behaviors are not randomly assigned. We control for the theoretically selected effects of respondent age and education and interviewer experience, and a variety of measures of question complexity, but there could be other interviewer and respondent factors associated with both the behaviors and the survey responses that are contributing to these results. For instance, some interviewers may feel that rereading the question stem or response options or probing helps with building rapport, and thus a future study that has an interviewer-level measure of desire for rapport may provide insights (Garbarski et al. 2016; West and Blom 2017). Although this survey provided a useful case study, the batteries all used dichotomous response categories and had topics that may not be familiar for some respondents. Future work should examine batteries with ordinal scales and with other battery topics. We also focused primarily on interviewer behaviors here, with one respondent behavior related to requests for question information to be repeated. Future work should examine respondent behaviors in more detail, including how they may be associated with or trigger interviewer behaviors. Finally, this analysis was conducted on a dual-frame telephone survey; future research should examine how batteries operate in face-to-face surveys, the mode used in many of the existing evaluations of reliability and validity (Alwin 2007).
The implications for questionnaire design are clear. First, items within batteries should be randomized as interviewer and respondent behaviors and survey responses vary by location of the item. Although this is a common practice, it is not ubiquitous. Second, if keeping the question stem and response options in mind is important, questionnaire designers should write the questions as they want them to be asked, rather than relying on the interviewer’s judgment or discretion for restating the question stem or response options, keeping in mind the balance between information and repetitiveness. This study suggests that scripting these questions may, in some instances, shift overall distributions of responses. Yet the shift will correspond to an intentional decision by a survey designer and apply to all respondents equally rather than an ad hoc decision by an interviewer. Future studies should examine what happens to the interaction between interviewers and respondents and responses to survey questions when the question stem and/or response options are read for every question or scripted for a subset of item locations, keeping in mind the challenges of programming a computerized telephone instrument when the items themselves are randomized.
The implications for training and monitoring interviewers are less clear. Significant interviewer-level variance remains in initial question asking behaviors, suggesting that interviewers strongly deviate from standardization in their administration of battery items. If survey organizations want to minimize this variation without scripting question wording, interviewers could be more closely monitored on administering battery items. Monitoring data could be used to estimate these types of models during field operations, with the goal of minimizing interviewer variance. Interviewers with particularly large random interviewer effects could be targeted for retraining or intervention. In addition, standardized interviewing practice dictates that interviewers should probe DK responses, but probing using the response options was associated with substantive responses in two batteries. Developing effective interviewer training will require future insights into the specific cues provided by respondents during their interaction with the interviewer when providing uncodable responses that might resolve to a (valid) substantive response. In sum, this article was the first to study how battery questions are currently being administered in telephone surveys, including examining the roles of item location and interviewer–respondent behaviors. It provides us with an initial insight into potential causes of reduced reliability and validity in battery questions. Future research should continue to explore these questions.
Supplemental Material
Supplemental material, Battery_Tables_REV2_APPENDIX for Item Location, the Interviewer–Respondent Interaction, and Responses to Battery Questions in Telephone Surveys by Kristen Olson, Jolene D. Smyth and Beth Cochran in Sociological Methodology
Footnotes
Acknowledgements
An earlier version of this article was presented at the 2015 Annual Meeting of the Midwest Association for Public Opinion Research and the 2016 Annual Meeting of the American Association for Public Opinion Research conference. We thank participants at these meetings for helpful comments. We also thank four anonymous reviewers for comments that substantially improved the article. The data and code used in this article will be made available through the Inter-university Consortium for Political and Social Research by December 2018.
Funding
This work was supported by the National Science Foundation Grant No. SES-1132015. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.