Abstract
The Internet is used more and more to conduct surveys. However, moving from traditional modes of data collection to the Internet may threaten the comparability of the data if the mode has an impact on the way respondents answer. In previous research, Revilla and Saris (2012) find similar average quality (defined as the product of reliability and validity) for several survey questions when asked in a face-to-face interview and when asked online. But does this mean that the mode of data collection does not have an impact on the quality? Or may it be that for some respondents the quality is higher for Web surveys whereas for others it is lower, such that on average the quality for the complete sample is similar? Comparing the quality for different groups of respondents in a face-to-face and in a Web survey, no significant impact of the background characteristics, the mode and the interaction between them on the quality is found.
During the past few decades, the number of surveys implemented around the world has increased considerably. Whereas surveys were for a long time the relatively closed domain of a few scientists, nowadays many people are able to launch their own survey.
This democratisation of the survey practice has been accompanied by increasing concern about the representativeness and the quality of different surveys. Although many people are able to conduct surveys, not all of them can conduct a “good” survey. Many online surveys are anything but representative. Therefore, it is necessary to be careful about some of the claimed results (Saris, 2008).
However, using the Internet to conduct surveys is attractive since, in principle, it can be both quicker and cheaper than more traditional modes, even if in practice that is not always the case. High quality surveys such as the European Social Survey (ESS) have started to consider the possibility of switching from their current mode of data collection to Web surveys or to a mixed-mode approach including the Web. The mixed-mode approach has the advantage that the non-Internet users – who still represent a non-negligible part of the population – can participate via another mode. However, introducing the Internet may threaten the comparability of the data both across time and across groups (countries, if not all the countries adopt the same mode, or subpopulations that answer in different modes, if a mixed-mode approach is used in one country).
Because of both the attractiveness and the risks associated with Web surveys, a substantial literature has compared the Web to other modes of data collection. The comparisons focused mainly on the response rates and non-response (Fricker et al., 2005; Kaplowitz et al., 2004) and on satisficing and social desirability (Heerwegh, 2009; Kreuter et al., 2009) as indicators of quality. Satisficing and social desirability may be observed in all modes, but they are expected to vary because of the presence of the interviewer in some modes and not in others.
Nevertheless, low response rates are only a warning of potential troubles (Couper and Miller, 2009): they do not systematically correspond to low quality. Conversely, higher response rates imply neither higher representativeness, nor higher quality (Krosnick, 1999). The central question is whether higher response rates also mean less non-response bias (Voogt and Saris, 2005). Satisficing and social desirability are specific to certain kinds of questions and, as such, are not suited to measuring the quality for all topics.
On the contrary, following Saris and Andrews (1991), Scherpenzeel (1995, 2008) uses a measure of the quality (product of reliability and validity) that can work for all topics and, moreover, allows correcting for measurement errors. This is crucial because there are always errors in the measurement and if this is not taken into account, the conclusions drawn may be wrong. The presence of random errors can attenuate the observed correlations between variables. The presence of systematic errors can lead to overestimated observed correlations. Different groups can have different levels of both random and systematic errors, forbidding any direct comparison across groups. It is therefore useful to look at the quality, defined as the strength of the relationship between the latent variable of interest and the observed answer, to get an idea of the potential measurement error and if necessary correct for it.
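The two distortions described above can be made concrete with a minimal sketch. It assumes standardized variables and the true score model used later in the paper, with two questions that share one method; the function names and the numerical values are hypothetical and serve only as an illustration.

```python
import math


def expected_observed_corr(rho_true, r1, v1, r2, v2, m1=0.0, m2=0.0):
    """Correlation expected between two observed answers.

    rho_true: correlation between the two latent variables of interest.
    r, v, m:  reliability, validity and method effect coefficients.
    Assumes standardized variables and a method factor shared by both questions.
    """
    attenuated = r1 * v1 * rho_true * v2 * r2  # random error shrinks the true correlation
    shared_method = r1 * m1 * m2 * r2          # common method variance inflates it
    return attenuated + shared_method


def correct_for_measurement_error(rho_obs, r1, v1, r2, v2, m1=0.0, m2=0.0):
    """Invert the relation above to recover the latent correlation."""
    return (rho_obs - r1 * m1 * m2 * r2) / (r1 * v1 * v2 * r2)


# Hypothetical coefficients: r = v = 0.9 for both questions.
only_random = expected_observed_corr(0.5, 0.9, 0.9, 0.9, 0.9)
m = math.sqrt(1 - 0.9 ** 2)  # method effect implied by v = 0.9 when T is standardized
with_method = expected_observed_corr(0.1, 0.9, 0.9, 0.9, 0.9, m, m)
print(f"true 0.5, random error only: observed {only_random:.3f}")
print(f"true 0.1, shared method:     observed {with_method:.3f}")
```

With these made-up values, the first observed correlation falls below the true one (attenuation), while the second exceeds it because the shared method effect dominates.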
Defining quality in the same way, two papers (Revilla, 2010; Revilla and Saris, 2012) recently focused on the impact of the mode, or combination of modes, of data collection on the quality of answers to survey questions. The main result is that the quality is very similar in the face-to-face and the Web surveys compared. From that, Revilla concludes “that there is only a slight impact” on the quality when switching from a unimode to a mixed-mode design for the data analysed (Revilla, 2010: 163).
This conclusion may be a bit too optimistic: does the finding of a similar average quality in both modes really allow us to conclude that the mode of data collection does not have an impact on the quality?
What is true at the aggregate level is not necessarily true at the micro level. If the average quality of a sample of face-to-face respondents equals the average quality of a sample of Web respondents, does that mean that the quality of answers of respondent i remains the same if respondent i takes a face-to-face or a Web interview? An implicit assumption made by the authors is that the impact of the mode of data collection is the same for all respondents. But what if for some respondents the quality is higher in Web than in face-to-face interviews, whereas for others it is the contrary?
The goal of this paper is to test if the assumption of equal impact of the mode of data collection on all respondents does or does not hold. Investigating in each mode if differences are found between different kinds of respondents is a second topic of this paper. We focus on two modes: Web, because of its impressive growth during the past few decades and the huge possibilities it offers; and face-to-face, because it is still nowadays seen as the gold standard for survey research.
The “(In)equal Impact of the Mode of Data Collection” section discusses the assumption of equal impact of the mode on all respondents. Then, the “Hypotheses” section proposes a set of hypotheses. The “Method” section explains the model used to test these hypotheses, while the “Data” section gives information on the data. Finally, there are the “Results” section and the “Conclusion” section.
(In)equal Impact of the Mode of Data Collection on the Quality Depending on the Respondent Characteristics?
The assumption of equal impact of the mode on all the respondents is in line with a view of quality used for instance by Saris and Gallhofer (2007). In this view, the quality is considered to be a property of the questions per se. Therefore, the quality may be influenced by elements such as: use of battery or separate questions, number of response categories, use of labels, etc. The topic and the visual presentation of the question (horizontal versus vertical scales, use of images) are also considered as potentially influencing its quality (Dillman and Christian, 2005; Toepoel et al., 2005).
Nevertheless, one could argue that the quality depends not only on the question’s properties but also on how these properties are perceived by the respondents. The quality may therefore be seen as the result of an interaction between a question’s properties and the characteristics of the respondent. If an interviewer is present, a third aspect may even be considered.
Some research has already been done on the impact of respondent characteristics on the quality. For instance, Alwin and Krosnick (1991) use a simplex model to look at the impact of schooling and age on the psychometric concept of reliability¹ and find that “older respondents and those with less schooling provided the least reliable attitude reports” (from the abstract). Their results suggest that characteristics of the respondents may be an element to consider when studying quality. However, they only consider reliability and not the total quality ($q^2$); that is, the product of reliability ($r^2$) and validity ($v^2$). Besides, they do not take the mode of data collection into account.
A study by Andrews (1984) does consider the mode of data collection and separates validity from method effects and residual errors. Andrews concludes that “respondent characteristics were not a major predictor of variation in the quality of measurement in these data” (1984: 433). Nevertheless, some effects of age and education are found. Also, Andrews reports a very small effect due to the mode of data collection. But the comparison was between group-administered questionnaires, telephone and face-to-face interviews.
Following this idea, we wanted to see if the mode of data collection interacts with some characteristics of the respondent to determine the quality, such that for respondents with some characteristics, switching from a face-to-face to a Web survey would increase the quality of their answers, whereas for respondents with other characteristics, it would decrease. If this is the case, a similar average quality across samples interviewed with different modes does not imply that the mode has no impact on the quality. It may have a different impact on different groups.
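The argument that equal averages can hide opposite effects is simple arithmetic. The sketch below uses entirely hypothetical quality values and group shares (none of them come from the data analysed here) to show two samples with the same average quality but different group-level patterns.

```python
def average_quality(group_quality, group_share):
    """Sample-level average quality as a share-weighted mean of group qualities."""
    return sum(group_quality[g] * group_share[g] for g in group_quality)


# Hypothetical shares and qualities, chosen only to illustrate the point.
shares = {"younger": 0.5, "older": 0.5}
face_to_face = {"younger": 0.60, "older": 0.60}
web = {"younger": 0.70, "older": 0.50}  # higher for one group, lower for the other

avg_f2f = average_quality(face_to_face, shares)
avg_web = average_quality(web, shares)
print(f"average face-to-face: {avg_f2f:.2f}, average Web: {avg_web:.2f}")
for group in shares:
    print(f"{group}: change when switching to Web = {web[group] - face_to_face[group]:+.2f}")
```

In this constructed case the averages coincide although the switch raises the quality for one group and lowers it for the other, which is exactly the situation the aggregate comparison cannot detect.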
Why is it important to know if this happens? It is important because the correlations and the analyses based on correlations may be biased if differences in quality exist across respondents or for the same respondent across time. Different situations may be thought of where problems could appear due to that variation of quality. A few examples are presented below.
First, imagine that one wants to study time series using respondents who at time t-1 answered face-to-face and at time t answer online, and that, depending on their level of schooling, the quality increases for some respondents (the higher educated) when switching to the Internet whereas for others (the lower educated) it decreases. Then, when comparing the answers of one respondent at times t-1 and t, one would confound variations due to the mode with true variations in the respondent’s opinions.
Second, one can think about what could happen if one does a survey of a specific population: for example, it is quite usual, for practical reasons, to conduct surveys on a population of only students (Heerwegh and Loosveldt, 2009; Smyth et al., 2008). Then, even if the quality in different modes is similar for samples representative of the whole population, if different subpopulations produce different qualities when answering in different modes, studies focusing on these subpopulations may suffer from the switch in modes. It may be that using a face-to-face interview or a Web interview will not lead to the same quality for a student-based survey if students (because of their age or level of education) react differently to the different modes.
Finally, even using a population-based sample, if different modes are used for different respondents of the sample (mixed-mode survey) and if respondents with different backgrounds have the tendency to choose different modes, then it may be problematic to study relationships influenced by these background variables. For instance, if one wants to study in a mixed-mode survey, relationships influenced by age, and the quality varies in different modes for different age groups and these different age groups choose mainly different modes (for example, younger people choose the Web and older people face-to-face), the conclusions may be incorrect if no correction for variation in modes is done.
Hypotheses
First, we should mention that we focus on what we call “normal questions”, meaning questions that are neither very complex, nor very sensitive. These questions may have different characteristics that impact the quality. But for complex and sensitive questions, more differences in quality can be expected across modes.
In face-to-face interviews, the skills that the respondents need to answer normal questions are quite limited. They have to understand the question and give a response. However, the respondents only have to state their answer; they do not have to perform any manipulation such as checking a box: the interviewer does this for them. Therefore, the second part of the task, providing a response, is simplified.
The first part of the task, understanding the question, is also simplified in face-to-face interviews: indeed, if respondents have problems understanding a question, the interviewers can help them, explaining unknown terms or giving examples to illustrate and clarify the meaning of the question. Therefore, we do not expect large differences between different groups of respondents.
Nevertheless, the analyses of Alwin and Krosnick (1991) and Andrews (1984) suggest that age and education have some impact on the quality. Even for normal questions, the cognitive abilities of the respondents might affect the quality. Also, other factors, such as the capacity to concentrate, mental distractions or the motivation of the respondent, may lead to differences in quality: even if all respondents are ideally able to answer with a similar quality, in practice, some may not be motivated enough to provide a maximum effort. Some may be inattentive or may satisfice (Krosnick, 1999). Therefore, even if all respondents have the cognitive ability to reach the same level of quality, it may happen that some groups – low educated people – are more inclined to satisfice than others – high educated people – which would lead to a different quality for the same question for different groups of respondents. So following previous results, we assume that:
H1a: Elder respondents produce lower quality responses in face-to-face surveys than younger respondents.
H1b: Less educated respondents produce lower quality responses in face-to-face surveys than more educated respondents.
In Web surveys, there are two main aspects that differ and may play a role in determining the quality.
First, Web surveys are self-completed, so the respondents have to do the entire task by themselves. They need to be able to read and understand what the questions mean. They need to understand how to give an answer and how to go to the next question. They need to keep themselves motivated to continue the questionnaire and not skip items. Such surveys are therefore much more demanding.
Second, compared to other self-completed modes, Web surveys require the use of a computer² and the Internet. This has both advantages and disadvantages. On the one hand, the branching, for example, that may be quite burdensome for the respondents in paper-and-pencil surveys, can be done automatically in Web surveys. Automatic checks can also be incorporated into Web surveys to substitute for some of the checks an interviewer could make. Some extra help may also be added more easily to Web surveys than to paper questionnaires; for instance, adding links opening windows with extra definitions. All these possibilities make the Web closer to a face-to-face interview than a paper questionnaire. On the other hand, Web surveys require more skills than paper-and-pencil questionnaires since the respondents have to be able to use a computer and the Internet.
How can these aspects of the Web surveys interact with respondent characteristics? Some authors defending the idea that a “digital divide” exists (Rhodes et al., 2003) argue that Web surveys incite more men and young people to participate, and on the contrary discourage women and older people. Besides this potential difference in participation, we want to see if once they have agreed to participate, we obtain differences in the quality of the answers of such subpopulations.
In Europe, we believe that nowadays women and men are on average able to understand normal questions without the help of an interviewer and have all on average reached the minimum degree of computer and Internet familiarity required to answer a Web survey.
However, we assume that the eldest respondents are not in general familiar enough with the Internet, such that for them completing Web surveys creates an additional burden and leads to more measurement error. So we expect the differences in quality between elder and younger respondents to be higher in Web surveys than in face-to-face ones.
Another variable of interest is the respondent’s education. Because of the self-completed aspect of Web surveys, we assume that the quality will be lower for the Web than for face-to-face interviews for respondents with a lower level of education, since the absence of an interviewer makes the task more difficult. At the same time, because people can choose the moment of the interview and can complete it at their home, we assume that the quality will be higher in a Web survey for people with higher level of education. Concerning the use of the computer and Internet, it can be seen as an extra burden for the respondents with lower levels of education. On the contrary, since it allows extra checks or the use of more friendly designs, it can improve the quality for higher educated respondents by lowering random error or increasing motivation. So to summarize, we propose the following hypotheses:
H2a: Women and men produce similar levels of quality in Web surveys (and a fortiori in face-to-face surveys).
H2b: The difference in quality between elder and younger respondents (with lower quality for the elder) is higher in Web than in face-to-face surveys.
H2c: The difference in quality between less and more educated respondents (with lower quality for the lower educated) is higher in Web than in face-to-face surveys.
Putting together these hypotheses and the fact that previous research does not find relevant differences in the average quality of face-to-face and Web surveys, it appears that an increase in one group should be compensated by a decrease in another, so we formulate one final set of hypotheses:
H3a: When switching from face-to-face to Web, the quality increases for the younger respondents and decreases for the elder.
H3b: When switching from face-to-face to Web, the quality increases for the higher educated respondents and decreases for the lower educated.
The hypotheses could be made more precise: for instance, topics of more interest to the respondents may lead to higher quality. The complexity of the question may also have an impact: for very basic questions, there is little reason to think that the quality depends on respondent characteristics. Nevertheless, it seems reasonable that mainly in self-completed modes, when the questions get more complicated, differences appear. The degree of social desirability could play a role too: if different education groups for instance attribute different levels of sensitivity to the same questions, then the level of socially desirable answers may vary across groups, leading to more variation in the quality estimates for questions seen as differently sensitive by the different groups. But as mentioned earlier, the paper focuses on questions that are neither very complex nor very sensitive.
Method
Getting the Quality Estimates
The multitrait-multimethod (MTMM) approach consists in repeating several questions (called “traits”) with several “methods” (Campbell and Fiske, 1959). To avoid random variations due to the sample, the repetitions should be asked to the same respondents. To avoid possible changes in true opinions or attitudes, they should be asked in a short period of time, preferably in the same questionnaire to guarantee that the respondents cannot communicate with other persons who could make them change their mind. However, if people are asked the same question several times in a very short period of time, this may lead to a memory effect: respondents do not process the question the second time but instead remember what they answered and repeat it, adapting the answer to the scale if necessary.
Van Meurs and Saris (1990) show that after 20 minutes of similar questions, respondents usually do not remember their answer anymore. Therefore, the different methods should be presented to the respondents with an interval of at least 20 minutes to avoid memory effects. Since at least three methods are necessary for identifying the model, long questionnaires are required. This can increase the cognitive burden of the respondents and may also not always be possible in practice because of costs or time constraints.
That is why Saris et al. (2004) propose the split-ballot multitrait-multimethod (SB-MTMM) approach, which combines the MTMM with a split-ballot (SB) approach, meaning that respondents are randomly assigned to different groups, each group getting a different combination of only two methods.
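The random assignment at the heart of the split-ballot design can be sketched as follows. This is a generic three-group illustration, not the actual ESS or LISS allocation: the function name, the method labels and the choice of using every pair of methods are assumptions made for the example.

```python
import random
from collections import Counter
from itertools import combinations


def assign_split_ballot(respondent_ids, methods=("M1", "M2", "M3"), seed=0):
    """Randomly assign each respondent to a group that receives two methods.

    Hypothetical design: one split-ballot group per pair of methods,
    with group sizes kept as equal as possible.
    """
    rng = random.Random(seed)
    pairs = list(combinations(methods, 2))  # (M1, M2), (M1, M3), (M2, M3)
    ids = list(respondent_ids)
    rng.shuffle(ids)                        # randomisation step
    return {rid: pairs[i % len(pairs)] for i, rid in enumerate(ids)}


assignment = assign_split_ballot(range(9))
print(Counter(assignment.values()))
```

Each respondent answers every trait with only two methods, which shortens the questionnaire, while across the groups every pair of methods is still observed.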
The true score model proposed by Saris and Andrews (1991) is used. In this model, it is assumed that there is a “true score” $T_{ij}$, which is a function of the $i$th trait $F_i$ (with a coefficient equal to the validity coefficient $v_{ij}$) and of the $j$th method $M_j$ (with a coefficient equal to the method effect $m_{ij}$). Then, the observed variable corresponding to the $i$th trait and the $j$th method ($Y_{ij}$) is expressed as a linear function of the true score $T_{ij}$. The slope corresponds to the reliability coefficient $r_{ij}$, and the intercept to the random error component $e_{ij}$ associated with the measurement of $Y_{ij}$. As a starting point, we assume that the traits are correlated with each other, but the methods are neither correlated with each other, nor with the traits, and the error terms are neither correlated with each other, nor with any of the independent variables.
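Written out, the verbal description above corresponds to two equations. The form below is a reconstruction from that description; the equation numbering is an assumption.

```latex
\begin{align}
T_{ij} &= v_{ij}\,F_i + m_{ij}\,M_j \tag{1}\\
Y_{ij} &= r_{ij}\,T_{ij} + e_{ij} \tag{2}
\end{align}
```

Here $F_i$ is the $i$th trait, $M_j$ the $j$th method, $T_{ij}$ the true score, $Y_{ij}$ the observed answer and $e_{ij}$ the random error component.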
This model allows us to separate systematic error (due to method effects) from random error and to estimate reliability and validity coefficients. The square of the product of these coefficients is the total quality. This total quality for the $i$th trait and the $j$th method is denoted $q_{ij}^2 = r_{ij}^2 \cdot v_{ij}^2$.
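As a rough check of what the total quality measures, the sketch below simulates the true score model for one trait and one method and compares the squared correlation between the latent trait and the observed answer with the product of squared coefficients. It assumes standardized, normally distributed variables, so that the method effect is $\sqrt{1 - v^2}$ and the error standard deviation is $\sqrt{1 - r^2}$; the coefficient values and function names are hypothetical.

```python
import math
import random


def total_quality(r, v):
    """Total quality q^2 = r^2 * v^2 for given reliability and validity coefficients."""
    return (r * v) ** 2


def simulate_quality(r, v, n=20000, seed=1):
    """Squared correlation between trait F and observed Y in simulated data."""
    rng = random.Random(seed)
    m = math.sqrt(1 - v ** 2)         # method effect under standardization (assumption)
    error_sd = math.sqrt(1 - r ** 2)  # random error sd under standardization (assumption)
    f_vals, y_vals = [], []
    for _ in range(n):
        f = rng.gauss(0, 1)           # latent trait
        method = rng.gauss(0, 1)      # method factor, uncorrelated with the trait
        t = v * f + m * method        # true score
        y = r * t + error_sd * rng.gauss(0, 1)  # observed answer
        f_vals.append(f)
        y_vals.append(y)
    mean_f = sum(f_vals) / n
    mean_y = sum(y_vals) / n
    cov = sum((a - mean_f) * (b - mean_y) for a, b in zip(f_vals, y_vals))
    var_f = sum((a - mean_f) ** 2 for a in f_vals)
    var_y = sum((b - mean_y) ** 2 for b in y_vals)
    return (cov / math.sqrt(var_f * var_y)) ** 2


print(f"q^2 from coefficients: {total_quality(0.9, 0.95):.3f}")
print(f"q^2 from simulation:   {simulate_quality(0.9, 0.95):.3f}")
```

The two numbers should be close, which illustrates the reading of quality as the strength of the relationship between the latent variable of interest and the observed answer.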
The maximum likelihood estimation for multiple group³ analyses of LISREL (Jöreskog and Sörbom, 1991) is used to estimate the model. The model is estimated separately for different gender groups, age groups and level of education groups. The basic model constrains the parameters to be invariant across all groups. The model is tested each time using JRule (Van der Veld et al., 2009), software based on the procedure developed by Saris et al. (2009) that allows testing for misspecifications at the parameter level, taking both type I and type II errors into account.
The model is corrected (mainly by releasing constraints of invariance across groups or adding an extra correlation between two similar methods) until we get an acceptable model according to the JRule test for misspecifications. A list of the modifications made to the initial model is available online.⁴
Using the Estimates to Test our Hypotheses
Since we consider different experiments, with each time several traits and methods, in two surveys and for different background groups, quite a lot of quality estimates are obtained. A table presenting the average quality for the different traits for each method and group can be found in Appendix I, a document available from the author and distributed by BMS-RC33 list.
Since it is difficult to draw conclusions directly from these estimates, to test our hypotheses and look at the impact of several potential influences on the quality, we run regressions with the quality estimates as a dependent variable.
We cannot run a unique regression with everything because it is the same data that is analyzed when cutting the sample into gender, age and education groups (dependence of the estimates), so we run one regression for each cutting variable.
As independent variables, we first include only the cutting variable (one dummy for men in the first one, one dummy for the elder respondents in the second one, and two dummies, one for low and one for high level of education, in the third one⁵), the mode of data collection (dummy for Web), and the interaction between the cutting variable and the mode. Thus, we obtain the three equations below. From now on, we refer to this first set of equations as “Reg1”.
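Based on the verbal description of the independent variables, the three Reg1 equations can be written as follows. This is a reconstruction: the coefficient symbols, the variable names and the equation numbers (chosen to match the later reference to equations 3 to 5) are assumptions.

```latex
\begin{align}
\text{Quality} &= \beta_0 + \beta_1\,\text{Men} + \beta_2\,\text{Web}
  + \beta_3\,(\text{Men} \times \text{Web}) + \varepsilon \tag{3}\\
\text{Quality} &= \beta_0 + \beta_1\,\text{Elder} + \beta_2\,\text{Web}
  + \beta_3\,(\text{Elder} \times \text{Web}) + \varepsilon \tag{4}\\
\text{Quality} &= \beta_0 + \beta_1\,\text{Low} + \beta_2\,\text{High} + \beta_3\,\text{Web}
  + \beta_4\,(\text{Low} \times \text{Web}) + \beta_5\,(\text{High} \times \text{Web}) + \varepsilon \tag{5}
\end{align}
```

In each equation the interaction coefficients capture whether the effect of the mode differs between background groups, which is what hypotheses H2b, H2c, H3a and H3b are about.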
In the second set of regressions (“Reg2” from now on), we add to equations 3 to 5 some independent variables that have been shown to have an impact on quality. They include the topic of the questions (dummy for each experiment), and three variables about the characteristics of the methods: the number of response categories (numerical), the number of fixed reference points⁶ (numerical) and the kind of scales (dummy “IS” equal to one if the scale is Item Specific,⁷ 0 otherwise). See for example Saris and Gallhofer (2007) for more details (definitions of these terms, effects on the quality, etc.).
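To make the coding of the Reg2 regressors concrete, here is a minimal sketch that builds one row of the design matrix for a single quality estimate. The function and column names are hypothetical; the topic list follows the six experiments named in the paper, with social trust as the reference category, and the example values are made up.

```python
# Topic dummies; "social trust" is the reference category and gets no dummy.
TOPIC_DUMMIES = [
    "media",
    "satisfaction",
    "political trust",
    "political orientation",
    "left-right positioning",
]


def reg2_row(group_dummy, web, topic, n_categories, n_fixed_ref, item_specific):
    """One design-matrix row for a Reg2-type regression (illustrative coding only)."""
    row = {
        "intercept": 1,
        "group": group_dummy,              # e.g. 1 for men, 0 for women
        "web": web,                        # 1 for Web, 0 for face-to-face
        "group_x_web": group_dummy * web,  # interaction between group and mode
    }
    for name in TOPIC_DUMMIES:
        row[f"topic_{name}"] = int(topic == name)
    row["n_categories"] = n_categories     # number of response categories
    row["fixedref"] = n_fixed_ref          # number of fixed reference points
    row["IS"] = int(item_specific)         # 1 if the scale is item specific
    return row


# Hypothetical example: a man answering on the Web, media experiment.
example = reg2_row(1, 1, "media", n_categories=11, n_fixed_ref=2, item_specific=True)
print(example)
```

Stacking one such row per quality estimate gives the matrix of independent variables; any ordinary least squares routine can then be applied to it.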
Data
European Social Survey (ESS) and Longitudinal Internet Studies for the Social Sciences (LISS) Panel
The data needed for our analyses has to have several characteristics. First, it is necessary to have repetitions of several questions in one survey for the same respondents in order to use the true score model. But all the characteristics of the question varying from one mode to the other can cause differences in the quality that could be confounded with mode effects. To avoid this potential source of difference, we should have the exact same wording for questions and answer categories in the different modes.
Such datasets are not so common but the ESS round 4 (2008/2009) and one questionnaire completed in December 2008 by LISS panel respondents can be used, since similar SB-MTMM experiments are included in both datasets. The ESS has been conducted in 25 to 30 European countries every two years since 2002. The interview is conducted at the respondent’s home.⁸ The LISS panel is a Dutch online panel based on a probability sample. Respondents who agree to participate are provided with a computer and Internet access if they do not already have them.⁹ Both samples are quite similar in terms of gender, age and education distributions (see for instance Revilla and Saris, 2010).
These datasets present some limits. First, the LISS panel is a Dutch panel only, so for comparison, we cannot use all the ESS data, but focus only on the Netherlands. Second, since the LISS respondents are members of a Web panel, they all have at least some minimal level of computer skills. It would be preferable to have respondents who have never used the Internet answer the Web survey since it is for such respondents that we expect the highest differences in quality.
However, these limits are not as problematic as they may seem. First, the Netherlands has high Internet coverage and, at the same time, has experienced a large decrease in face-to-face response rates, so it would be a good candidate for a switch in data collection approaches in the near future. Even if not representative of all European countries, it presents many common characteristics with the Nordic countries in its Internet coverage and response rates.
Second, the method of recruitment of the LISS panel members is such that even people without previous computer and Internet access are integrated into the panel. Since they are offered questionnaires every month, even if they had no experience at the beginning, they become a bit more practised each time. Nevertheless, looking at the question about the frequency of use of the Internet, we see that 7.37 percent of the LISS respondents still use the Internet only once a month or less. So there is still a non-negligible part of the LISS respondents who may have a very limited level of computer skills. However, because of the LISS survey split-ballot design, for a given SB group in a given experiment, there are too few respondents using the Internet once a month or less to directly test the impact of the frequency of Internet use on data quality (Appendix II, a document available from the author and distributed by BMS-RC33 list).
Choice of the Variables
Once the dataset is chosen, we do not have much freedom to select the first set of variables, which is the set for which we are going to compute the quality. Indeed, the surveys contain only six MTMM experiments. Table 1 gives, for each one, details about the traits ($t_i$) and methods ($M_i$) for which the comparison between the LISS and the ESS could be made.
Table 1. Traits and methods for each of the 6 MTMM experiments
Ideally, each experiment would contain three traits and each of the traits would be repeated using three methods. This is the case for the experiments about media, satisfaction and political trust. However, in the experiments about political orientation, social trust and left-right positioning, one or two of the traits are only measured with $M_2$ and $M_3$ (but not with $M_1$): these traits are used for the estimation, but are not considered when looking at the results. Besides, for political orientation and left-right positioning, the third method varies between the LISS and the ESS: in these experiments, the questions asked using $M_3$ are therefore not considered in the Results section.
The second set of variables consists of the variables used to make the splits. According to our hypotheses, we need variables to measure gender, age and education. Since these variables are used to split the samples into different groups for which the quality is computed, the variables cannot be continuous or even have a large number of categories. Because we think that the age difference lies mainly between the oldest respondents and the others, we cut the sample into two subgroups. However, to get a sufficient number of observations in each group, we fixed the cutting age at 60, even if it would have been better to cut at a more advanced age (Appendix II, a document available from the author and distributed by BMS-RC33 list). Concerning education, we separated “low” (lower secondary or less), “middle” (upper secondary and post-secondary non-tertiary) and “high” (first and second stages of tertiary) levels of education. We made three categories to see the effects both of a low and a high education, and to see if the effect is progressive or if the opposition is between low, on the one hand, and middle and high, on the other (what we expect), or between low and middle, on the one hand, and high, on the other.
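The splits described above amount to a simple recoding of the background variables. The sketch below is illustrative only: the function names and category labels are hypothetical, and treating a respondent aged exactly 60 as “younger” is an assumption, since the text only gives 60 as the cutting age.

```python
def age_group(age, cut=60):
    """Two age groups around the cutting age (respondents aged exactly `cut`
    are placed in the younger group here, which is an assumption)."""
    return "elder" if age > cut else "younger"


# Education categories as described in the text, mapped to three levels.
EDUCATION_LEVELS = {
    "lower secondary or less": "low",
    "upper secondary": "middle",
    "post-secondary non-tertiary": "middle",
    "first stage of tertiary": "high",
    "second stage of tertiary": "high",
}


def education_group(category):
    """Recode a detailed education category into low, middle or high."""
    return EDUCATION_LEVELS[category]


# Hypothetical respondents: (age, education category).
respondents = [(34, "first stage of tertiary"), (72, "lower secondary or less")]
for age, education in respondents:
    print(age_group(age), education_group(education))
```

Keeping the number of categories small is what makes a separate multiple-group estimation per category feasible with the available sample sizes.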
Results
Results for Gender (H2a)
Table 2 gives the results of the regressions with the quality for the different gender groups as dependent variable. The table also gives the regression coefficients when disaggregating the quality into reliability and validity coefficients, but only for the regressions with all the explanatory variables. The traits are treated separately for all these analyses. This allows us to have more observations: 156 for the regressions of gender and age, and 234 for the regression for education (because we split the data into more groups for education).
Table 2. Estimates from different regression models for gender
Note: *significant at the 10 percent level; **significant at the 5 percent level.
IS = item specific; Fixedref = number of fixed reference points;
qual = quality; rel = reliability; val = validity.
Social trust is used as the reference category (experiment with the smallest differences).
Table 2 indicates that there is no significant impact of gender or of the interaction between gender and mode, whether we consider the quality or the reliability and validity coefficients separately. In “Reg1”, where only the variables of main interest for us are included, no significant effects are found at all, and the $R^2$ is almost zero. However, when the topic and some question characteristics are included as independent variables, the $R^2$ increases considerably. The same is true for the regressions on validity and reliability separately. We have to be careful about the meaning of the $R^2$ and the tests of significance because they depend on the number of observations, which is quite low in our analyses. So we should look at the size of the estimates too: for gender and for the mode, they are all really small. So overall, the results seem to support H2a.
Results for Age (H1a, H2b, H3a)
Table 3 is similar to Table 2, but provides the results for age.
Estimates from different regression models for age
Table 3 shows that in the regressions for data quality, and also in those for reliability and validity, the coefficient is significant neither for age nor for the interaction between age and mode. This holds both when only a few independent variables are included (Reg1) and when we control for the topic and some question characteristics (Reg2). All the estimates for the variables of interest are close to zero. Only the topic and the question characteristics have significant effects. Therefore, neither H1a nor H2b is supported.
Besides, Table 3 shows that the mode does not have a significant impact on quality, reliability or validity coefficients, and we have already stated that the interaction between age and mode is not significant, so H3a is also not supported.
Results for Education (H1b, H2c, H3b)
The same information is displayed for the education analyses in Table 4.
Estimates from different regression models for education
In Table 4, we see no significant impact of education, nor of the interaction between education and mode. This is true both when quality is the dependent variable and when the reliability and validity coefficients are used instead. H1b and H2c are therefore rejected. As for H3a, the results also suggest that H3b does not hold.
Summary
In sum, the signs of the coefficients in the regressions (“Reg 2” in Tables 2, 3 and 4) for respondents over 60 (negative), low-educated respondents (negative) and Web (positive) seem to support some of our hypotheses. In fact, however, all these estimates are rather small, and none of the variables we are interested in (gender, age, education, mode of data collection and the interaction between the first three and the mode) has a significant effect on the quality. Therefore, we can conclude that, in the data analysed, there is no significant effect on the quality of using a Web survey instead of a face-to-face interview, of being a man instead of a woman, of being above 60 instead of under 60, or of having a low or a high education instead of a middle one. The picture is similar when reliability and validity coefficients are considered separately.
By contrast, almost all the other explanatory variables (topics, item specific, number of answer categories and number of fixed reference points) have significant effects. Moreover, the size of these effects is sometimes quite large: for left-right, it is around .20 in the three regressions. It therefore seems that the main determinants of data quality are the properties of the questions.
Conclusion
Building on previous results comparing the quality in different modes of data collection, this paper goes one step further by challenging the implicit assumption that the impact of the mode is the same for all respondents, regardless of their characteristics. The fact that the average quality is similar in face-to-face and Web surveys is not sufficient to conclude that the mode has no impact on the quality of answers to survey questions. One reason is that quality may be higher in Web surveys for some groups of respondents and lower for others, leading to the same average. To examine this idea, different hypotheses were proposed and tested.
The analyses compare one face-to-face survey, the ESS round 4, with its specificities (the use of show-cards is an important one), to one Web survey completed by the LISS respondents, also with its specificities (a probability-based panel). They show no significant impact of the mode of data collection on quality, no impact of gender, age or education, and no impact of the interaction between the mode and these background variables. Therefore, hypothesis H2a (no differences between men and women in both modes) seems to be supported by our results, whereas the other hypotheses are not: H1a (lower quality for older respondents in face-to-face), H1b (lower quality for low-educated respondents in face-to-face), H2b (highest difference in quality between age groups with the Web), H2c (highest difference in quality between education groups with the Web), and H3a and H3b (quality increases when switching from face-to-face to Web for younger and higher-educated respondents, and decreases for older and low-educated respondents).
This suggests that the implicit assumption made in Revilla and Saris (2012) and Revilla (2010) was valid: at least for the different gender, age and education groups tested, the analyses do not show significant differences in quality for the two modes. This is an encouraging finding: it means that switching from one mode to the other can be done (if done “properly”) without disturbing the comparison of correlations between observed variables for these different groups. It also means that it is not necessary, if we are interested in the quality and in standardised relationships, to correct for differences in background between samples since this has no effect.
However, it could be argued that the nature of the data used for the Web survey is problematic. Because the LISS respondents are members of a panel, the part of the population with the lowest computer skills is missing from our data. This is one limitation of the study. However, datasets in which different traits are repeatedly measured with different methods within the same survey, which is what allows quality to be estimated in the way we defined it, are rare, and this leaves little choice. Besides, there seems to be a trend in different European countries towards the creation of Web panels, and we think that if Web surveys are going to be used in the future for high quality surveys, it will probably be via Web panels. In that sense, our results are closer to what the future situation might be. It is nevertheless important to note that Web panels may be very different from each other: our study is based on a probability-based Web panel, and its results cannot be generalised to the vast majority of Web surveys conducted nowadays with opt-in panels, but only to other probability-based Web panels that make major efforts to obtain a representative sample and high quality data.
Footnotes
Acknowledgements and Funding
I am very grateful to Willem Saris and Peter Lynn for all their very helpful remarks and suggestions on previous drafts of this paper. This research received no specific grant from any funding agency in the public, commercial, or non-profit sectors.
