Abstract
Background and Objectives:
In order to assess website content effectiveness (WCE), investigations have to be made into whether the reception of website contents leads to a change in the characteristics of website visitors or not. Because randomized controlled trials (RCTs) are not always the method of choice, researchers may have to follow other strategies such as using retrospective pretest methodology (RPM), a straightforward and easy-to-implement tool for estimating intervention effects. This article aims to introduce RPM in the context of website evaluation and test its viability under experimental conditions.
Method:
Building on the idea that RCTs deliver unbiased estimates of the true causal effects of website content reception, I compared the performance of RPM with that of an RCT within the same study. Hence, if RPM provides effect estimates similar to those of the RCT, it can be considered a viable tool for assessing the effectiveness of the website content features under study.
Results and Conclusions:
RPM was capable of delivering comparatively resilient estimates of the effects of a YouTube video and a text feature on knowledge and attitudes. With regard to all of the outcome variables considered, the differences between the sizes of the effects estimated by the RCT and RPM were not significant. Additionally, RPM delivered relatively accurate effect size estimates in most of the cases. Therefore, I conclude that RPM could be a viable alternative for assessing WCE in cases where RCTs are not the preferred method.
Introduction
With the provision of web contents, different kinds of actor aim to achieve different kinds of objective. For example, websites are an instrument with which businesses market products or services (e.g., Dann & Dann, 2011; Hernández, Jiménez, & Martín, 2009); they are employed for public relations by political and governmental institutions and organizations (e.g., Gibson & Ward, 2000; Norris & Curtice, 2006); and they are used to educate students, specific target groups, or the broad public in formal and informal settings (e.g., Cox, 2013; Owston, 1997).
Generally, producing website contents costs time and money. To justify the resources invested and to find out whether the intended goals are achieved, the effectiveness of website contents needs to be evaluated. However, merely investigating presumed key drivers of website content effectiveness (WCE), such as website usability (e.g., Hallahan, 2001; Van der Merwe & Bekker, 2003), content quality (e.g., Mich, Franch, & Gaio, 2003; Zhang & von Dran, 2002), or more objective indicators like the number of clicks or the duration of website visits (e.g., Bauer & Scharl, 2000; Das & Turkoglu, 2009), is not sufficient for assessing it accurately. This is because the term effectiveness of website contents implies that exposure to website contents actually has effects on the visitor.
Consequently, investigative work needs to be done into whether there is a change in the characteristics of website visitors that can be attributed uniquely to their exposure to website contents. In order to achieve this objective, randomized controlled trials (RCTs) are the most effective strategy for obtaining unbiased estimates of treatment effects. Therefore, they are frequently applied for assessing WCE (e.g., Braddy, Meade, Michael, & Fleenor, 2009; Klass & Crothers, 2000). Particularly, in basic research, web-based RCTs are considered to be reasonable alternatives to laboratory experiments (e.g., Reips & Krantz, 2010).
Yet in evaluation practice, RCTs may not always be the preferred method for assessing WCE. For example, intervention planners use website contents as treatments for specific target groups, such as people with certain medical conditions or other groups sharing specific characteristics (e.g., Ritterband et al., 2009; Salonen, Pridham, Brown, & Kaunonen, 2014; Ybarra & Bull, 2007), and they are interested in the effects on these target audiences. Although RCTs can be conducted in such contexts when evaluators are involved in planning an intervention at an early stage, in practice they can often only start evaluating after the intervention has already ended (Meyer, 2011), which renders random assignment impossible. Admittedly, here too, one could assess WCE by conducting RCTs with an additional group of participants (e.g., recruited on various websites or via crowdsourcing) and by generalizing findings to the original target group. Yet this is only reasonable when the additional group is taken from a population fairly similar to the one the original target group was sampled from, which is difficult to achieve in some situations. If the additional group is sampled from a dissimilar population, treatment effects observed in this group may differ from those one would have observed in the original target group, and generalizing results may lead to flawed inferences.
Another obstacle to conducting RCTs is that they can be costly and time consuming (Rossi, Lipsey, & Freeman, 2004). Compared with less rigorous methods for assessing WCE, the higher costs of RCTs usually result from the greater effort required for study planning, organizing, and programming. This may be particularly problematic in evaluation practice, where evaluators occasionally have to assess WCE within small-scale evaluation projects with tight budgets. Moreover, conducting RCTs in the web context requires certain technical skills. For example, if website providers are interested in the effects of contents presented on their websites on visitors who voluntarily visit those websites, rather simple methods such as cross-sectional online surveys can be used by embedding survey announcements for recruiting and surveying visitors on the website. Yet when applying RCTs, evaluators need considerably more programming skills, for example, in HTML or Perl (e.g., Fraley, 2004). Thus, technically less experienced evaluators may have difficulties in conducting RCTs, particularly when working under pressure of cost and time.
Because of all these practical issues, it is reasonable to test the viability of alternatives to RCTs for assessing WCE, such as retrospective pretest methodology (RPM). This method provides evaluators with a straightforward tool for assessing WCE, has benefits as regards the costs of data collection, and can easily be implemented in situations where randomized control groups are not available. Recent studies have shown that RPM may be a viable option for evaluating interventions in areas such as education (Cantrell, 2003; Coulter, 2012; Moore & Tananis, 2009; Nielsen, 2011), homeland security (Pelfrey & Pelfrey, 2009), parenting (Hill & Betz, 2005; Pratt, McGuigan, & Katzev, 2000), or health-related quality-of-life research (Kvam, Wisløff, & Fayers, 2010; Zhang et al., 2012). In order to test whether RPM might also work in the web context, this article presents the methods and results of an empirical study in which I applied RPM for evaluating WCE and compared its performance with that of a simultaneously conducted RCT. Moreover, I discuss the results of the study, its limitations, and implications for future research. First, however, I give a brief introduction to RPM.
RPM
As with traditional pretest–posttest designs, the basic idea of RPM is that intervention effects can be assessed by comparing participants’ level in terms of an outcome variable Y before and after participation. However, in contrast to traditional pretests, retrospective pretests are not administered until the intervention has been terminated. When completing a retrospective pretest, participants are asked to think back to their level in terms of Y prior to the intervention. The application of RPM for assessing WCE thus requires website visitors to assess the level in terms of Y that they had before they received the website information but to do so after already having received that information.
The fact that the preintervention state of a respondent is not assessed until the intervention is terminated has several advantages. First, an obvious advantage of RPM is that it can be used when participants can only be interviewed after the intervention, which renders traditional pretests impossible. Second, because researchers only have to collect data at one point in time from treated persons only, RPM saves effort and costs connected with data collection (Pratt et al., 2000). Moreover, the single administration of current and retrospective ratings may be more satisfying to participants (Lamb, 2005). The third advantage of RPM is that it prevents response shift bias (Bray & Howard, 1980; Howard, 1980; Howard & Dailey, 1979), which may be present in traditional pretest–posttest analyses. One important facet of response shift bias in evaluation practice is “defined as a program-produced [i.e., treatment-produced] change in the participants’ understanding of the construct being measured” (Pratt et al., 2000, p. 342). By giving respondents the opportunity to rate their past and current states in terms of an outcome variable Y at the same time, RPM prevents biased ratings based on different frames of reference. Fourth, another advantage of applying RPM is that it prevents distortions due to between-subject variation in variables like sex, race, education, or intelligence. As is the case with any method capitalizing on within-subject comparison, all time-invariant characteristics of participants are kept constant when calculating individual and average treatment effects.
The application of RPM is also associated with some weaknesses. First, respondents may have problems in correctly remembering their preintervention state in terms of the outcome variable Y. Such recall bias may become greater as the length of recall time increases (Hill & Betz, 2005), for example, because of distortive factors like telescoping (Tourangeau & Bradburn, 2010). However, when using RPM for estimating the effects of rather short-term website content features, problems related to remembering should only play a secondary role. Second, social desirability and implicit theories of change seem much more problematic in the context of assessing WCE. Effect estimates may thus be confounded by the “pervasive tendency of individuals to present themselves in the most favorable manner relative to prevailing social norms and mores” (King & Bruner, 2000, p. 80) and by the fact that “people who expect to change are likely to report that they have changed” (Hill & Betz, 2005, p. 505). Third, when using RPM, there is the risk that the pretest memory is a product of the treatment and is thus biased by the cause whose effects are purportedly being estimated. If this is the case, retrospectively estimated pretest values do not represent accurate estimates of respondents’ preintervention state, and treatment effects estimated as differences between current and retrospective ratings are confounded. Finally, Hill and Betz (2005) list some other factors such as emotion-related biases and effort justification bias that could distort effect estimates within the application of RPM.
As compared with effects estimated by traditional pretest–posttest designs, biases such as implicit theories of change, social desirability, or effort justification bias predominantly lead to the overestimation of treatment effects (Hill & Betz, 2005; Taylor, Russ-Eft, & Taylor, 2009). When effort justification bias is present, participants provide inflated change scores in order to justify the time and effort they spent participating in the intervention. In the presence of implicit theories of change, participants tend to express their assumption that the intervention must have had the intended effects, regardless of whether there has actually been any treatment effect at all. Finally, social desirability may lead to inflated effect estimates, either because participants are inclined to overstate effects in order to reflect positively on themselves or because they want to impress those who conducted the intervention.
Research Question and Hypotheses
The basic question behind this research is whether RPM is a viable option for evaluating the effectiveness of website content features in cases where RCTs are not the evaluators’ method of choice. Thus, with this study, I try to find evidence that could answer the question of whether RPM provides unbiased estimates of the effects of website content features.
In the empirical study presented subsequently, I therefore investigated the effects of two different website content features (an online video and an online text feature informing consumers about aspects of disability insurance) estimated by RPM on three different outcome variables denoted as “objective knowledge,” “subjective knowledge,” and “topic-related attitudes.” In general, I expect that applying RPM will lead to unbiased estimates of the effects of the video and the text feature on all of these variables when compared to an experimental benchmark result. More precisely, I tested six hypotheses (Hypothesis 1–Hypothesis 6), claiming that RPM provides unbiased estimates of the effect of the video on objective knowledge (Hypothesis 1), subjective knowledge (Hypothesis 3), and topic-related attitudes (Hypothesis 5), and of the effect of the text feature on objective knowledge (Hypothesis 2), subjective knowledge (Hypothesis 4), and topic-related attitudes (Hypothesis 6).
Method
In order to test the proposed hypotheses, I compared the performance of RPM in estimating the effects of the two website content features with that of a simultaneously conducted RCT. The basic idea behind this procedure is that RCTs deliver unbiased estimates of the true causal effects of the exposure to website contents. Hence, if RPM provided effect estimates similar to those of the RCT, there would be evidence supporting my research hypotheses.
Treatments
Both website content features provided information about the importance of having disability insurance in case of occupational disability in Germany and aimed to enhance people’s knowledge of disability insurance and influence attitudes as regards this type of insurance. The first feature was a YouTube video of 270 s in length. I chose this short video because its length was presumed to be typical of many videos on websites on the Internet (“Average Web Video Size Triples,” 2011). In the German-language video, the importance of disability insurance is illustrated using the example of an animated character named Klaus. Besides catchy pictures and statistical charts about the risk of becoming occupationally disabled, there is a narrator who explains important aspects of the topic. As the second content feature, I used a slightly modified transcript (632 words) of the narrator’s text from the above-mentioned YouTube video. On average, it took the participants about 170 s to read the text.
Data Collection
Participants were recruited from the pool of the German crowdsourcing enterprise WorkHub. WorkHub is an online contract labor portal whose members voluntarily carry out predefined tasks. For completing a task (such as an online questionnaire), pool members receive a preassigned payment. Because of self-selection, the recruited study participants did not constitute a representative sample of the population, that is, they participated either with the intention of earning money or because they were interested in the topic of disability insurance or both. Nevertheless, if random sampling from the population is not feasible, interviewing crowdsourcing members is a reasonable alternative to collecting data from university students because the reliability of such data is not presumed to be inferior to that of data obtained from students (Behrend, Sharek, Meade, & Wiebe, 2011). For participation, each participant received an incentive of about €1.70 (about US$2.30).
In order to establish an unbiased benchmark for the RPM effect estimates and preclude selection bias, participants were randomly assigned to three conditions (video feature, text feature, and control group). None of the subjects were aware of the purpose of the study. They were only informed that the study was concerned with the issue of occupational disability.
Before completing the online questionnaires, constructed ad hoc, members of the first group watched the online video, members of the second group read the text online, and members of the control group did not receive any information about the topic at all. After having read the instructions and—except for the control group—received the information from the content features, members of all the groups were requested to (1) answer six questions that were part of a performance-based measure, capturing knowledge about occupational disability and disability insurance, (2) assess three statements capturing their subjectively perceived knowledge about occupational disability and disability insurance, and (3) rate 6 items measuring their attitudes toward three facets of disability insurance and becoming occupationally disabled. For the sake of readability, I refer to these constructs in the sections that follow as (1) objective knowledge, (2) subjective knowledge, and (3) topic-related attitudes. In order to preclude item order effects, item orders were randomized within the respective blocks (Bishop, 2008). All the questionnaires were issued in German.
Additionally, members of the video and text groups completed all the items again, but they did so under the assumption of not yet having watched the video or read the text. They were instructed as follows: Please forget everything you have seen/read in this survey about occupational disability and disability insurance. Think back to the situation before you watched the YouTube video/read the text on disability insurance. Try to remember how much you already knew about the topic and how your attitudes were constituted before you watched the video/read the text. Please now complete the questions and ratings that you have already completed again but be aware of the fact that you have to answer these questions as you would have answered them before you participated in this survey.
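The two randomization steps described in this section (random assignment to the three conditions and randomized item order within blocks) can be illustrated with a minimal sketch. The source does not report how either step was implemented, so simple randomization is an assumption here, and all item labels, function names, and the seed are hypothetical.

```python
import random

CONDITIONS = ("video", "text", "control")

# Hypothetical item labels; only the block sizes (6, 3, and 6 items) come from the text.
BLOCKS = {
    "objective_knowledge": [f"quiz_{i}" for i in range(1, 7)],
    "subjective_knowledge": [f"subj_{i}" for i in range(1, 4)],
    "topic_related_attitudes": [f"att_{i}" for i in range(1, 7)],
}


def assign_condition(rng):
    """Simple randomization (assumed): each condition is equally likely."""
    return rng.choice(CONDITIONS)


def randomized_item_order(blocks, rng):
    """Shuffle items within each block while keeping the block sequence fixed."""
    order = []
    for items in blocks.values():
        shuffled = list(items)  # copy so the original block is not modified
        rng.shuffle(shuffled)
        order.extend(shuffled)
    return order


rng = random.Random(2024)  # fixed seed only to make the sketch reproducible
condition = assign_condition(rng)
item_order = randomized_item_order(BLOCKS, rng)
```

Simple randomization of this kind does not force equal group sizes, which would be compatible with the slightly unequal groups reported below, although the source does not confirm the procedure.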
Sample Characteristics
Table 1 describes the distributions of four sociodemographic characteristics. In total, 208 respondents were recruited for participation in the study: 68 were assigned to the YouTube video group, 72 to the text group, and 68 to the control group. No information about the distribution of characteristics in the population of crowdsourcing pool members was available, which is why the degree of similarity between the sample and the target population could not be determined.
Table 1. Distributions of Demographic and Socioeconomic Characteristics.
Note. Haupt-/Realschule = German certificate of completion of compulsory basic secondary schooling and German intermediate school certificate; Abitur = German school examination approximately equivalent to the American Scholastic Assessment Test exam.
Measures
The first outcome variable of interest (objective knowledge) was the total score from a quiz consisting of six questions. This performance-based variable reflects an objective level of knowledge about occupational disability and disability insurance. Each question provided respondents with three alternative answers, of which only one was correct. The score was calculated by adding up the number of correct answers. Consequently, each respondent could achieve a maximum score of six.
The second outcome variable (subjective knowledge) was a 3-item scale reflecting self-reported, that is, subjectively rated knowledge about disability insurance. This information was gathered using 7-point rating-scale items, from 1 (disagree absolutely) to 7 (agree absolutely). I calculated the total scale score by adding up the single ratings of each person and dividing the sum by 3. The result then equaled the average score of the 3 items. The correlation between the two constructs objective knowledge and subjective knowledge was r = .50 for the current ratings (p < .001, two-tailed test) and r = .48 (p < .001, two-tailed test) for the retrospective ratings, both correlations indicating strong relationships when interpreted in terms of effect sizes. These numbers clearly indicate that respondents scoring highly in the quiz also tended to rate their subjective knowledge as high.
Finally, the third outcome variable (topic-related attitudes) consisted of 6 items that measured attitudes with regard to the importance of having disability insurance, the risk of becoming occupationally disabled, and the threat of having to take an inappropriate job. Each of these facets was assessed by 2 of the 6 items. Here too, 7-point rating-scale items from 1 (disagree absolutely) to 7 (agree absolutely) were used to record respondents’ assessments of the statements presented. I calculated the total scale score by adding up the six ratings for each individual and dividing the sum by 6. The average quiz and scale scores, the respective standard deviations, Cronbach’s α for subjective knowledge and topic-related attitudes, and the average variance extracted for these two constructs are presented in Table 2. The wordings of all items can be found in the appendix.
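The scoring rules for the three outcome variables, together with Cronbach’s α, can be sketched as follows. This is an illustrative sketch, not the code used in the study: the function names, the answer key, and all example data are hypothetical, and α is computed with the standard variance-based formula, which is assumed to correspond to the coefficient reported in Table 2.

```python
from statistics import variance


def quiz_score(answers, answer_key):
    """Objective knowledge: number of correct answers (0 to 6 for six questions)."""
    return sum(given == correct for given, correct in zip(answers, answer_key))


def scale_score(ratings):
    """Subjective knowledge or attitudes: mean of the 7-point item ratings."""
    return sum(ratings) / len(ratings)


def cronbach_alpha(rows):
    """Cronbach's alpha; rows is a list of respondents, each a list of k item ratings."""
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical example data (not taken from the study).
answer_key = ["b", "a", "c", "a", "b", "c"]
answers = ["b", "a", "a", "a", "b", "b"]
objective_knowledge = quiz_score(answers, answer_key)
subjective_knowledge = scale_score([5, 6, 4])
alpha = cronbach_alpha([[2, 3, 2], [5, 4, 5], [6, 7, 6], [3, 3, 4]])
```

The same `scale_score` function covers both rating scales because the 3-item and 6-item scores are each the mean of the respective item ratings.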
Table 2. Descriptive Statistics of Outcome Variables.
Note. AVE = average variance extracted; Max = maximum; Min = minimum. Score of objective knowledge represents number of correct answers to six quiz questions. Scores of subjective knowledge and topic-related attitudes represent the average scale scores from 7-point rating scales (1 = disagree absolutely; 7 = agree absolutely).
Data Analysis
In the first step, I estimated effect sizes (Cohen’s d) for the treatment effects of the content features estimated by the RCT, denoted as d (RCT), and RPM, denoted as d (RPM). Effect sizes d (RCT) were estimated on the basis of differences in mean score values between the treatment groups and the control group. I used Stata’s esize command for calculating effect sizes for independent samples in this case. Effect sizes d (RPM) were estimated on the basis of differences in mean score values between participants’ current and retrospective ratings in the video and text groups, respectively. Here, I also used Stata’s esize command and thus the original standard deviations of the current and retrospective ratings instead of the standard deviations of the difference scores. I did this because using standard deviations of the difference scores would have led to an overestimation of effect sizes (Dunlap, Cortina, Vaslow, & Burke, 1996).
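The two effect size estimates can be illustrated with a short Python sketch. The study itself used Stata’s esize command; the code below is a stand-in that assumes the usual pooled-standard-deviation form of Cohen’s d for two independent samples. All function names and data are hypothetical. The last function shows, for contrast only, the variant based on the standard deviation of the difference scores that the text describes as leading to overestimation.

```python
from math import sqrt
from statistics import mean, stdev, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two samples, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def d_rct(treated_current, control_current):
    """d (RCT): treatment group versus control group, current ratings."""
    return cohens_d(treated_current, control_current)


def d_rpm(treated_current, treated_retrospective):
    """d (RPM): current versus retrospective ratings, original SDs of both."""
    return cohens_d(treated_current, treated_retrospective)


def d_from_difference_sd(treated_current, treated_retrospective):
    """Contrast case: mean change divided by the SD of the difference scores."""
    diffs = [c - r for c, r in zip(treated_current, treated_retrospective)]
    return mean(diffs) / stdev(diffs)


# Hypothetical ratings (not taken from the study).
current = [5, 6, 4, 6, 5]
retrospective = [3, 5, 2, 4, 4]
control = [3, 4, 2, 4, 3]

effect_rct = d_rct(current, control)
effect_rpm = d_rpm(current, retrospective)
effect_diff_sd = d_from_difference_sd(current, retrospective)
```

In this made-up example the current and retrospective ratings are strongly correlated, so the difference-score variant yields a larger value than `d_rpm`, which is the direction of distortion the text refers to.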
In the second step, I calculated the differences between the sizes of the effects of one treatment estimated by the RCT and RPM, denoted as Δ(d), by subtracting the size of the experimental effect from that of the treatment effect estimated by RPM. Thus, a positive value of Δ(d) stands for an overestimation of the estimated true effect by RPM, while a negative Δ(d) stands for an underestimation. The greater |Δ(d)|, the less viable RPM is for the estimation of causal effects of the treatments under study.
Finally, I investigated whether the differences Δ(d) significantly differed from zero. Because standard errors of Δ(d) could not be derived analytically, I followed M. Wood (2005) and used bootstrap resampling to establish confidence intervals (CIs) as a basis for statistical inference. For each of the dependent variables, I drew 100,000 random samples from the original sample. All the random samples were of the same size as the original sample and were drawn with replacement. I then reestimated all the effect sizes and the respective differences Δ(d) for each of the 100,000 samples. From the resulting sample distributions of Δ(d), I was able to derive percentile CIs.
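The bootstrap procedure for Δ(d) can be sketched as follows. This is a minimal illustration under stated assumptions, not the original analysis code: participants are resampled with replacement from the pooled sample of one treatment group and the control group, Cohen’s d uses the pooled standard deviation, degenerate resamples are skipped, and the data are simulated. The default of 2,000 resamples is chosen only to keep the sketch fast; the study used 100,000.

```python
import random
from math import sqrt
from statistics import StatisticsError, mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two samples, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def delta_d(rows):
    """Delta(d) = d(RPM) - d(RCT); rows hold (group, current, retrospective)."""
    treated_current = [cur for grp, cur, _ in rows if grp == "treated"]
    treated_retro = [ret for grp, _, ret in rows if grp == "treated"]
    control_current = [cur for grp, cur, _ in rows if grp == "control"]
    return (cohens_d(treated_current, treated_retro)
            - cohens_d(treated_current, control_current))


def bootstrap_percentile_ci(rows, n_boot=2000, level=0.95, seed=1):
    """Percentile CI of Delta(d) from resamples of the same size as the data."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(rows, k=len(rows))  # drawn with replacement
        try:
            estimates.append(delta_d(resample))
        except (StatisticsError, ZeroDivisionError):
            continue  # assumption: skip resamples with an empty or constant group
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * len(estimates))]
    upper = estimates[int((1 + level) / 2 * len(estimates)) - 1]
    return lower, upper


# Simulated data (hypothetical): 40 treated and 40 control participants.
data_rng = random.Random(0)
rows = []
for _ in range(40):
    retro = data_rng.gauss(3.0, 1.0)
    rows.append(("treated", retro + data_rng.gauss(1.0, 0.8), retro))
for _ in range(40):
    rows.append(("control", data_rng.gauss(3.0, 1.0), None))

point_estimate = delta_d(rows)
ci_low, ci_high = bootstrap_percentile_ci(rows)
differs_from_zero = not (ci_low <= 0.0 <= ci_high)
```

Under this scheme, Δ(d) is treated as significantly different from zero at the 5% level only if the 95% interval excludes zero. Resampling within each group separately would be an alternative design that keeps group sizes fixed; the text does not say which variant was used.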
Results
First of all, receiving the information from either the YouTube video or the text led to positive changes in objective knowledge, subjective knowledge, and topic-related attitudes, regardless of whether the RCT or RPM was used for estimation. The results of the comparison of the estimated effect sizes of RPM with those of the RCT are presented in Table 3. As regards the sizes of the single effects estimated by the RCT and RPM, results indicate that the video and the text feature induced strong effects on objective and subjective knowledge, no matter by what method effect sizes were estimated. In the case of topic-related attitudes, RCT estimates show effects of moderate strength with regard to the video and low strength with regard to the text feature, whereas RPM estimates show moderate effects with regard to both content features.
Table 3. Comparison of Effects Estimated by RCT and RPM.
Note. d(RCT) = Effect sizes of effects estimated by randomized controlled trial. d(RPM) = Effect sizes of effects estimated by retrospective pretest methodology. CI = 95% percentile confidence intervals of Δ(d) based upon bootstrapped sample distributions (100,000 iterations).
When it comes to the differences, Δ(d), between the sizes of the effects estimated by the RCT and RPM, results show that using RPM led to a moderate overestimation of the effect of the video on objective knowledge and a moderate underestimation of the effect of the video on subjective knowledge. Moreover, RPM led to a slight underestimation of the effect of the text on subjective knowledge and a slight overestimation of the effect of the text on topic-related attitudes. The remaining two values of Δ(d), representing the differences in effect sizes with respect to the effect of the text on objective knowledge and that of the video on topic-related attitudes, are very small and do not indicate any meaningful over- or underestimation of the experimental effects by RPM.
The results reported so far indicate moderate and slight over- as well as underestimations of the true effects by RPM. Yet it is not clear whether these differences are significant or not. In order to investigate this in more detail, a closer look at the bootstrap simulations conducted is necessary. The results in Table 3 show that all the estimated CIs span the zero value. Thus, none of the observed Δ(d) differ significantly from zero with α = 5%.
Discussion of Results
In the section that follows, I will focus on the differences Δ(d) because these are the relevant key measures for assessing whether my research hypotheses find empirical support or not.
First, the absence of significant differences between effect sizes estimated by the RCT and RPM supports Hypothesis 1–Hypothesis 6. Having said that, relying solely on the significance of Δ(d) would not be a fair test of RPM performance because the widths of the CIs depend on sample size, which was not especially large in my study. Therefore, it is reasonable to also consider the values of Δ(d) when assessing the performance of RPM.
To begin with, results show that RPM moderately overestimated the effect of the video on objective knowledge, Δ(d) = 0.50. This effect inflation might be explained by the circumstance that treatment group members’ retrospective ratings did not take into account the fact that they might have guessed a correct answer by chance before treatment exposure. By contrast, the probability that control group members would provide a correct answer by guessing was exactly one third, which means that some of the subjects gave correct answers only by chance. Yet this interpretation is contradicted by the fact that RPM did not lead to an overestimation of the effect of the text on objective knowledge, as one would expect in the presence of this kind of bias. Furthermore, effort justification bias or implicit theories of change (Hill & Betz, 2005) are potential reasons for overestimation by RPM in the case of the effect of the video. Put differently, members of the video group may have provided inflated effect estimates, either because they wanted to justify the effort they put into participating in the study or because they simply believed that the intervention must have had the intended effect, regardless of whether there had actually been any effect at all.
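The guessing argument above can be made concrete with a small calculation. The sketch assumes the limiting case of a respondent who guesses on all six questions, each with three answer options; this is an illustration of the chance level, not a claim about how participants actually answered. The names are hypothetical.

```python
from math import comb

N_QUESTIONS = 6      # quiz length reported in the Measures section
P_GUESS = 1 / 3      # three answer options, one of them correct


def prob_k_correct_by_guessing(k, n=N_QUESTIONS, p=P_GUESS):
    """Binomial probability of exactly k correct answers under pure guessing."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Expected quiz score of a respondent who guesses every answer.
expected_score_by_guessing = N_QUESTIONS * P_GUESS

# Probability of each possible score from 0 to 6 under pure guessing.
score_distribution = [prob_k_correct_by_guessing(k) for k in range(N_QUESTIONS + 1)]
```

Under pure guessing the expected score is two of six points, so a control group with little prior knowledge would still score above zero, whereas retrospective ratings that ignore lucky guesses could fall below that chance level.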
The reason for the greater value of |Δ(d)| in the case of the effect of the video on objective knowledge may also have to do with the nature of the treatments. For example, because watching the video took longer than reading the text, there may have been greater recall bias in the video group and it may have distorted the retrospective ratings. The greater value of Δ(d) may also have occurred because of different levels of respondents’ liking for the two content features. Members of the video group received an intricately designed and animated video, which most people probably like, whereas the text group received rather unattractive plain text. As a result, the video group may have provided inflated RPM estimates because of their greater liking for the video feature. Yet if such emotion-related bias had effectively inflated RPM estimates of the effect of the video on objective knowledge, one would expect a similar overestimation as regards the effects of the video on subjective knowledge. This is not the case, however, because RPM underestimates the effects of the video feature on subjective knowledge and it does so even more strongly than it underestimates those of the text feature.
There still remains the question of whether multiple choice tests such as the one used in this study are suited to the application of RPM at all. Besides the threat of inflated RPM estimates due to failing to account for random guesses in retrospective answering, a different potential bias when using multiple choice tests might work in the opposite direction. Respondents’ retrospective answers may be biased because they confuse what they already knew about the topic before the treatment with what they learned during treatment exposure. Put differently, subjects might retrospectively believe that they already knew the answers to questions, although in fact they did not until they received the treatment. Following this argumentation, applying RPM would lead to an underestimation of the true treatment effects. Unfortunately, the evidence presented in this study is not sufficient for determining how exactly bias might affect retrospective ratings of objective knowledge tests. Therefore, this is an issue that should be addressed in more detail in future studies.
As regards the effects of the content features on subjective knowledge, RPM shows a moderate underestimation of the experimental effect in the case of the video feature, Δ(d) = −0.50, and a slight underestimation in that of the text feature, Δ(d) = −0.20. A reason for these deviations from RCT estimates may be found in the potential underestimation of subjective knowledge ratings in the control group, for example, because subjects in this group did not know the contents presented in the video and the text and may thus have thought that there was much more to learn about the topic than was actually presented by the content features. Another explanation may be that respondents in the treatment groups retrospectively overestimated their knowledge level prior to treatment exposure, for example, because they confounded what they knew about occupational disability before treatment exposure with what they learned about the topic while watching the video or reading the text. Regardless of what determined the deviations of RPM estimates from RCT estimates in this case, results indicate that RPM worked differently when estimating effects on subjective knowledge and estimating effects on objective knowledge. Because the current study does not provide evidence about what exactly may have caused these differences, answering this question should also be an issue in future studies.
When it comes to the effects on topic-related attitudes, results in Table 3 show that RPM—on average—delivered more accurate estimates than it did when effects on knowledge were estimated. One reason for this may have to do with the different natures of knowledge and topic-related attitudes. While the level of knowledge is comparatively easy to raise by providing study participants with new information, changing attitudes is more complex (W. Wood, 2000) and thus more difficult. This circumstance is reflected by the results of this study, which indicate that the effects of the content features on both knowledge measures were considerably stronger than they were on attitudes. Consequently, since attitudes were affected less by the contents provided, the fact that control group members did not know those contents—and may thus have thought that there was more to learn than was actually presented during treatment exposure—was probably less of a problem when comparing estimates of effects on attitudes by the RCT and RPM than it was when effects on subjective knowledge were compared. Moreover, in contrast to estimating effects on objective knowledge by RPM, estimating effects on attitudes was not confronted with potential bias due to respondents’ nonaccounting for correct answers they might have given by chance in a real pretest situation.
Finally, a closer look at Table 3 reveals that the overall performance of RPM was better when the text feature was used as a treatment. Compared to an average absolute deviation of |Δ(d)| = 0.35 in the case of the video feature, the average absolute deviation was only half as large with |Δ(d)| = 0.17 in the case of the text feature. Unfortunately, the results presented do not indicate why this difference exists. The reasons may lie in the higher degree of emotionality of the video and/or its longer duration.
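The summary measure used in this comparison can be illustrated with a minimal sketch. It assumes that Δ(d) is the RPM effect size minus the RCT effect size, a sign convention inferred from the fact that negative values are described above as underestimations by RPM. The function names and all input values are hypothetical and serve illustration only; they are not the estimates reported in Table 3.

```python
from statistics import mean


def delta_d(d_rpm, d_rct):
    """Effect size difference; negative values mean RPM underestimates the RCT effect.

    Assumed convention: delta = d_RPM - d_RCT.
    """
    return d_rpm - d_rct


def mean_abs_delta(deltas):
    """Average absolute deviation |delta(d)| across outcome variables."""
    return mean(abs(delta) for delta in deltas)


# HYPOTHETICAL (d_RPM, d_RCT) pairs for one content feature.
# These numbers are invented for illustration and do not come from the study.
hypothetical_effects = {
    "objective knowledge": (1.20, 1.00),
    "subjective knowledge": (0.50, 1.00),
    "attitudes": (0.30, 0.20),
}

deltas = {
    outcome: delta_d(d_rpm, d_rct)
    for outcome, (d_rpm, d_rct) in hypothetical_effects.items()
}
summary = mean_abs_delta(deltas.values())

for outcome, delta in deltas.items():
    print(f"{outcome}: delta(d) = {delta:+.2f}")
print(f"average |delta(d)| = {summary:.2f}")
```

Taking absolute values before averaging prevents over- and underestimations from cancelling each other out, which is why the average absolute deviation rather than the plain mean of Δ(d) is used to compare the two content features.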
To sum up, the absolute values of effect size differences Δ(d) support Hypotheses 2, 4, 5, and 6 because there were only minute or small differences between the effect size estimates of RPM and the RCT. With regard to Hypotheses 1 and 3, the empirical support is a little weaker because of the moderate differences between the effect sizes estimated by the two methods. Having said that, it seems that typical biases of the kind that usually lead to inflated effect estimates, such as effort justification bias or implicit theories of change, did not play a major role in this study. Moreover, given the unsystematic variation in over- and underestimations of the RCT effect estimates by RPM and the fact that none of the differences Δ(d) were significant, a considerable share of the observed Δ(d) presumably reflects random error.
Limitations and Implications for Future Research
There are some important limitations in the study that need to be discussed and these call for further research efforts. The first has to do with the restricted external validity of the results. Because participants received a monetary incentive, there may be self-selection bias in the study. Thus, the sample may not be representative of the general population or the actual audience of the website content features employed. In order to draw more generalizable inferences, researchers should accurately define the populations of interest and randomly sample study participants from those populations.
Moreover, the study took place under a specific set of conditions, which makes it difficult to generalize the findings to research settings with different conditions. For example, respondents had to recall their pretest values over only a short interval because the video and text features presented were short. Thus, they may not be comparable with longer content features employed in the context of some informational campaigns. Consequently, the results cannot simply be transferred to studies involving longer recall periods because recall bias may increase with length of recall time (Hill & Betz, 2005). Therefore, future research should assess whether RPM leads to unbiased estimates of WCE when using content features of longer duration.
Furthermore, the treatment topic is probably one about which few respondents actually care, so their recollection of pretreatment values is unlikely to have been influenced by what they saw or read. However, this may be different with regard to topics that are more relevant to study participants, because their recollection of pretreatment values may be affected by cognitive or affective reactions to what they see, hear, or read during treatment exposure. Consequently, generalizing to interventions that are more central to respondents’ motivations or interests is difficult. Future research could address this issue by choosing more relevant treatments and testing whether RPM still delivers unbiased effect estimates. Having said that, this would require researchers to know which treatments are in fact relevant to study participants before the study starts.
In addition to the limitations presented so far, there are some issues with regard to the questions and items used for measuring the dependent variables. To begin with, the whole questionnaire was constructed ad hoc. This means that all of the items were developed with the purpose of measuring the outcome variables relevant to this study. I did this because I could not identify any existing scales suitable for measuring the constructs of interest. Due to external constraints, I did not have the opportunity to conduct a comprehensive pretest either. In order to prevent the use of potentially improper instruments, future research should try to employ pretested and validated scales for measuring the outcomes of interest.
Moreover, the measurement of topic-related attitudes may have been affected by respondents’ psychological reactance to the content features (Brehm, 1966). More specifically, respondents may have felt pressured by the content features’ argumentation in favor of taking out disability insurance. As a consequence, attitudes contrary to those presented in the content features may have been strengthened, and respondents may thus have provided biased ratings on the attitude scale. In order to reduce potential reactance, future studies could work with content features that are less persuasive as regards their central message.
Social desirability may have played a role in measuring both subjective knowledge and attitudes. Yet the comparatively high correlation between the current objective and subjective knowledge measures (r = .50) works against social desirability as a source of bias in measuring subjective knowledge because respondents rated their subjective knowledge in accordance with their performance in the multiple choice test. Thus, subjects whose performance in the multiple choice test was poor or average did not pretend to know much about the topic on the subjective knowledge scale, as one would expect them to do in the presence of social desirability. Moreover, previous research has shown that social desirability is more likely to affect self-reports in interview situations involving personal contact between the interviewer and the interviewee (Heerwegh, 2009; Kreuter, Presser, & Tourangeau, 2008). Thus, I assume that social desirability only played a minor role in this study, particularly as regards measuring subjective knowledge. Nevertheless, in order to deal with this issue appropriately, researchers should try to detect bias by using social desirability scales or by applying methods capable of preventing social desirability bias (Nederhof, 1985).
Finally, as regards question order, completing objective measures before self-reports may have affected respondents’ retrospective self-ratings. Hoogstraten (1985), for example, found that having subjects complete a performance test prior to treatment exposure led to a lower value of retrospectively rated subjective knowledge after the intervention. As regards my study, it seems that subjects adjusted their retrospective ratings of subjective knowledge to the retrospective completion of the objective measure, which is indicated by the high correlation (r = .48) between these measures. This means that subjects who retrospectively stated that they would have answered the multiple choice items incorrectly prior to treatment exposure also tended to rate their perceived subjective knowledge as being lower before treatment exposure. In order to investigate such a potential calibration effect in more detail, future studies should establish an additional group in which respondents only rate subjective knowledge but do not complete an objective measure. If this is not feasible, randomizing the order of constructs might also help to create a basis for investigating this effect.
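The suggestion to randomize the order of constructs can be sketched as follows. This is only one possible implementation under stated assumptions: the construct labels, the function name `construct_order`, and the seeding scheme are hypothetical and were not part of the study. Each respondent receives a reproducible random ordering, so the order actually shown can be recorded and later used as a factor when analyzing a potential calibration effect.

```python
import random

# Hypothetical labels for the three retrospective question blocks.
CONSTRUCTS = ["objective_knowledge", "subjective_knowledge", "attitudes"]


def construct_order(respondent_id, study_seed=0):
    """Return a reproducible random ordering of the constructs for one respondent.

    Seeding with the study seed and respondent ID (an assumed scheme) means the
    same respondent always gets the same order, which allows the order to be
    reconstructed during analysis.
    """
    rng = random.Random(f"{study_seed}-{respondent_id}")
    order = list(CONSTRUCTS)  # copy so the module-level list is not modified
    rng.shuffle(order)
    return order


# Example: assign orders to a few hypothetical respondents.
for respondent_id in range(1, 4):
    print(respondent_id, construct_order(respondent_id))
```

Comparing retrospective subjective knowledge ratings between respondents who saw the objective measure first and those who saw it later would then provide a basis for examining the calibration effect.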
Conclusion
The research reported in this article was devoted to investigating the viability of RPM for assessing the effectiveness of website content features. The results of a comparison between experimental and RPM estimates showed that RPM was capable of delivering comparatively resilient estimates of the effects of a YouTube video and a text feature on different knowledge levels and attitudes. RPM may thus be an interesting alternative for assessing WCE. However, RPM should not be considered a substitute for RCTs but an alternative strategy when situational constraints impede their application. Because of the higher internal validity of RCTs, the potential for bias from the retrospection required for RPM should always be weighed against the infeasibility of an RCT.
Although this study could demonstrate that RPM worked quite well for assessing the effects of two distinct content features, these results only hold under the conditions examined and cannot simply be generalized. As with other tests of nonexperimental designs, it is difficult to draw generalizable conclusions about the conditions under which a method provides unbiased estimates of treatment effects. Therefore, I recommend that the viability of RPM for evaluating WCE and other types of intervention be further investigated with more differentiated settings, treatments, and populations. A relatively convenient way of doing this could be the application of so-called “opportunistic experiments,” “a type of RCT that studies the effects of a planned intervention or policy action; by contrast, other types of RCT examine an intervention or policy action that is implemented for the research study” (Resch, Berk, & Akers, 2014, p. 1). Put differently, any RCT conducted for testing the effects of a planned intervention (such as a website) may serve as a basis for generating evidence to answer the question of whether it is possible to approximate experimental findings by the application of nonexperimental designs, including RPM.
Appendix
| Construct/Item |
| Objective knowledge 1: How many citizens of the Federal Republic are affected by occupational disability in the course of their life? |
| “Approximately every 12th citizen” |
| “Approximately every 8th citizen” |
| “Approximately every 4th citizen” (c) |
| Objective knowledge 2: What is the difference between permanent disability insurance and a reduced earning capacity pension? |
| “A person who is occupationally disabled gets more money if he or she has permanent disability insurance.” (c) |
| “There’s no difference, they’re both the same.” |
| “The employee gets more money with a reduced earning capacity pension.” |
| Objective knowledge 3: On average, what percentage of people between their 25th and 40th year become occupationally disabled? |
| “About 10%” |
| “About 20%” |
| “About 30%” (c) |
| Objective knowledge 4: What is the best time to take out permanent disability insurance? |
| “It’s similar to the Riester retirement scheme in as much as it doesn’t make any difference when you take it out.” |
| “The best thing is to take it out as soon as you start working or even before that, as the contributions are at their lowest then.” (c) |
| “The later the better, as otherwise you may make unnecessary contributions, which might not be paid out in full, even in a case of occupational disability.” |
| Objective knowledge 5: How much does an employee get from the state if he or she is entitled to a full reduced earning capacity pension? |
| “Approximately 20% of his former gross wage” |
| “Approximately 35% of his former gross wage” (c) |
| “Approximately 50% of his former gross wage” |
| Objective knowledge 6: What factors can lead to occupational disability? |
| “According to the law, occupational disability can only result from physical injuries and/or illnesses.” |
| “According to the law, occupational disability can only result from mental illnesses.” |
| “Occupational disability can result from either physical injuries/illnesses or mental illnesses or both.” (c) |
| Subjective knowledge |
| “I know what permanent disability insurance is for.” |
| “I know the difference between private permanent disability insurance and the state reduced earning capacity pension.” |
| “I know how high the average risk of occupational disability is.” |
| Attitudes |
| “I think that today it’s important to have permanent disability insurance.” |
| “I think permanent disability insurance makes good sense.” |
| “A person who doesn’t want to take any risks should take out private permanent disability insurance.” |
| “In my opinion, everyone should take out permanent disability insurance to protect himself against the negative consequences of illness.” |
| “Taking out permanent disability insurance is important, because otherwise you can be compelled by the state to do just any kind of work.” |
| “Only a person with permanent disability insurance need not worry about having to take on a job, which is absolutely inappropriate.” |
Note. (c) = correct answer.
Acknowledgment
I wish to thank my colleagues Hansjoerg Gaus, Joern Gruendler, Joerg Rech, and Wolfgang Meyer as well as three anonymous reviewers for their invaluable feedback and helpful comments on earlier versions of this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
