Abstract
Despite the growing popularity of online opt-in samples in criminology, recent work shows that resultant findings often do not generalize. Not all opt-in samples are alike, however, and matching may improve data quality. Replicating and extending prior work, we compare the generalizability of relational inferences from unmatched and matched opt-in samples. Estimating identical models for four criminal justice outcomes, we compare multivariate regression results from national matched (YouGov) and unmatched (MTurk) opt-in samples to those from the General Social Survey (GSS). YouGov coefficients are almost always in the same direction as GSS coefficients, especially when statistically significant, and are mostly of a similar magnitude; less than 10% of the YouGov and GSS coefficients differ significantly. By contrast, MTurk coefficients are more likely to be in the wrong direction, more likely to be much larger or smaller, and are about three times as likely to differ significantly from GSS coefficients. Matched opt-in samples provide a relatively inexpensive data source for criminal justice researchers, compared to probability samples, and also appear to carry a smaller generalizability penalty than unmatched samples. Our study suggests relational inferences from matched opt-in samples are more likely to generalize than those from unmatched samples.
In an increasingly web-based, low-response-rate world, the use of online surveying has dramatically increased in criminological research (Thompson & Pickett, 2019). Some studies have used unmatched opt-in samples, such as Amazon Mechanical Turk (MTurk), which allow nearly any of a platform’s users to answer the online questionnaire (e.g., Barnum & Solomon, 2019; Herman & Pogarsky, 2020; Miethe et al., 2019; Pickett et al., 2013; Seigfried-Spellar et al., 2017). These are convenience samples (Landers & Behrend, 2015), but they provide a relatively inexpensive method of exploring many theories and ideas and have the potential to yield a diverse, nationwide group of respondents (Weinberg et al., 2014). The risk with surveying opt-in respondents, of course, is that the resultant findings may not generalize to any broader population. Recent research suggests that observational findings from such samples often fail to generalize (Thompson & Pickett, 2019), even though experimental results tend to be externally valid (Coppock, 2019; Mullinix et al., 2015).
Another option, however, and one that is quickly becoming the most commonly used online sampling methodology in both criminological and sociological studies (e.g., Enns & Ramirez, 2018; Filindra & Kaplan, 2017; Haner, Cullen, et al., 2019; Lee et al., 2020; Lehmann & Pickett, 2017; Schutten et al., 2020; Simmons, 2017; Socia et al., 2019), is matched opt-in samples, such as those from YouGov. The objective of sample matching is to increase generalizability by adjusting for variables that affect selection, which is a model-based rather than a design-based approach to inference (Ansolabehere & Rivers, 2013). Model-based inference depends on a host of assumptions, which, if met, allow researchers to render selection ignorable through modeling adjustments (Mercer et al., 2018).
Recently, Thompson and Pickett (2019) examined the generalizability of both univariate estimates and relational inferences from online crowdsourced samples, such as MTurk, and unmatched online opt-in samples, such as SurveyMonkey Audience. As expected, they found significant differences in sociodemographic characteristics and attitude prevalence between the online nonprobability samples and those of the GSS sample. They also found that the regression coefficients for predictors of four criminal justice attitudes—views on the death penalty, acceptance of police use of force, fear of crime, and spending on law enforcement—often differed significantly. Ultimately, Thompson and Pickett (2019) concluded that these types of nonprobability online samples produce regression results that are “normally in the same direction as the GSS coefficients, especially when they are statistically significant, but [that] differ considerably in magnitude” (Thompson & Pickett, 2019, p. 1).
Thompson and Pickett’s (2019) findings are illuminating, but they provide no information about matched opt-in samples, such as YouGov. One major drawback to obtaining matched samples is the larger price tag that accompanies them, relative to unmatched opt-in samples. Therefore, an important question is whether matched opt-in samples are more generalizable than unmatched online samples. Specifically, are matched opt-in samples more likely than unmatched samples to produce relational inferences comparable to those from high-response-rate, probability-based surveys, such as the General Social Survey (GSS)? We address this research question in the current research note using a matched YouGov sample and an unmatched MTurk sample. Using the same methodology and framework as Thompson and Pickett (2019), our study examines how regression results in the YouGov and MTurk samples compare to those in the GSS for four criminal justice outcomes. In so doing, our analysis both replicates and extends Thompson and Pickett’s (2019) findings.
The Matched Sampling Design
To rely on sampling theory alone to make generalizations from a sample to a population, the sample must be drawn probabilistically from a sampling frame that has complete coverage (Groves et al., 2009). However, online opt-in sampling (e.g., MTurk, SurveyMonkey) is normally not random (Mercer et al., 2018), and never has complete or even near complete coverage—it only includes Internet users who are signed up on a specific survey platform (Baker et al., 2013; Callegaro et al., 2015). Thus, it risks bias in relationships if selection (S) into the opt-in sample is a collider variable between unmeasured (U) predictors of the dependent variable (Y) and the included regressors (X) (i.e., X → S ← U → Y), a situation Thompson and Pickett (2019, p. 5) term “confounded sampling.” (It also risks bias if the dependent variable causes selection.) The matched sample strategy seeks to eliminate bias due to confounded sampling by adjusting—through matching and weighting—for variables that influence selection, thus yielding a sample that approximates a simple random sample of the population (Ansolabehere & Rivers, 2013; Mercer et al., 2018).
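To make the logic of confounded sampling concrete, the following is a minimal simulation sketch and not part of the original analysis. It assumes a toy data-generating process in which selection depends on both an included regressor (X) and an unmeasured predictor (U) of the outcome (Y); the variable names, the selection rule, and the true slope of 0.5 are all hypothetical choices made for illustration.

```python
import random


def ols_slope(xs, ys):
    """Slope from a simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov_xy / var_x


def simulate(n=50_000, true_slope=0.5, seed=1):
    """Return (population slope, opt-in slope) under a toy selection rule."""
    rng = random.Random(seed)
    population = []
    for _ in range(n):
        x = rng.gauss(0, 1)            # included regressor X
        u = rng.gauss(0, 1)            # unmeasured predictor U of Y
        y = true_slope * x + u + rng.gauss(0, 1)
        selected = (x + u) > 0         # hypothetical rule: X -> S <- U
        population.append((x, y, selected))

    # Slope in the full population (X and U are independent here).
    full = ols_slope([p[0] for p in population], [p[1] for p in population])

    # Slope among those who "opted in" (conditioning on the collider S).
    opt_in = [p for p in population if p[2]]
    biased = ols_slope([p[0] for p in opt_in], [p[1] for p in opt_in])
    return full, biased


full_slope, opt_in_slope = simulate()
print(f"population slope: {full_slope:.2f}")
print(f"opt-in slope:     {opt_in_slope:.2f}")
```

In this toy setup, conditioning on selection induces a negative correlation between X and U among the selected cases, so the opt-in slope falls well below the population slope even though X and U are unrelated in the population.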
In general, matched opt-in surveys (e.g., YouGov) use a multistage sampling process that involves: (1) building a synthetic sampling frame (SSF) from large, high-quality, probability-based samples (e.g., American Community Survey), which is intended to represent the target population, (2) using this SSF in conjunction with matching to select respondents from an existing pool of opt-in panelists so that the sampled panelists are similar to those in the SSF on a number of specific covariates (matching variables), and (3) using weighting (propensity scoring and poststratification) to adjust for imperfect matching (Ansolabehere & Rivers, 2013). In doing so, theoretically, sample selection bias is reduced or eliminated.
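Step (2) of this process can be illustrated with a short sketch. This is a simplified, hypothetical example of greedy nearest-neighbor matching without replacement; it is not YouGov's actual algorithm, and the `Person` fields, the `distance` function, and the tiny frame and panel are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    pid: str
    gender: int    # 0 = female, 1 = male
    age: int       # years
    white: int     # 0 = Non-White, 1 = White
    college: int   # 0 = less than a Bachelor's degree, 1 = Bachelor's or higher


def distance(a, b):
    """Hypothetical distance: category mismatches plus a scaled age gap."""
    return (
        (a.gender != b.gender)
        + (a.white != b.white)
        + (a.college != b.college)
        + abs(a.age - b.age) / 10.0
    )


def match_to_frame(frame, panel):
    """For each frame case, pick the closest unused panelist (greedy)."""
    available = list(panel)
    matched = []
    for target in frame:
        if not available:
            break  # the panel cannot supply any more matches
        best = min(available, key=lambda p: distance(target, p))
        matched.append((target.pid, best.pid))
        available.remove(best)
    return matched


# Illustrative synthetic sampling frame (SSF) cases and opt-in panelists.
frame = [Person("f1", 0, 34, 1, 1), Person("f2", 1, 61, 0, 0)]
panel = [
    Person("p1", 1, 25, 1, 1),
    Person("p2", 0, 36, 1, 1),
    Person("p3", 1, 58, 0, 0),
    Person("p4", 1, 60, 1, 0),
]
pairs = match_to_frame(frame, panel)
print(pairs)
```

Here each SSF case is paired with the panelist closest to it on the matching variables, and unmatched panelists are simply not sampled. Any remaining imbalance would then be addressed by the weighting in step (3).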
The specific assumption underlying matched sampling is one of conditional exchangeability (or ignorability)—specifically, that sample selection is unrelated to the outcome variable after conditioning on observables (Baker et al., 2013; Mercer et al., 2018). In other words, the assumption is that any relationship between sample selection and the outcome variable(s) is spurious, due to common causes of both, such as respondent demographics, and that adjusting for these confounders through matching and weighting renders selection independent of the outcome (Thompson & Pickett, 2019). The reasonableness of this assumption will vary across different outcomes, for two reasons. First, different outcomes will share different common causes with selection, and not all of those confounders may be observed. Second, some outcomes, especially those pertaining to altruism and social support, may have a causal effect on selection. Even if the assumption is correct, the adjustment (matching and weighting) variables still “must be the correct variables for ensuring conditional exchangeability, and the panel must be able to supply respondents that are close matches to each case in the SSF” (Mercer et al., 2018, p. 261). When the assumption is wrong, and the outcome causes selection, or when it is correct, but the adjustment process (matching, weighting, or statistical control) fails to condition on the correct confounders, survey estimates will be biased, because sample selection will be related to the outcome variable (Mercer et al., 2018; Thompson & Pickett, 2019).
The good news is that the matching methodology produces election polling estimates that are similar to those from other large probability-based samples (e.g., American National Election Studies [ANES]) and random digit dial phone surveys, which suggests that this methodology may address the ignorability concern sufficiently for those outcomes, and at a fraction of the cost of large probability-based surveys (Ansolabehere & Rivers, 2013; Vavreck & Rivers, 2008). Additionally, the Pew Research Center’s (Kennedy et al., 2016) analysis of eight online vendors finds that sophisticated sampling and weighting methods, such as those used by YouGov (Rivers, 2016), result in the “smallest average estimated bias” of the samples that were analyzed. However, as Baker et al. (2013, p. 95) caution, “no single set of covariates can be expected to correct all the bias in a full range of survey topics, and the number of covariates needed for a given survey may be quite large.” The issue is that the specific confounders (the variables related both to selection and to the outcome of interest) will vary depending on the survey topic (Thompson & Pickett, 2019). Simmons and Bobo (2015), for example, found evidence suggesting YouGov’s sample matching method eliminated bias in regression models predicting egalitarianism, but not in those predicting racial resentment.
Notably, researchers have used matched samples to explore a range of criminal justice attitudes, including, but not limited to, attitudes toward the death penalty and other punitive policies (Lehmann & Pickett, 2017; Norris & Mullinix, 2019; Simmons, 2017), gun control (Filindra & Kaplan, 2017; Haner, Cullen, et al., 2019), sex offenders (Socia & Harris, 2016; Socia et al., 2019), specialty courts (Thielo et al., 2019), prison privatization (Enns & Ramirez, 2018), terrorism (Haner, Sloan, et al., 2019; Haner et al., 2020), school shootings (Lee et al., 2020; Schutten et al., 2020), police-community relations (McManus et al., 2019), police brutality (Graham et al., 2020), body-worn cameras (Graham et al., 2019), and offender redeemability (Burton et al., 2020). Still lacking, however, is evidence about whether sample matching effectively eliminates selection bias in opt-in samples for criminal justice attitudes. The current study addresses this research void. To our knowledge, our study is the first to examine the generalizability of regression results from matched opt-in samples in relation to criminal justice attitudes.
Methodology
Data
The GSS data in this study come from the 2018 sample because it was the most recent available. In order to collect a nationally representative sample of respondents, the GSS uses a multistage cluster sampling design, which selects participants from primary sampling units. These respondents complete face-to-face interviews, which are conducted by the National Opinion Research Center (NORC). In the 2018 GSS sample, 2,348 respondents were surveyed; however, the full analytic sample in this study ranges from N = 1,070 to N = 1,496, both because the GSS only asks roughly two-thirds of the sample the 200 “core” questions and because of item nonresponse.
The YouGov data were collected between March 7th and 16th, 2020. YouGov initially surveyed 1,023 respondents, who were matched to an SSF based on gender, age, race, and education. This SSF was constructed through stratified sampling from the 1-year American Community Survey sample (2017). Matching yielded a sample of 1,000 respondents, which was then weighted to the SSF using propensity scores, after which the weights were post-stratified. Overall, the analytic sample ranges from N = 599 to N = 719 due to item nonresponse and “don’t know” responses.
The MTurk data were collected between March 20th and 21st, 2020. To capture high-quality respondents, participation was limited to those who had completed over 500 HITs (human intelligence tasks), had a 95% or higher approval rating, and were aged 18 or older (Peer et al., 2014). In total, 845 respondents completed the survey; however, the analytic sample ranges from N = 663 to N = 772 due to item nonresponse and “don’t know” responses.
Dependent Variables
In this study, we analyzed four dependent variables, which used the exact same wording and response options across all three samples. They measure (1) fear of crime, (2) law enforcement spending preferences, (3) global attitudes towards police use of force, and (4) death penalty support. The exact item wording and response options can be found in Table 1.
Table 1. Dependent Variable Item Wording and Response Options.
Independent Variables
Based on previous public opinion research on crime and justice (Brown & Socia, 2017; Cochran & Sanders, 2009; Unnever & Cullen, 2010), sociodemographic and attitudinal predictors were included in these models. In accordance with Thompson and Pickett’s (2019) predictors, race (0 = Non-White, 1 = White), ethnicity (0 = Non-Hispanic/Latino, 1 = Hispanic/Latino), sex (0 = female, 1 = male), age (in years) 1 , education (0 = less than a Bachelor’s degree, 1 = Bachelor’s degree or higher), and income were included. Income was measured on a 16-point scale ranging from 1 = “less than $10,000” to 16 = “$500,000 or more” and was treated as a continuous variable. Respondents’ political ideology was measured categorically (1 = Liberal to 3 = Conservative), and “Liberal” is the reference group in the models.
Finally, as well as being used as a dependent variable, fear of crime is used as an independent variable in the models examining policy attitudes, for two reasons. First, since at least the 1980s, theoretical and empirical scholarship on public opinion has considered fear of crime—under a utilitarian model of attitudes—to be a potential cause of punitiveness and support for aggressive policing (Baker et al., 2016; Kleck & Jackson, 2017; Langworthy & Whitehead, 1986; Silver & Pickett, 2015). Second, including fear as a predictor of policy attitudes replicates Thompson and Pickett (2019), increasing the comparability of our findings to theirs, and produces identical model specifications across the remaining dependent variables.
Analytic Strategy
For the analysis, the GSS and YouGov samples are weighted using the provided weights. 2 The MTurk sample is not weighted, because MTurk samples are rarely weighted in studies that use them (Mullinix et al., 2015) and because Thompson and Pickett (2019) found that weighting MTurk samples did not help. In line with Thompson and Pickett’s (2019) analysis, we first examine the univariate statistics for the demographics and outcome variables of these three samples. Next, we estimate a total of four comparable models across the samples (logistic regression models for FEAR, POLHITOK, and CAPPUN, and an ordinal regression model for NATCRIMY), using all available predictors for the relevant dependent variable. Following Thompson and Pickett (2019), we treated “Don’t know” responses as missing values. Furthermore, in line with Thompson and Pickett’s (2019) methodology, and with methods used in other studies (Open Science Collaboration, 2015), we compared coefficients’ directions, whether the opt-in sample’s coefficient falls within the GSS’s 95% confidence interval, and vice versa, and whether the coefficients in the samples differ significantly from each other. To test whether coefficients differ significantly, we used the Paternoster et al. (1998) method. 3
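The comparison criteria described above can be sketched in a few lines. The example below is illustrative only: it assumes the comparison is made on the log-odds (coefficient) scale with normal-approximation 95% confidence intervals, and it uses the Paternoster et al. (1998) z-test for independent samples, z = (b1 − b2) / sqrt(SE1² + SE2²). The function name `compare_coefficients` and the coefficient values are hypothetical and are not taken from the study's models.

```python
import math


def compare_coefficients(b_ref, se_ref, b_opt, se_opt, z_crit=1.96):
    """Compare an opt-in coefficient with a reference (e.g., GSS) coefficient."""
    # Normal-approximation 95% confidence intervals for each coefficient.
    ref_ci = (b_ref - z_crit * se_ref, b_ref + z_crit * se_ref)
    opt_ci = (b_opt - z_crit * se_opt, b_opt + z_crit * se_opt)

    # Paternoster et al. (1998) z-test for the equality of two coefficients
    # estimated in independent samples.
    z = (b_ref - b_opt) / math.sqrt(se_ref ** 2 + se_opt ** 2)

    return {
        "same_direction": (b_ref > 0) == (b_opt > 0),
        "opt_in_within_ref_ci": ref_ci[0] <= b_opt <= ref_ci[1],
        "ref_within_opt_in_ci": opt_ci[0] <= b_ref <= opt_ci[1],
        "z": z,
        "differs_significantly": abs(z) > z_crit,
    }


# Made-up logit coefficients and standard errors, for illustration only.
result = compare_coefficients(b_ref=0.40, se_ref=0.10, b_opt=0.75, se_opt=0.12)
for name, value in result.items():
    print(name, value)
```

Note that the four criteria can disagree: a coefficient may share the reference coefficient's sign and still fall outside its confidence interval, which is why each criterion is reported separately.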
Results
The descriptive statistics for the samples can be seen in Table 2. Like Thompson and Pickett (2019), we find that the demographic characteristics of the opt-in samples differ from those of the probability sample. However, our study shows that the differences are much smaller for the matched sample. The YouGov sample is significantly older (+2.32 mean years), has a higher income (+.26 mean), is more conservative (+5.71%), and has fewer Whites (−4.82%), Hispanics (−3.08%), and college graduates (−4.68%) than the GSS sample. But these differences pale in comparison to those between the MTurk and GSS samples, which are both more numerous and far larger in size. Compared to the GSS, the MTurk respondents are much younger (−8.77 mean years), much more likely to be male (+12.67%), much more likely to be college graduates (+24.08%), much less likely to be Hispanic (−10.74%), and much more liberal (+21.91%). They also have a significantly lower average income (−1.17 mean) and are more likely to be White (+3.83%).
Table 2. Descriptive Statistics.
SD = standard deviation.
Weighted data.
*p < .05. **p < .01. ***p < .001 (two-tailed) for difference from GSS.
Similar to Thompson and Pickett (2019), we also find significant univariate differences in criminal justice attitudes between the GSS and opt-in samples. Once again, however, the differences are generally larger for the unmatched than the matched sample. The opt-in respondents wanted to spend less on law enforcement (YouGov, +3.42%; MTurk, +11.7%), were more afraid of crime (YouGov, +7.56%; MTurk, +4.01%), were more supportive of police use of force (YouGov, +2.44%; MTurk, +4.89%), and diverged in death penalty support (YouGov, +1.65%; MTurk, −15.51%). With one exception (fear of crime), the differences for the MTurk sample are more than twice as large as for the YouGov sample.
Turning to the multivariate analyses, Figure 1 provides the odds ratios (OR) and confidence intervals for the models estimating fear of crime in the GSS, YouGov, and MTurk samples. Compared to the GSS sample, the YouGov sample has six coefficients in the same direction, and two in the opposite direction; the MTurk sample has seven coefficients in the same direction and one in the opposite direction. Two of the YouGov coefficients (for Hispanic and Male), but three of the MTurk coefficients (for Hispanic, Male, and Moderate), fall outside of the respective GSS confidence interval. Only one YouGov confidence interval (for Male) fails to capture the respective GSS coefficient, but the MTurk sample has two failures to capture (for Male and Moderate). None of the YouGov and GSS coefficients differ significantly, per the Paternoster et al. (1998) test, but one MTurk coefficient (for Male) is significantly different from its GSS counterpart. Thus, the YouGov sample slightly outperforms the MTurk sample for this particular outcome variable, fear of crime.

Figure 1. Odds ratios and 95% confidence intervals for logistic regression: FEAR.
Figure 2 presents the OR results for support for law enforcement spending. Compared to the GSS results, seven of the YouGov coefficients are in the same direction, but only six are in the MTurk sample. Only one YouGov coefficient (for Income) falls outside the GSS confidence interval, but three do in the MTurk sample (for White, Male, and Education). None of the YouGov confidence intervals exclude the respective GSS coefficient, but three of the MTurk confidence intervals do. Per the Paternoster et al. (1998) test, none of the coefficients differ significantly between the YouGov and GSS samples. Between the MTurk and GSS samples, however, two coefficients (for White and Male) differ significantly. Thus, the YouGov sample outperforms the MTurk sample for this outcome variable, spending preferences.

Figure 2. Odds ratios and 95% confidence intervals for ordinal regression: NATCRIMY.
Figure 3 provides the OR estimates for acceptance of police use of force. Compared to the GSS results, eight of the YouGov coefficients, but only four of the MTurk coefficients, are in the same direction. Only three YouGov coefficients (for White, Age, and Income) fall outside the GSS confidence intervals, but six MTurk coefficients do (for White, Hispanic, Age, Education, Income, and Fear). Similarly, twice as many MTurk confidence intervals fail to capture the respective GSS coefficient compared to the YouGov confidence intervals. The difference in coefficients is statistically significant for three variables (White, Age, and Income) in the YouGov sample, but is significant for five variables (White, Hispanic, Education, Income, and Fear) in the MTurk sample. For this outcome, then, the YouGov sample does almost twice as well as the MTurk sample.

Figure 3. Odds ratios and 95% confidence intervals for logistic regression: POLHITOK.
Figure 4 provides the OR estimates for capital punishment attitudes. Compared to the GSS sample, the YouGov sample has five coefficients in the same direction and four in the opposite direction, whereas the MTurk sample has four coefficients in the same direction and five in the opposite direction. Four of the YouGov coefficients (White, Hispanic, Age, and Conservative) fall outside of the GSS confidence interval, whereas six of the MTurk coefficients do (White, Hispanic, Male, Education, Conservative, and Fear). In the YouGov sample, only one confidence interval (for Conservative) fails to capture the respective GSS coefficient, whereas there are five failures to capture in the MTurk sample (for White, Male, Education, Conservative, and Fear). Per the Paternoster et al. (1998) test, none of the YouGov coefficients differs significantly from its GSS counterpart, but one of the MTurk coefficients does (Education). Overall, the YouGov sample slightly outperforms the MTurk sample for this specific outcome variable, death penalty support.

Figure 4. Odds ratios and 95% confidence intervals for logistic regression: CAPPUN.
Table 3 summarizes all of the regression results from all of the models. The results are quite clear about which opt-in sample is preferable and are consistent no matter which comparison we focus on. Let us start with the directional findings. The YouGov coefficients are more likely than the MTurk coefficients (71.43% vs. 60.00%) to be in the same direction as the GSS coefficients. The YouGov advantage remains when we just look at the GSS coefficients that are statistically significant: 86.67% of the YouGov coefficients are in the same direction, compared to 66.67% of MTurk coefficients. And it remains when we just look at the coefficients that are statistically significant in the opt-in samples: 100% of the significant YouGov coefficients are in the same direction as the GSS coefficients, compared to 80% of the MTurk coefficients.
Table 3. Summary of Findings.
What about the magnitude of the coefficients? Which opt-in sample produces coefficients that are more similar in size to those in the probability sample? The advantages of the matched YouGov sample are even larger here. The YouGov coefficients are more likely than the MTurk coefficients (71.43% vs. 51.43%) to fall within the respective GSS confidence interval. The YouGov confidence intervals are also more likely than the MTurk confidence intervals (82.86% vs. 42.86%) to capture the respective GSS coefficient. Perhaps most notably, the YouGov coefficients are less likely than the MTurk coefficients (8.57% vs. 25.71%) to differ significantly from the GSS coefficients. Put another way, the MTurk coefficients are three times as likely as the YouGov coefficients to be significantly larger or smaller than the GSS coefficients. We also compared the absolute differences between the GSS coefficients and those in the two opt-in samples. Once again, the matched YouGov sample did better than the unmatched MTurk sample. The average absolute difference was 36% larger in MTurk than YouGov, indicating that the matched YouGov sample produced coefficients more similar in size to those in the GSS.
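As a rough check on the significant-difference comparison, the shares can be recomputed from the per-model counts given in the Results text. The sketch below assumes 35 coefficients in total (8 predictors in the FEAR model and 9 in each of the other three, which add fear as a regressor); that total is inferred from the reported counts rather than stated directly.

```python
# Coefficients that differ significantly from the GSS, by model, as reported
# in the Results text (order: FEAR, NATCRIMY, POLHITOK, CAPPUN).
sig_diff = {
    "YouGov": [0, 0, 3, 0],
    "MTurk": [1, 2, 5, 1],
}

# Assumed number of coefficients: 8 for FEAR, 9 for each of the other models.
n_coefficients = 8 + 9 + 9 + 9

# Percentage of coefficients that differ significantly in each opt-in sample.
shares = {
    sample: 100 * sum(counts) / n_coefficients
    for sample, counts in sig_diff.items()
}
for sample, share in shares.items():
    print(f"{sample}: {share:.2f}% differ significantly")

# How many times more likely MTurk coefficients are to differ significantly.
ratio = shares["MTurk"] / shares["YouGov"]
print(f"MTurk-to-YouGov ratio: {ratio:.1f}")
```

Under these assumptions, 3 of 35 YouGov coefficients and 9 of 35 MTurk coefficients differ significantly from their GSS counterparts.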
Finally, we examined how the absolute differences varied across the dependent and independent variables. Two notable findings emerged. First, among the dependent variables, the absolute differences were largest by far for POLHITOK. This was true in both MTurk and YouGov, and suggests that online opt-in samples may provide particularly inaccurate estimates of the correlates of attitudes toward police use of force. Second, in both opt-in samples, the absolute differences in coefficients across samples were especially large for race and/or ethnicity. This is highly consistent with past research. Kennedy et al. (2016, p. 4), for example, compared findings across nine opt-in panels to those from a probability sample. They found that: (1) univariate “estimates for Hispanics and blacks show the largest biases of all subgroups” (p. 15), and (2) “marginal effects [multivariate estimates] associated with race and ethnicity are rarely correct” (p. 22). The explanation is not so much that Blacks and Hispanics are underrepresented in opt-in panels, although they are, but that the Blacks and Hispanics who do join these panels are different in unknown ways from those who do not. The strong implication is that “researchers using online nonprobability samples are at risk of drawing erroneous conclusions about the effects associated with race and ethnicity” (Kennedy et al., 2016, p. 4). 4
Conclusion
This study examined the generalizability of criminal justice attitudes from matched and unmatched opt-in samples. Compared to the GSS, a high-response rate probability sample, the descriptive statistics and the prevalence of attitudes were more similar in the matched opt-in sample (YouGov) than in the unmatched sample (MTurk). The regression coefficients in the matched opt-in sample (YouGov) were almost always in the same direction as the GSS coefficients, especially when statistically significant, and were often also of a similar magnitude. Less than 10% of the coefficients in the YouGov sample differed significantly from those in the GSS sample. By contrast, over 25% of the MTurk coefficients differed significantly from the GSS coefficients. Additionally, in a variable-by-variable comparison of coefficients, the matched sample’s coefficients were closer than the unmatched sample’s to the GSS’s about 63% of the time. The key takeaway is that criminologists should use matched rather than unmatched opt-in samples whenever probability samples are not feasible.
Our study is not without limitations. To assess generalizability, we compared our opt-in samples to the GSS, but research shows that the GSS itself differs from the U.S. Census’s Current Population Survey (Simmons & Bobo, 2015). 5 Other studies find that when compared against each other, there often are as many inconsistencies in results between different probability samples as between opt-in samples and probability samples (Ansolabehere & Schaffner, 2014; Bhutta, 2012). Thus, future research should replicate our study using multiple probability samples from different sources, not just from the GSS. Additionally, we only explored four criminal justice attitudes. It is possible that our findings would differ if we focused on other types of criminal justice attitudes, such as attitudes toward offender redeemability, body-worn cameras, or gun control. Future research should thus replicate our analysis using other attitudinal measures to further examine the utility of matched opt-in samples in criminal justice research. Finally, we focused on observational relational inferences, but research shows that opt-in samples perform better in experiments (Coppock, 2019; Mullinix et al., 2015), because of low levels of effect heterogeneity (Coppock et al., 2018). Although opt-in samples as a whole may be better suited for experiments, we would still expect matched opt-in samples to outperform unmatched samples in experimental studies as well, given their greater descriptive and attitudinal similarity to the general population. Future research should test this possibility.
In conclusion, our study replicates Thompson and Pickett’s (2019) study by showing that the magnitude of coefficients in unmatched opt-in samples is often off, and extends their analysis by showing that findings from matched opt-in samples are more generalizable. In every comparison (Table 3), we found that the matched sample outperformed the unmatched sample. The coefficients in the matched sample were more likely to be in the same direction as those in the probability sample and were also more likely to be of a similar size. In fact, our findings suggest that researchers using matched opt-in samples to explore criminal justice attitudes would obtain coefficients that differ significantly from those in probability samples less than 10% of the time. Thus, matched opt-in samples provide a relatively inexpensive data source for criminal justice researchers, compared to probability samples, and also appear to carry a smaller generalizability penalty than unmatched samples.
Supplemental Material
Supplemental material, sj-pdf-1-cad-10.1177_0011128720977439 for Advantages of Matched Over Unmatched Opt-in Samples for Studying Criminal Justice Attitudes: A Research Note by Amanda Graham, Justin T. Pickett and Francis T. Cullen in Crime & Delinquency
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
