On the Transportability of Laboratory Results

Abstract

The “transportability” of laboratory findings to instances other than the original implementation entails the robustness of rates of observed behaviors and estimated treatment effects to changes in the specific research setting and in the sample under study. In four studies based on incentivized games of fairness, trust, and reciprocity, we evaluate (1) the sensitivity of laboratory results to locally recruited student-subject pools, (2) the comparability of behavioral data collected online and, under varying anonymity conditions, in the laboratory, (3) the generalizability of student-based results to the broader population, and (4) with a replication at Amazon Mechanical Turk, the stability of laboratory results across research contexts. For the class of laboratory designs using incentivized games as measurement instruments of prosocial behavior, we find that rates of behavior and the exact behavioral differences between decision situations do not transport beyond specific implementations. Most clearly, data obtained from standard participant pools differ significantly from those from the broader population. This undermines the use of empirically motivated laboratory studies to establish descriptive parameters of human behavior. Directions of the behavioral differences between games, in contrast, are remarkably robust to changes in samples and settings. Moreover, we find no evidence for either anonymity effects or mode effects potentially biasing laboratory measurement. These results underscore the capacity of laboratory experiments to establish generalizable causal effects in theory-driven designs.

Keywords

anonymity, experimental methods, external validity, laboratory research, mode effects, online experiments, prosocial behavior, sample effects

Introduction

Laboratory experiments have a decisive methodological advantage over alternative modes of data generation in the social sciences: Group formation, randomization, and manipulation—while holding environmental factors constant—ease the testing of hypotheses regarding causes and effects (e.g., Falk and Heckman 2009; Shadish, Cook, and Campbell 2002; Webster and Sell 2014). Because of their support for causal inference (internal validity), many consider laboratory experiments the “gold standard” of scientific inquiry (e.g., Morgan and Winship 2015; Rubin 2008).¹ Note that our discussion of social experiments focuses on designs measuring actual behavior rather than behavioral intentions, attitudes, or opinions. In this tradition, laboratory research allows the elimination of plausible alternative explanations for results and generalization is directed to the support or nonsupport of theoretical principles (Thye 2014; Willer and Walker 2007; Zelditch 2014).

Others have criticized laboratory research in the social sciences due to its often questionable generalizability to the “real world” (external validity). In social science lab research, external validity refers first and foremost to lab–field generalizability and thus to the question whether individuals examined in the laboratory behave as they would in everyday life (Jackson and Cox 2013; Levitt and List 2007). This question is particularly relevant if one conceives of laboratory methods also as measuring instruments of certain types of behavior (e.g., Franzen and Pointner 2013; Glaeser et al. 2000; Rauhut and Winter 2010). Upstream requirements for external validity entail that laboratory results are robust to changes in both the specific research setup and the sample under study (Campbell and Stanley 1963; Cronbach 1982). These criteria convey the “transportability” (Pearl and Bareinboim 2014) of findings to other implementations beyond any specific design. After all, “[m]ost experiments are highly local but have general aspirations” (Shadish et al. 2002:18).

This article tests the minimal requirements for external validity of laboratory research in the social sciences. We conceptualize a laboratory design as a specific combination of subjects (units), stimuli (treatments), measurements (observations), and context (setting). This decomposition was first introduced by Cronbach (1982:78) and has been used by others (e.g., Gerber and Green 2012; Shadish et al. 2002) to evaluate experimental findings’ range of validity.

In four studies, we assess each dimension’s importance for establishing transportability: Study 1 varies units in a multilocation laboratory comparison conducted at two German universities, in Leipzig and Munich (pool generalizability). Study 2 targets observations and tests for comparability of behavioral data collected online and, under varying anonymity conditions, in the laboratory (mode generalizability). Study 3, a nationwide online implementation, again relates to units and tests our baseline results’ transportability to the broader population (sample generalizability). Study 4 concerns the setting of data collection (context generalizability) and—transporting our standardized decision situation into an online labor market—considers workers at Amazon’s crowdsourcing platform Mechanical Turk (MTurk). Different samples, modes, or settings may violate transportability in that they produce different rates of observed behaviors and—more worryingly for experimental research—heterogeneous treatment effects.

Our results rest on behavioral data collected in incentivized games of fairness, trust, and reciprocity from 2,664 subjects using the same decision interface. Throughout our four studies, we focus on two decision-making situations frequently used in the methodological research on laboratory designs: the dictator game (DG) and the trust game (TG). These games differ in complexity, carry the potential for socially desirable responses, and enable a direct comparison with an extant literature. Because socially acceptable (DG) and socially optimal (TG) behaviors diverge from first movers’ egoistic strategies, these games reveal expectations about valid norms of fairness (DG), trust, and reciprocity (TG) in a particular population and setting (Bicchieri 2006; Elster 2007). We compare behavior in these situations, replicating the common finding that an investment opportunity (TG), rather than altruism (DG), motivates first movers to higher transfers. Our focus on the interplay between games advances prior studies on the transportability of lab results, allowing us to investigate how qualitative results (the ranking of mean transfers across games) and point estimates of behavioral differences between decision situations (the within-subject differences in transfers across games) generalize to other units, observations, and settings.

Detecting violations of transportability in laboratory designs constitutes a lively research area within experimental economics. This activity has led to significant advances in the way social scientists implement and interpret laboratory studies (see the overviews by Fréchette 2016; Galizzi and Navarro-Martínez 2018; Levitt and List 2007). These efforts have, however, remained selective, focusing on particular aspects of laboratory designs one at a time, such that results regarding lab findings’ sensitivity toward changes in implementation are often mixed. The “replication crisis” in the social sciences (Chang and Li 2015; Freese 2007; Open Science Collaboration 2015) reinforces the call for thorough tests of experimental reliability, more diverse samples, and attention to potentially heterogeneous treatment effects. The narrow variation of sociodemographics in standard experimental subject pools remains conspicuous (Druckman and Kam 2011; Henrich, Heine, and Norenzayan 2010; Peterson 2001), particularly when experimenters seek general insights into human behavior or estimate treatment effects that may interact with individuals’ background characteristics.

Against this backdrop, we systematically assess the transportability of laboratory results and map safe grounds for behavioral studies conducted both in the laboratory and online. In the remainder, we proceed as follows: In the second section, we use Cronbach’s (1982) decomposition of laboratory designs into units, treatments, observations, and settings to delineate how each dimension relates to the general desideratum of transportability. In the third section, we discuss established protocols useful in identifying threats to external validity in social science lab research. This review will motivate our own test strategies and highlight how our four studies complement the existing literature. In the fourth section, we outline our design. The fifth section presents our results. In the concluding section, we discuss our findings’ practical implications for experimenters in the social sciences.

Demands to Transportability

Cronbach (1982) defines laboratory designs as combinations of specific units, treatments, observations, and settings. The acronym utos refers to the particular “instances on which data are collected” (p. 78). Each dimension has consequences for the transportability of laboratory results (see also Shadish et al. 2002; we follow their simplified conceptualization).

Units refer to the participants of a laboratory study. For external validity, participants must be broadly representative of the target population to which one wishes to generalize. This implies random sampling from the target population and the use of inferential statistics. Generalizability under nonrandom sampling requires, as a minimal condition, that sociodemographic characteristics relevant to sustaining expected treatment effects overlap in the subject pool and the target population.

Treatments represent the randomized stimuli participants are exposed to. Treatments should reproduce real-world conditions as closely as possible. It is most important for theory-driven experiments, however, that treatments closely represent the theoretical concepts under study (construct validity). In addition, treatments should be well calibrated. Subtle treatments, for example, induce the risk of experimenters mistaking a lack of treatment perception for a null result (treatment validity).

Observations denote the measurement of outcome variables. In laboratory studies, reactivity is a major measurement concern. Subjects’ feeling of being observed can shift measurements toward socially desirable outcomes (Hawthorne effect), and subjects potentially bias measurements by forming beliefs about the purpose of scientific inquiry (experimenter demand effect). Online experiments, which have recently become popular, offer increased anonymity but less control over the participants’ surroundings.

Settings characterize the context of data generation. Transportability, again, relates to the mapping of the laboratory setup to real-world conditions. The artificial lab context generally runs counter to this criterion but at least allows for “experimental realism”: Experimenters must implement theoretically relevant features in a way that allows participants to assign similar meanings as they would in natural contexts.

Cronbach’s dimensions indicate the range of validity for a given laboratory study. Results directly generalize to units, treatments, observations, and settings fully covered in the experiment (utos). Generalizations to conditions beyond those covered in the laboratory (which Cronbach terms UTOS) require additional bridging assumptions. The range of validity ends, however, where conditions clearly deviate from the empirical implementation in at least one of these four dimensions. In such cases, Cronbach speaks of *UTOS. Here, the laboratory design no longer sustains transportability.

The main purpose of experimental research is to test causal relationships derived from theoretical hypotheses (Martin and Sell 1979; Willer and Walker 2007). External validity, however, is compromised if elements not randomized by the design, such as population, period, or setting, interact with the hypothesis under study (Zelditch 2014). Ideally, theory should inform about the scope of its application and delineate potential heterogeneous treatment effects to enable valid experimental tests. If underlying theories are incomplete—as in the case of “effect experiments” (Zelditch 2014:183)—the challenge of establishing external validity is much greater (Schram 2005). Behavioral economists, for example, frequently measure rates of behavior in incentivized decision situations from convenience samples in order to generalize regularities of human behavior (e.g., “social preferences”) or infer effects of “culture” on observed rates of behavior (see Kessler and Vesterlund 2013; Levitt and List 2007 for critique). Among other things, our research will underscore the problems associated with such empirically driven applications of laboratory designs.

Prior Results and Our Contribution

Following Cronbach’s (1982) typology as a structuring framework, we briefly discuss established designs for identifying threats to the transportability of laboratory results. We mainly draw on studies from experimental economics which, in the last decade, saw lively research on the methodological issues of laboratory research. We restrict our review to studies identifying potential violations based on protocols measuring prosocial behavior² and highlight how our four studies advance prior work.

Units

Multilocation experiments evaluate the sensitivity of laboratory results to locally recruited subject pools. Roth and colleagues’ (1991) parallel implementation of bargaining games at universities in Jerusalem, Ljubljana, Pittsburgh, and Tokyo is a classic in this domain. Close monitoring of local experimenters, careful translations, and the adjustment of stakes according to differences in purchasing power led the authors to conclude that “[b]ecause of the way the experiment was designed […] the differences in bargaining behavior among countries are not due to differences in languages, currencies, or experiments but may tentatively be attributed to cultural differences” (p. 1068). Many studies followed (e.g., Brandts, Saijo, and Schram 2004; Henrich et al. 2001; Kocher et al. 2008), comparing elicited behaviors across locations; yet, just as in Roth et al. (1991), they confound local pool effects with differences in nationality and culture. We fill this gap in study 1, comparing laboratory results across student-subject pools at two German universities in Leipzig and Munich.

A second design evaluates whether lab results from student participants transport to broader, more representative populations. Studies of this type invite nonstudent residents from the proximity of a university to participate in lab sessions (Anderson et al. 2013; Belot, Duch, and Miller 2015; Cappelen et al. 2015; Falk, Meier, and Zehnder 2013) and compare the results to control sessions featuring student participants.³ These comparisons find students less generous, trustful, and cooperative than their nonstudent counterparts. Apparently, social-preference parameters estimated from student pools do not generalize to more general samples.⁴ We complement these efforts in study 3, comparing our student baseline to findings from a nationwide implementation conducted over the Internet.

Observations

Subjects’ feelings of being observed can shift measurements toward socially desirable outcomes. A common strategy to assess reactivity in the laboratory relies on the variation of anonymity conditions. Extending from standard setups—which protect anonymity toward other subjects—Franzen and Pointner (2012) and Hoffman, McCabe, and Smith (1996) use procedures which ensure anonymity toward the experimenter as well (using blinds, anonymized envelopes, or randomized-response techniques). Both studies report decreased rates of socially desirable behavior under increased anonymity. Barmettler, Fehr, and Zehnder (2012), on the other hand, find no effect of anonymity toward the experimenter. We address subject reactivity with a manipulation of anonymity conditions in our two laboratories. If subjects’ feeling of being observed affects measurements in the laboratory, we should encounter less prosocial behavior with rising anonymity levels. Our manipulation does not aim at testing theoretical explanations of prosocial behavior (e.g., social control vs. internalized norms) but tests the comparability of data generated under different anonymization procedures commonly used in social science lab research.

Anonymity is also attainable through online experiments. These have become increasingly popular among social scientists due to both low costs and access to broad participant pools (e.g., Gosling et al. 2010; Rand 2012). Online experimenters, however, obtain no direct control over the participants’ surroundings, which may pose threats to internal validity (Clifford and Jerit 2014; Reips 2002). For example, subjects may find themselves observed by others during participation, search the Internet for eligible strategies, or disbelieve the supposed interaction with other human subjects. A rigorous test strategy for mode effects of data collection requires members of the same population to take part in the same study in either the lab or the online version. Drawing on student participants, Beramendi, Duch, and Matsuo (2016) find no mode effect on various outcome measures, including the DG and a modified version of the Public Goods Game. The authors, however, failed to randomize subjects effectively, leading to marked sociodemographic differences between lab and online participants. Hergueux and Jacquemet (2015), on the other hand, randomized students to parallel lab or online sessions. Their study finds higher rates of selfish behavior among lab subjects. In their study, however, online participants received payoff through PayPal and—being spared traveling to the physical lab—faced lower participation effort. We fix these issues in study 2, randomizing student subjects into either lab or online sessions while keeping participation effort constant across modes. We then compare our online results to lab results obtained under varying anonymity conditions.

Settings

Manipulations of the setting of data generation address the crucial issue of “real-world” generalizability. A growing number of studies compares lab behavior with choices made in concealed field experiments. In a rigorous variant of this design, researchers take care to map the artificial decision space (e.g., DG) closely onto the unobtrusive measurement (e.g., giving to a charity) and then exploit within-subject comparisons between settings. Typically, these studies find qualitative lab–field correspondence: Individuals who share, cooperate, or trust in the lab also exhibit more prosocial behavior in the field (e.g., Benz and Meier 2008; Englmaier and Gebhardt 2016; Franzen and Pointner 2013). Some implementations, however, report zero correlations (for a review, see Galizzi and Navarro-Martínez 2018) and, more importantly, the empirical evidence at hand is likely to suffer from publication bias (Coppock and Green 2015). An alternative design addressing the realism of experiments utilizes the sampling of professionals with relevant task experience (e.g., Alevy, Haigh, and List 2007; Fehr and List 2004; Potters and van Winden 2000) in “framed field experiments” (Harrison and List 2004:1014): Because legislators, managers, and traders import their day-to-day experiences into the experimental situation, instructions can trigger work-related frames and heuristics altering the context of the experiment (Fréchette 2015).

A related and recently popularized strategy to vary experimental settings makes use of the large and heterogeneous participant pool sustained at MTurk (e.g., Amir, Rand, and Gal 2012; Berinsky, Huber, and Lenz 2012; Crump, McDonnell, and Gureckis 2013). Many consider the platform a real online labor market (Horton, Rand, and Zeckhauser 2011; Rand 2012) in which workers seek profit-maximizing allocation of time and qualification. In addition, many workers at MTurk are experienced participants in social experiments (Chandler, Mueller, and Paolacci 2014; Rand et al. 2014) and the perceived social distance is likely to be larger among MTurk participants than among traditional laboratory subjects. As a result, experimenters can expect to observe different and more “rational” situational logics than what one is used to from physical laboratories. In study 4, we replicate our online implementation at MTurk to test the robustness of behavioral data collection against a change in the research setting.

Treatments

Methodological research on laboratory designs makes frequent use of two decision situations, the dictator game (DG) and the trust game (TG). In each situation, participants must choose between self-interested and socially desirable behaviors. The games are thus natural candidates for investigations into anonymity effects, mode effects, and the sensitivity of results to different samples and contexts.

In DG, a participant receives a monetary stake and can decide how much of the pie (0–100 percent) she passes to a receiver (Kahneman, Knetsch, and Thaler 1986). Experimenters typically interpret giving as a manifestation of prosocial preferences. TG, on the other hand, mimics an investment decision, thereby introducing the possibility of nonreciprocity by a second mover (Berg, Dickhaut, and McCabe 1995): A trustor and a trustee each receive a stake. The trustor can decide how much of her stake (0–100 percent) she sends to the trustee. The experimenter doubles this amount. The trustee then decides how much of the doubled amount (0–100 percent) she sends back to the trustor. Placing trust depends on the trustor’s belief in the validity of a prosocial norm of reciprocity securing the trustee’s trustworthiness. Unlike in DG, first movers are required to form expectations on second movers’ likelihood of reciprocation (Glaeser et al. 2000).
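The payoff structure of the two games can be illustrated with a minimal Python sketch. It assumes the rules as stated in this section (a doubled transfer in TG, with the return share applied to the doubled amount only) and the 10-percent transfer grid described in the Instructions section. The function names and the integer-percent representation are illustrative choices, not part of the original decision interface, and applying the grid to second movers as well is an assumption.

```python
from typing import Tuple


def _validate(share_pct: int) -> None:
    """Transfers are assumed to be restricted to 0, 10, ..., 100 percent."""
    if share_pct not in range(0, 101, 10):
        raise ValueError(f"share must be a multiple of 10 in [0, 100], got {share_pct}")


def dictator_payoffs(stake: float, share_pct: int) -> Tuple[float, float]:
    """Return (dictator, receiver) payoffs for a given transfer share."""
    _validate(share_pct)
    transfer = stake * share_pct / 100
    return stake - transfer, transfer


def trust_payoffs(endowment: float, sent_pct: int, returned_pct: int,
                  multiplier: float = 2.0) -> Tuple[float, float]:
    """Return (trustor, trustee) payoffs.

    Both players start with the same endowment. The amount sent is
    multiplied (doubled here), and the trustee returns a share of the
    multiplied amount only; her own endowment is not at stake.
    """
    _validate(sent_pct)
    _validate(returned_pct)
    sent = endowment * sent_pct / 100
    received = multiplier * sent
    returned = received * returned_pct / 100
    return endowment - sent + returned, endowment + received - returned


print(dictator_payoffs(10.0, 40))   # (6.0, 4.0)
print(trust_payoffs(5.0, 100, 0))   # (0.0, 15.0)
print(trust_payoffs(5.0, 100, 50))  # (5.0, 10.0)
```

Under these assumptions, a trustor who sends her full €5 endowment and receives nothing back leaves the trustee with €15, which is consistent with the maximum additional earnings of €15.00 reported for studies 1 and 2.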

The two decision situations do not qualify as experiments due to their lack of treatments. The interplay between games, however, allows us to replicate the common finding (e.g., Camerer 2003; Camerer and Fehr 2004) that an investment opportunity (TG) motivates first movers to higher transfers than altruism (DG). We expect first movers to share more in TG than in DG. Specifically, we test whether the differences in mean transfers transport to different samples, modes, and settings. Substituting TG for DG varies a bundle of aspects (e.g., parametric vs. interactive decision situation, endowment for one vs. two players), and hence, our variation does not permit isolation of a narrow causal effect that is more typical of the sociological literature using experiments. Still, the within-subject comparison across games provides an estimate of the “treatment” effect of changing from one decision situation to another. Our focus on this interplay extends prior studies of lab results’ transportability, as it allows us to investigate how qualitative results (the ranking of transfers across games) and behavioral differences between laboratory conditions (the within-subject differences in transfers across games) generalize to other units, observations, and settings.
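The distinction between a qualitative result (the direction of the difference between games) and a point estimate (its size) can be made concrete with a short Python sketch. The transfer values below are made-up toy numbers, not data from the studies, and the sketch assumes that each subject contributes one first-mover transfer (in percent of the endowment) in both games.

```python
from statistics import mean
from typing import Sequence


def within_subject_effect(dg: Sequence[float], tg: Sequence[float]) -> float:
    """Mean of the per-subject differences TG minus DG (percentage points)."""
    if len(dg) != len(tg):
        raise ValueError("each subject needs one DG and one TG transfer")
    return mean(t - d for d, t in zip(dg, tg))


def same_direction(effect_a: float, effect_b: float) -> bool:
    """True if both effects have the same sign (the qualitative result)."""
    return (effect_a > 0) == (effect_b > 0)


# Hypothetical transfers for two samples (toy values for illustration only)
sample_a_dg, sample_a_tg = [40, 30, 50], [50, 50, 60]
sample_b_dg, sample_b_tg = [20, 30, 10], [30, 30, 20]

effect_a = within_subject_effect(sample_a_dg, sample_a_tg)
effect_b = within_subject_effect(sample_b_dg, sample_b_tg)

print(same_direction(effect_a, effect_b))  # True: ranking of games agrees
print(effect_a > effect_b)                 # True: magnitudes differ
```

In this toy case the ranking of games transports across samples while the size of the difference does not, which is the pattern the two notions of generalizability are meant to separate.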

Design

Table 1 summarizes the different study designs (see Online Appendix A1 for sample descriptives). We first describe the sampling of participants. Our procedures then include randomization, instructions, incentives, collection of survey data, and payoff.

Table 1. Study Details.

| Study | Design | Location | Participants | Data Collection | $N$ | # Prior Experiments | % Without Experience | Endowment: Dictator Game | Endowment: Trust Game |
|---|---|---|---|---|---|---|---|---|---|
| 1. Pool generalizability | Parallel lab sessions with newly recruited participant pools | Leipzig, Munich | Local students | April 21 to June 7, 2016 | 362 (Leipzig), 351 (Munich) | 1.6 | 70.4 | €10.0 | €5.0 |
| 2. Mode generalizability | Parallel online sessions with newly recruited participant pools | Leipzig, Munich | Local students | April 21 to June 7, 2016 | 122 (Leipzig), 115 (Munich) | 1.0 | 78.5 | €10.0 | €5.0 |
| 3. Sample generalizability | Nationwide online experiment | Germany | Representative of German-born population regarding gender, age (18–69), region | June 10 to 27, 2016 | 1,223 | 1.2 | 84.2 | €5.0 | €2.5 |
| 4. Context generalizability | Replication in an online labor market | MTurk | MTurk workers from the United States and India | March 4 to June 3, 2017 | 491 | 65.5 | 49.5 | $2.0 | $1.0 |

Note: # Prior experiments refers to the average number of incentivized experiments subjects had taken prior to participation. Studies 1 and 2, at each location, draw on the same local student pool. We used the web-based software hroot (Bock, Baetge, and Nicklisch 2014) to randomize invitations to lab and online sessions. Endowments refer to euros (studies 1–3) or U.S. dollars (study 4).

Sampling

For studies 1 and 2, we established two student-subject pools at universities in Leipzig and Munich. We standardized recruiting across both locations, advertising sign-up in introductory lectures, in campus cafeterias, and on university websites. From each pool, we randomly selected registered students to participate in a given lab or online session synchronized across locations. In study 3, we examine a cross section of the German population sampled from Forsa’s offline-recruited online access panel. Forsa uses county-level random digit dialing to register participants who privately use the Internet at least once a week. Our sample is representative of the German-born population with regard to gender, age, and administrative district and highly heterogeneous with regard to education, occupation, and income. In study 4, we replicate our setup at MTurk, recruiting workers from the United States and from India. These two countries make up the largest shares of platform participants (Ipeirotis 2018). For each country, we advertised participation twice per day (early morning and late afternoon local time).

Randomization

Each subject participated in DG and TG. We randomized participants to sequences of games and first- and second-mover roles. The absence of feedback in-between games secured independence of sequential behavior, enabling within-subject comparison of decision situations. To neutralize reputation effects, we randomly matched participants to another anonymous participant for each decision.⁵
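The randomization steps described here can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the procedure actually implemented in the decision interface: it assumes a balanced split into first and second movers, a role that is kept across both games, and a fresh random partner draw for each game. All names (`assign_roles_and_orders`, `match_partners`) are hypothetical.

```python
import random
from typing import Dict, Hashable, Iterable, List, Tuple

GAMES = ("DG", "TG")


def assign_roles_and_orders(ids: Iterable[Hashable],
                            rng: random.Random) -> Dict[Hashable, dict]:
    """Randomly assign each participant a mover role and a game order."""
    shuffled = list(ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2  # assumed balanced split into roles
    plan = {}
    for position, pid in enumerate(shuffled):
        order = list(GAMES)
        rng.shuffle(order)  # DG first or TG first, each equally likely
        plan[pid] = {
            "role": "first" if position < half else "second",
            "order": tuple(order),
        }
    return plan


def match_partners(plan: Dict[Hashable, dict],
                   rng: random.Random) -> List[Tuple[Hashable, Hashable]]:
    """Pair every first mover with a randomly drawn second mover."""
    firsts = [pid for pid, a in plan.items() if a["role"] == "first"]
    seconds = [pid for pid, a in plan.items() if a["role"] == "second"]
    rng.shuffle(seconds)
    # With an odd number of participants, one second mover stays unmatched.
    return list(zip(firsts, seconds))


rng = random.Random(2016)  # fixed seed only to make the sketch reproducible
plan = assign_roles_and_orders(range(8), rng)
# A new partner draw per game, so pairings are not carried over between decisions.
matches = {game: match_partners(plan, rng) for game in GAMES}
```

Drawing partners separately for each game mirrors the stated aim of neutralizing reputation effects, since no participant can expect to meet the same counterpart twice.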

Instructions

We standardized the decision interface in our four studies using a web-browser implementation based on the package SoSci Survey (www.soscisurvey.de). Our instructions map participants’ choices to payoffs as clearly as possible using GIF-animated examples but avoiding suggestion of specific strategies or frames. We only allowed individual transfers in each game to be multiples of 10 percent of the endowment (including 0 percent). In studies 1–3, we used instructions in German; study 4 uses similar instructions in English (see Online Appendix A6). We monitored understanding using control questions following each decision.

Incentives

In studies 1 and 2, we incentivized DG with €10; in TG, each player received an endowment of €5. Rather than keeping stakes constant across samples, we chose monetary incentives typical for the respective participant pool to counter self-selection based on monetary motivations; in studies 1 and 2, stakes also need to cover subjects’ effort to travel to the laboratory. In study 3, DG was worth €5 and each player in TG received an endowment of €2.5. In study 4, stakes were US$2 in DG and US$1 in TG. Critics may find fault with our heterogeneous stake levels, pointing to the idea that observed prosociality may decrease with stake size. Prior evidence from laboratory (e.g., Camerer and Hogarth 1999; Carpenter, Verhoogen, and Burks 2005) and online studies (e.g., Amir et al. 2012; Keuschnigg, Bader, and Bracher 2016), however, indicates that, although monetary stakes increase selfishness compared to unincentivized games, differences in positive stakes have negligible effects on laboratory results in fairness and cooperation research.

Survey Data

We requested each participant to fill out a questionnaire including items on sociodemographics, experimental experience, and—in our online studies 2–4—the physical and social surroundings during participation (see Online Appendix A1). We administered the questionnaire at the end of each session.

Payoff

To compute individual payoff, we randomly drew one of ego’s (and partner’s) decisions in the games. We made randomized rewards (Bolle 1990) common knowledge in our instructions, explaining that each decision could fully determine a participant’s reward. In studies 1 and 2, we paid participants in cash at the end of each session. Payoff included a fixed showup fee of €2.50 and additional earnings of €5.13 on average (min = 0.00, max = 15.00). In study 3, participants received payoff in the form of an Amazon voucher, complying with Forsa’s standard payment scheme. We set the showup fee to €2.00; additional earnings averaged €2.83 (min = 0.00, max = 7.50). In study 4, workers received payoffs via MTurk. As is typical for online experiments at MTurk, we chose a showup fee of US$1; additional earnings averaged US$0.85 (min = 0.00, max = 3.00).
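The logic of randomized rewards can be illustrated with a minimal Python sketch. It assumes that each decision is equally likely to be drawn and that the drawn decision's payoff is added to the fixed showup fee; the function names and the example amounts are hypothetical and serve illustration only.

```python
import random
from typing import Dict, Tuple


def draw_payment(decision_payoffs: Dict[str, float], showup_fee: float,
                 rng: random.Random) -> Tuple[str, float]:
    """Pick one decision at random and pay it on top of the fixed fee."""
    drawn = rng.choice(sorted(decision_payoffs))  # each decision equally likely
    return drawn, showup_fee + decision_payoffs[drawn]


def expected_payment(decision_payoffs: Dict[str, float], showup_fee: float) -> float:
    """Expected total payment before the draw, under equal draw probabilities."""
    return showup_fee + sum(decision_payoffs.values()) / len(decision_payoffs)


# Hypothetical payoffs a participant would earn if the given game were paid
payoffs = {"DG": 6.0, "TG": 7.5}
drawn, total = draw_payment(payoffs, showup_fee=2.5, rng=random.Random(0))
print(expected_payment(payoffs, showup_fee=2.5))  # 9.25
```

Because only one decision is paid and participants do not know which, every decision carries the full stake, which is the property the instructions are said to communicate.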

We introduced additional manipulations to identify both anonymity effects in laboratory data collection and a potential mode effect between laboratory and online data collection. We randomized each experimental treatment on the session level.

Anonymity

Each lab participant in Leipzig and Munich was presented with one of three anonymity conditions (see Online Appendix A7 for photographic documentation). (1) Low anonymity: In this control condition, workplaces had no shielding and participants could see one another while taking decisions ($N = 115$ in Leipzig and 113 in Munich). After completion, the experimenter called each participant by her seat number to receive payoff individually at the experimenter’s desk. (2) Standard anonymity: Blinds shielded each workplace to create inter-subject anonymity ($N = 116$ in Leipzig and 122 in Munich). After completion, we followed the above payoff procedure. This setup is typical for most laboratory implementations of social experiments. (3) High anonymity: We also placed the experimenter behind a blind to prevent visual contact throughout data collection ($N = 131$ in Leipzig and 116 in Munich). After completion, the experimenter called participants by their seat numbers and each subject received payoff individually in a designated payment room outside the lab from a person who did not appear as an experimenter in the process of the experiment. This person sat behind a closed door with a mail slot through which each participant handed over her seat number and received payoff in an anonymized envelope. This setup creates anonymity toward both other participants and the experimenter. We made the respective anonymity condition common knowledge upon arrival. For treatment validity, we provided a detailed description of the relevant scheme in our opening instructions ensuring complete understanding of the setup.

Modes

We randomized student subjects in Leipzig and Munich to participate in either a laboratory or an online session. To avoid self-selection, we informed participants about the respective mode of data collection only after they had enrolled for a certain session. We held online sessions simultaneously with our laboratory sessions, thus neutralizing the “mode selection effect” and isolating the “mode measurement effect” (Hox, de Leeuw, and Klausch 2017:511). To homogenize participation effort, online participants ($N = 122$ in Leipzig and 115 in Munich) had to collect their payoff in cash within one week after completion from the respective university’s laboratory, where we followed the high-anonymity payment scheme outlined above (about which online participants knew upon entering the experiment). Apart from identifying mode effects, this treatment permits a rigorous isolation of pool effects: Online implementation absorbs potential effects from both local experimenter characteristics and the two laboratories’ overall physical appearance.

Results

Figure 1 summarizes dictators’ and trustors’ average transfers (as percentages of their individual endowments) across studies. Pooled across samples, modes, and research contexts, the mean allocation in DG is 42.2 percent. Changing from DG to TG increases average transfers by 10.4 percentage points to 52.6 percent.⁶

Figure 1.

Quantitative results. Shaded bars show unconditional means of first-mover transfers in the dictator game (DG) and the trust game (TG), respectively. Blank bars represent conditional means obtained from ordinary least squares (OLS) regressions keeping underlying sociodemographics constant. We include 95 percent confidence intervals and seven pairwise comparisons (t tests). ***p < .001, **p < .01, *p < .05. n.s. = nonsignificant.

To evaluate transportability of quantitative results, we test their sensitivity to locally recruited student-subject pools (study 1), the comparability of behavioral data collected online and, under varying anonymity conditions, in the laboratory (study 2), the generalizability of elicited behavior from student participants to the broader population (study 3), and the stability of results across settings (study 4). This entails running seven pairwise comparisons for both DG and TG as indicated in Figure 1. We report p values of two-sided t tests with robust standard errors throughout. To account for different sociodemographic compositions in our samples, we further adjust our measures by a list of participants’ background characteristics (see Online Appendix A2 for model specification). To speak of cross-sample differences in rates of observed behavior, those need to resist conditioning. Five of the seven pairwise comparisons return significant differences for DG, but only two of the seven do so for TG.⁷
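As an illustration of the kind of pairwise comparison described above, the following minimal sketch computes a two-sample t statistic that does not assume equal variances across samples (Welch's t). It is not the specification used in the studies: the regression adjustment for sociodemographics is omitted, the function name `welch_t` is hypothetical, and the transfer values are invented for illustration only.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t, df) for a two-sample comparison of means with unequal variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    se_sq = var_a / n_a + var_b / n_b  # squared standard error of the mean difference
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se_sq)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = se_sq ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical transfers (percent of endowment) in two pools; NOT the study's data.
pool_a = [50, 40, 50, 30, 60, 40, 50, 20]
pool_b = [30, 40, 20, 50, 30, 40, 10, 40]
t_stat, dof = welch_t(pool_a, pool_b)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A two-sided p value would additionally require the cumulative distribution function of the t distribution, which the Python standard library does not provide; a statistics package would be needed for that step.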

Pool Generalizability

In study 1, we find marked differences in mean DG allocations across student pools: In Leipzig, dictators allocate 42.7 percent on average; in Munich, they share only 35.5 percent of their endowment (see shaded bars in Figure 1, top panel). This gap remains after controlling for sociodemographic differences in local pool composition (blank bars; t = 4.44, p < .001). In TG, Leipzig students transfer 53.5 percent on average; in Munich, this rate is 49.3 percent. These rates are not significantly different under conditioning on sociodemographics (blank bars in Figure 1, bottom panel; t = 1.76, p = .079). Our synchronized test thus establishes pool generalizability for investment decisions. For altruistic donations, however, elicited behavior varies considerably between locations.

Mode Generalizability

In study 2, we find no evidence for a mode effect of data collection. For both games, elicited behavior is not significantly different irrespective of whether we run the study in a laboratory or online. This holds for student pools in both Leipzig (t = 0.36, p = .721 in DG; t = 1.52, p = .129 in TG) and Munich (t = 0.27, p = .785 in DG; t = 0.02, p = .987 in TG). These results further substantiate the cross-location difference found for DG in study 1: By shutting off potential experimenter effects and differences in labs’ physical appearance, the online design identifies the gap between locations as a genuine pool effect. Furthermore, because the gap resists conditioning on sociodemographics, unstable results across locations do not appear to stem from differences in observed pool composition.

Anonymity

In Figure 2, we compare online results to the three anonymity conditions participants faced in our physical labs.⁸ At each location, lab results do not differ across anonymity conditions. Setups creating anonymity toward other participants (standard anonymity; t = 0.48, p = .633 in DG; t = 1.13, p = .261 in TG) and, additionally, toward the experimenter (high anonymity; t = 0.04, p = .966 in DG; t = 0.61, p = .541 in TG) do not yield results different from those of a low-anonymity setup. Similarly important, the results from any of the anonymity conditions do not differ significantly from our online implementations at either location.⁹ Anonymity effects, it seems, are not a major concern for laboratory research.

Figure 2.

Anonymity conditions in the laboratory. Shaded bars show unconditional means of first-mover transfers in the dictator game (DG) and the trust game (TG), respectively. Blank bars represent conditional means obtained from OLS regressions keeping underlying sociodemographics constant. We include 95 percent confidence intervals. All pairwise comparisons between anonymity conditions are nonsignificant.

Sample Generalizability

In study 3, we contrast student-based results to those obtained in the broader population (Figure 1). Because we ran our nationwide study over the Internet, our online results among students provide the relevant benchmark. Even after controlling for sociodemographic differences, results for students (39.4 percent in DG; 48.4 percent in TG) do not generalize to a broader population sample, whose members on average share significantly more in both DG (47.7 percent; t = 3.72, p < .001) and TG (58.8 percent; t = 3.20, p = .001). Differences in comprehension of instructions may further complicate direct comparisons between student and nonstudent samples. To test whether difficulties in understanding drive prosocial choices in our nationwide sample, we introduced a time-pressure/time-delay treatment for nonstudent participants. We report these results in Online Appendix A3 and find no statistically significant effect of this manipulation, suggesting that difficulties in understanding do not explain higher rates of prosocial behavior among nonstudents.

Context Generalizability

In study 4, we replicate our online implementation at MTurk to test the robustness of behavioral data collection against a change in the research setting (Figure 1). On average, crowdworkers allocate 33.7 percent in DG and transfer 42.4 percent in TG. Both rates are lower than the quantitative results obtained in our university-implemented setting using volunteer student participants (t = 2.41, p = .016 in DG; t = 1.76, p = .078 in TG) and participants from the broader population (t = 9.61, p < .001 in DG; t = 8.09, p < .001 in TG). Quantitative results, already heterogeneous across student and nonstudent samples, apparently do not transport to another setting. MTurk workers use more “rational” situational logics than we find among either students or members of the wider population in Germany. We find little evidence, however, for different decision-making by experienced participants: Non-naive subjects in all studies, on average, share less in DG (but only 0.005 percentage points per prior experiment), while experience has zero effect in TG (see Online Appendix A2). Note that our results are robust to the adjustment for experience. Hence, experience cannot explain the behavioral differences between the crowdworkers and the participants in our remaining studies.

In Figure 3, we test for the stability of behavioral differences between decision situations across samples, modes, and settings. We focus on conditional means, keeping sociodemographic composition constant across studies. Shaded bars show average DG allocations. Blank bars on top represent first-mover transfers in TG. The difference between shaded and blank bars shows by how much, in each study, TG transfers exceed DG allocations. We include 95 percent confidence intervals for the difference in average transfers between DG and TG. Unlike our statistical tests above, which we based on between-subject comparison, we now use variation within subjects to test for significance of this “treatment” effect. Our main qualitative result is robust to changes in units, observations, and settings: In each study, average TG transfers exceed DG allocations substantially and significantly. The exact size of this difference, however, varies considerably. Among student participants in studies 1 and 2, differences between DG and TG are small in Leipzig (8.6 percentage points, t = 3.78, p < .001) but large in Munich (12.0 percentage points, t = 5.46, p < .001). In Munich, TG substantially increases transfers, compensating for the lower propensity to share in a situation of altruism. Using mean DG allocations as a baseline, the change to TG raises average transfers by 33.1 percent in Munich, but only by 20.2 percent in Leipzig (t = 1.96, p = .050). TG in the nationwide sample raises average sharing as measured in DG by 23.3 percent (t = 8.43, p < .001) and, at MTurk, by 25.6 percent (t = 5.16, p < .001).
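The two quantities used in this paragraph, a within-subject (paired) t statistic and the increase in transfers relative to the DG baseline, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' estimation code: the function names `paired_t` and `relative_increase` are hypothetical, the per-subject values are invented, and no adjustment for sociodemographics is made. Only the pooled figures (a DG mean of 42.2 percent and a DG-to-TG difference of 10.4 percentage points) are taken from the Results section.

```python
import math
import statistics


def paired_t(first, second):
    """Return (mean difference, t) for within-subject differences second - first."""
    if len(first) != len(second):
        raise ValueError("paired samples must have equal length")
    diffs = [b - a for a, b in zip(first, second)]
    mean_diff = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))  # standard error of the mean difference
    return mean_diff, mean_diff / se


def relative_increase(baseline, difference):
    """Express an absolute difference as a percentage of the baseline."""
    return 100.0 * difference / baseline


# Hypothetical per-subject transfers (percent of endowment); NOT the study's data.
dg = [40, 30, 50, 20, 40, 50]
tg = [50, 50, 50, 40, 50, 60]
mean_diff, t_stat = paired_t(dg, tg)

# Pooled figures reported in the text: DG mean 42.2 percent, TG higher by 10.4 points.
pooled_relative = relative_increase(42.2, 10.4)
print(f"mean difference = {mean_diff:.1f}, t = {t_stat:.2f}")
print(f"pooled relative increase = {pooled_relative:.1f} percent")
```

Because the relative increase divides by the DG baseline, the same absolute difference translates into a larger relative effect wherever DG allocations are lower.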

Figure 3.

Qualitative results. Shaded bars show conditional means of first-mover transfers in the dictator game (DG). Blank bars on top represent conditional means in the trust game (TG). 95 percent confidence intervals, here, indicate significance of the within-subject difference DG–TG (paired t tests).

Implications

In laboratory research, the benefits of artificiality—systematic variation of experimental conditions, control of confounders, and replicability—trade off with generalizability to the “real world.” We used Cronbach’s (1982) decomposition of experiments into units, treatments, observations, and settings to identify those parts of laboratory designs which undermine their external validity. Different samples, types of stimuli, measurement modes, and research contexts may violate transportability in that they produce varying rates of observed behaviors and—more worryingly for experimental research—heterogeneous treatment effects. In four studies, we assessed each dimension’s importance for establishing transportability.

We demonstrated that a common class of laboratory designs—interactive games measuring fairness, trust, and reciprocity—easily violates the transportability of percentage rates of observed behavior: First, synchronized lab implementations revealed substantial differences in elicited behavior between two locally recruited student-subject pools (study 1). This cross-location gap persists in alternative online implementations (study 2), which shut off potential experimenter effects and differences in labs’ physical appearances. One may thus speculate about regional idiosyncrasies bringing about specific patterns of behavior that jeopardize pool generalizability. Second, we find much higher rates of prosocial behavior among a broader population sample (study 3), indicating a lack of sample generalizability, as results yielded from student participants do not transport to a more representative population. Third, in a replication at MTurk (study 4), we find rates of prosocial behavior even lower than in our student samples. This clearly rejects context generalizability of quantitative results.

Even when keeping sociodemographics constant, data collected from the most frequently used participant groups, students and crowdworkers, differ significantly from those obtained from the broader population, and altruistic behavior as measured in the DG proved to be particularly sensitive to changes in units and settings. We chose stake levels typical for the respective participant pool. As a side effect, we cannot fully rule out the possibility that differences across samples and contexts may be partly due to differences in monetary incentives. Given the well-documented finding that specific sizes of positive stakes have negligible effects in interactive games of fairness, trust, and reciprocity (Johnson and Mislin 2011; Larney, Rotella, and Barclay 2019), it is highly unlikely that stake differences drive our results. In fact, we find the lowest level of prosocial behavior in the setup providing the smallest stakes (study 4), a finding that runs counter to the idea that prosociality decreases in stake sizes.

Our unstable quantitative results indicate that preference parameters (such as “prosociality”) measured in laboratory designs cannot be transported to other populations, and their use in establishing descriptive results about “human nature” is questionable. The heterogeneity in elicited behavior that we found for decision situations targeting prosociality presumably also affects laboratory studies using other types of decision situations. Hence, interpretations of marginal totals obtained in laboratory research remain descriptive and studies reporting an intervention’s consequence in absolute terms risk describing only highly local results. However, nobody in the social sciences would expect a volunteer sample of, say, student respondents to generate survey data identical to a random population sample. The local bound of descriptive results is thus not an exclusive feature of laboratory designs but mainly a sampling issue.

Against this cautionary backdrop, our results sustain an optimistic view for the external validity of theory-driven experiments focusing on the identification of causal effects. Qualitative results, in our case the finding that transfers in the TG on average exceed DG allocations, are remarkably robust across samples, measurement modes, and research contexts. The problem of unstable results reemerges, however, if experimenters estimate treatment effects by contrasting a treatment with an unstable control condition. In our studies, differences in mean transfers between DG and TG vary as altruistic decisions in DG interact with both the characteristics of the sample under consideration and the specific setting. If control conditions provide unstable measures like DG, point estimates of treatment effects may be seriously biased. Heterogeneous treatment effects then stem from an unstable control condition rather than from heterogeneous responses to the treatment itself.
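The role of the control condition can be made concrete with simple arithmetic on the unconditional means reported for study 1 (Leipzig: 42.7 percent in DG and 53.5 percent in TG; Munich: 35.5 percent in DG and 49.3 percent in TG). The sketch below is only a back-of-the-envelope illustration: it uses unconditional means, so the resulting differences do not match the conditional within-subject estimates reported with Figure 3, and the variable names are hypothetical.

```python
# Unconditional mean transfers (percent of endowment) reported for study 1.
reported = {
    "Leipzig": {"DG": 42.7, "TG": 53.5},
    "Munich": {"DG": 35.5, "TG": 49.3},
}


def effect(means):
    """DG-to-TG difference in percentage points for one pool."""
    return round(means["TG"] - means["DG"], 1)


effects = {name: effect(means) for name, means in reported.items()}

# Cross-pool gaps in the control (DG) and treatment (TG) conditions.
dg_gap = round(reported["Leipzig"]["DG"] - reported["Munich"]["DG"], 1)
tg_gap = round(reported["Leipzig"]["TG"] - reported["Munich"]["TG"], 1)

# The cross-pool gap in the estimated effect equals the DG gap minus the TG gap.
effect_gap = round(effects["Munich"] - effects["Leipzig"], 1)

print(effects)
print(f"DG gap = {dg_gap}, TG gap = {tg_gap}, effect gap = {effect_gap}")
```

In this decomposition, the gap between pools is larger in the control condition than in the treatment condition, which is why the estimated DG-to-TG difference is larger in the pool with the lower DG baseline.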

Similarly important for practitioners, we find that specific implementations of a laboratory study do not distort its results. For our physical labs, we find no evidence for anonymity effects in data collection—although decisions concerning prosocial behavior should be particularly liable to social desirability bias. If we increase participant anonymity, rates of elicited behavior do not differ significantly from conditions lacking specific anonymization measures. Our results are thus in line with Barmettler et al. (2012), who question the necessity of complicated anonymization procedures in social science laboratories. Reactivity may still drive behavior, but we find no effect within the spectrum of anonymity precautions typically used in experiments. This suggests comparability of results from laboratory studies differing in this respect. Finally, we find full support for mode generalizability. Keeping participation effort constant across parallel lab and online sessions, participants from two student-subject pools generated similar data, irrespective of participating in the lab or online. Taking into account laboratory studies’ weak generalizability to a broader, more representative population, we believe that online experiments can serve as a sorely needed complement to laboratory designs in the social sciences.

To conclude, successful and meaningful laboratory research in sociology—just as in any other empirical discipline—requires joint efforts of ongoing replication. We can only regard laboratory results as well-established facts after their successful cross-validation, ideally in studies using complementary samples and designs. This also holds for our results, which we hope will be replicated in future studies using alternative decision situations frequently used in social science lab research.

Supplemental Material

Supplemental Material, Appendix_Transportability - On the Transportability of Laboratory Results

Supplemental Material, Appendix_Transportability for On the Transportability of Laboratory Results by Felix Bader, Bastian Baumeister, Roger Berger and Marc Keuschnigg in Sociological Methods & Research

Footnotes

Authors’ Note

Felix Bader and Marc Keuschnigg contributed equally to this work.

Acknowledgments

We thank Peter Hedström, Karl-Dieter Opp, Merlin Schaeffer, Tobias Wolbring, and three anonymous reviewers for valuable comments. We are grateful to Hanna Nau, Leona Przechomski, Lennart Rösemeier, Fabian Thiel, Janine Thiel, and Anna Wolf for excellent research assistance and to Marion Apelt and Regina Heindl for administrative support.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project received financing through generous grants from the German Research Foundation (BE 2373/3-1 and KE 2020/2-1). Marc Keuschnigg further acknowledges funding from the European Research Council (324233), the Swedish Research Council (445-2013-7681, 340-2013-5460), and Riksbankens Jubileumsfond (M12-0301:1).

ORCID iDs

Felix Bader

Marc Keuschnigg

Supplemental Material

Supplemental material for this article is available online.

Notes

References

Alevy, Jonathan E., Michael S. Haigh, and John A. List. 2007. “Information Cascades: Evidence from a Field Experiment with Financial Market Professionals.” Journal of Finance 62:151–80.

Amir, Ofra, David G. Rand, and Ya’akov Kobi Gal. 2012. “Economic Games on the Internet: The Effect of $1 Stakes.” PLoS One 7:e31461.

Anderson, Jon, Stephen V. Burks, Jeffrey Carpenter, Lorenz Götte, Karsten Maurer, Daniele Nosenzo, Ruth Potter, Kim Rocha, and Aldo Rustichini. 2013. “Self-selection and Variations in the Laboratory Measurement of Other-regarding Preferences across Subject Pools: Evidence from One College Student and Two Adult Samples.” Experimental Economics 16:170–89.

Barmettler, Franziska, Ernst Fehr, and Christian Zehnder. 2012. “Big Experimenter Is Watching You! Anonymity and Prosocial Behavior in the Laboratory.” Games and Economic Behavior 75:17–34.

Bellemare, Charles and Sabine Kröger. 2007. “On Representative Social Capital.” European Economic Review 51:183–202.

Belot, Michele, Raymond Duch, and Luis Miller. 2015. “A Comprehensive Comparison of Students and Non-students in Classic Experimental Games.” Journal of Economic Behavior & Organization 113:26–33.

Benz, Matthias and Stephan Meier. 2008. “Do People Behave in Experiments as in the Field? Evidence from Donations.” Experimental Economics 11:268–81.

Beramendi, Pablo, Raymond M. Duch, and Akitaka Matsuo. 2016. “Comparing Modes and Samples in Experiments: When Lab Subjects Meet Real People.” SSRN Research Paper 2840403.

Berg, Joyce E., John Dickhaut, and Kevin McCabe. 1995. “Trust, Reciprocity, and Social History.” Games and Economic Behavior 10:122–42.

Berinsky, Adam J., Gregory A. Huber, and Gabriel S. Lenz. 2012. “Evaluating Online Labor Markets for Experimental Research: Amazon.com’s Mechanical Turk.” Political Analysis 20:351–68.

Bicchieri, Cristina. 2006. The Grammar of Society: The Nature and Dynamics of Social Norms. New York: Cambridge University Press.

Bock, Olaf, Ingmar Baetge, and Andreas Nicklisch. 2014. “hroot: Hamburg Registration and Organization Online Tool.” European Economic Review 71:117–20.

Bolle, Friedel. 1990. “High Reward Experiments without High Expenditure for the Experimenter?” Journal of Economic Psychology 11:157–67.

Brandts, Jordi, Tatsuyoshi Saijo, and Arthur Schram. 2004. “How Universal Is Behavior? A Four Country Comparison of Spite and Cooperation in Voluntary Contribution Mechanisms.” Public Choice 119:381–424.

Camerer, Colin F. 2003. Behavioral Game Theory: Experiments in Strategic Interaction. New York: Sage.

Camerer, Colin F. and Ernst Fehr. 2004. “Measuring Social Norms and Preferences Using Experimental Games: A Guide for Social Scientists.” Pp. 55–95 in Foundations of Human Sociality, edited by Henrich, Boyd, Bowles, Camerer, Fehr, and Gintis. Oxford, UK: Oxford University Press.

Camerer, Colin F. and Robin M. Hogarth. 1999. “The Effects of Financial Incentives in Experiments: A Review and Capital-labor-production Framework.” Journal of Risk and Uncertainty 19:7–42.

Campbell, Donald T. and Julian C. Stanley. 1963. Experimental and Quasi-experimental Designs for Research. Chicago, IL: Rand McNally.

Cappelen, Alexander W., Knut Nygaard, Erik O. Sorensen, and Bertil Tungodden. 2015. “Social Preferences in the Lab: A Comparison of Students and a Representative Population.” Scandinavian Journal of Economics 117:1306–26.

Carpenter, Jeffrey, Eric Verhoogen, and Stephen Burks. 2005. “The Effect of Stakes in Distribution Experiments.” Economics Letters 86:393–98.

Chandler, Jesse, Pam Mueller, and Gabriele Paolacci. 2014. “Nonnaïveté among Amazon Mechanical Turk Workers: Consequences and Solutions for Behavioral Researchers.” Behavior Research Methods 46:112–30.

Chang, Andrew C. and Phillip Li. 2015. “Is Economics Research Replicable? Sixty Published Papers From Thirteen Journals Say ‘Usually Not.’” Finance and Economics Discussion Series 2015-083. Washington: Board of Governors of the Federal Reserve System. https://doi.org/10.17016/FEDS.2015.083.

Clifford, Scott and Jennifer Jerit. 2014. “Is There a Cost to Convenience? An Experimental Comparison of Data Quality in Laboratory and Online Studies.” Journal of Experimental Political Science 1:120–31.

Cooper, David J. and John H. Kagel. 2016. “Other-regarding Preferences: A Selective Survey of Experimental Results.” Pp. 217–89 in The Handbook of Experimental Economics, Vol. 2, edited by Kagel and Roth. Princeton, NJ: Princeton University Press.

Coppock, Alexander and Donald P. Green. 2015. “Assessing the Correspondence between Experimental Results Obtained in the Lab and Field: A Review of Recent Social Science Research.” Political Science Research and Methods 3:113–31.

Cronbach, Lee J. 1982. Designing Evaluations of Educational and Social Programs. San Francisco, CA: Jossey-Bass.

Crump, Matthew J. C., John V. McDonnell, and Todd M. Gureckis. 2013. “Evaluating Amazon’s Mechanical Turk as a Tool for Experimental Behavioral Research.” PLoS One 8:e57410.

Druckman, James N. and Cindy D. Kam. 2011. “Students as Experimental Participants.” Pp. 41–57 in The Cambridge Handbook of Experimental Political Science, edited by Druckman, Green, Kuklinski, and Lupia. Cambridge, MA: Cambridge University Press.

Dunning, Thad. 2012. Natural Experiments in the Social Sciences: A Design-based Approach. Cambridge, MA: Cambridge University Press.

Elster, Jon. 2007. Explaining Social Behavior: More Nuts and Bolts for the Social Sciences. Cambridge, MA: Cambridge University Press.

Engel, Christoph. 2011. “Dictator Games: A Meta Study.” Experimental Economics 14:583–610.

Englmaier, Florian and Georg Gebhardt. 2016. “Social Dilemmas in the Laboratory and in the Field.” Journal of Economic Behavior & Organization 128:85–96.

Falk, Armin and James Heckman. 2009. “Lab Experiments Are a Major Source of Knowledge in the Social Sciences.” Science 326:535–38.

Falk, Armin, Stephan Meier, and Christian Zehnder. 2013. “Do Lab Experiments Misrepresent Social Preferences? The Case of Self-selected Student Samples.” Journal of the European Economic Association 11:839–52.

Fehr, Ernst, Urs Fischbacher, Bernhard von Rosenbladt, Jürgen Schupp, and Gert G. Wagner. 2002. “A Nation-wide Laboratory: Examining Trust and Trustworthiness by Integrating Behavioral Experiments into Representative Surveys.” Schmollers Jahrbuch 122:519–42.

Fehr, Ernst and Herbert Gintis. 2007. “Human Motivation and Social Cooperation: Experimental and Analytical Foundations.” Annual Review of Sociology 33:43–64.

Fehr, Ernst and John A. List. 2004. “The Hidden Costs and Returns of Incentives: Trust and Trustworthiness among CEOs.” Journal of the European Economic Association 2:743–71.

Franzen, Axel and Sonja Pointner. 2012. “Anonymity in the Dictator Game Revisited.” Journal of Economic Behavior & Organization 81:74–81.

Franzen, Axel and Sonja Pointner. 2013. “The External Validity of Giving in the Dictator Game: A Field Experiment Using the Misdirected Letter Technique.” Experimental Economics 16:155–69.

Fréchette, Guillaume R. 2015. “Laboratory Experiments: Professionals versus Students.” Pp. 360–90 in The Handbook of Experimental Economic Methodology, edited by Fréchette and Schotter. New York: Oxford University Press.

Fréchette, Guillaume R. 2016. “Experimental Economics across Subject Populations.” Pp. 435–80 in The Handbook of Experimental Economics, Vol. 2, edited by Kagel and Roth. Princeton, NJ: Princeton University Press.

Fréchette, Guillaume R. and Andrew Schotter. 2015. The Handbook of Experimental Economic Methodology. New York: Oxford University Press.

Freese, Jeremy. 2007. “Replication Standards for Quantitative Social Science: Why Not Sociology?” Sociological Methods and Research 36:153–72.

Galizzi, Matteo M. and Daniel Navarro-Martínez. 2018. “On the External Validity of Social-preference Games: A Systematic Lab-field Study.” Management Science, Article in Advance; 1–27.

Gerber, Alan S. and Donald P. Green. 2012. Field Experiments: Design, Analysis, and Interpretation. New York: Norton.

Glaeser, Edward L., David I. Laibson, Jose A. Scheinkman, and Christine L. Soutter. 2000. “Measuring Trust.” Quarterly Journal of Economics 115:811–46.

Gosling, Samuel D., Carson J. Sandy, Oliver P. John, and Jeff Potter. 2010. “Wired But Not WEIRD: The Promise of the Internet in Reaching More Diverse Samples.” Behavioral and Brain Sciences 33:34–35.

Harrison, Glenn W. and John A. List. 2004. “Field Experiments.” Journal of Economic Literature 42:1009–55.

Henrich, Joseph, Richard Boyd, Samuel Bowles, Colin F. Camerer, Ernst Fehr, Herbert Gintis, and Richard McElreath. 2001. “In Search of Homo Economicus: Behavioral Experiments in 15 Small-scale Societies.” American Economic Review 91:73–78.

Henrich, Joseph, Steven J. Heine, and Ara Norenzayan. 2010. “The Weirdest People in the World?” Behavioral and Brain Sciences 33:1–23.

Hergueux, Jérôme and Nicolas Jacquemet. 2015. “Social Preferences in the Online Laboratory: A Randomized Experiment.” Experimental Economics 18:251–83.

Hoffman, Elizabeth, Kevin A. McCabe, and Vernon L. Smith. 1996. “Social Distance and Other-regarding Behavior in Dictator Games.” American Economic Review 86:653–60.

Horton, John J., David G. Rand, and Richard J. Zeckhauser. 2011. “The Online Laboratory: Conducting Experiments in a Real Labor Market.” Experimental Economics 14:399–425.

Hox, Joop, Edith de Leeuw, and Thomas Klausch. 2017. “Mixed-mode Research: Issues in Design and Analysis.” Pp. 511–30 in Total Survey Error in Practice, edited by P. P. Biemer, de Leeuw, Eckman, Edwards, Kreuter, L. E. Lyberg, N. C. Tucker, and B. T. West. Hoboken, NJ: Wiley.

Ipeirotis, Panagiotis G. 2018. MTurk Tracker. Retrieved October 8, 2018 from http://demographics.mturk-tracker.com/#/countries/all.

Jackson, Michelle and David R. Cox. 2013. “The Principles of Experimental Design and Their Application in Sociology.” Annual Review of Sociology 39:27–49.

Johnson, Noel D. and Alexandra A. Mislin. 2011. “Trust Games: A Meta-analysis.” Journal of Economic Psychology 32:865–89.

Kahneman, Daniel, Jack L. Knetsch, and Richard H. Thaler. 1986. “Fairness and the Assumptions of Economics.” Journal of Business 59:285–300.

Kessler, Judd and Lise Vesterlund. 2013. “The External Validity of Laboratory Experiments: The Misleading Emphasis on Quantitative Effects.” Pp. 391–406 in The Handbook of Experimental Economic Methodology, edited by G. R. Fréchette and Schotter. New York: Oxford University Press.

Keuschnigg, Marc, Felix Bader, and Johannes Bracher. 2016. “Using Crowdsourced Online Experiments to Study Context-dependency of Behavior.” Social Science Research 59:68–82.

Kocher, Martin, Todd Cherry, Stephan Kroll, Robert J. Netzer, and Matthias Sutter. 2008. “Conditional Cooperation on Three Continents.” Economics Letters 101:175–78.

Larney, Andrea, Amanda Rotella, and Pat Barclay. 2019. “Stake Size Effects in Ultimatum Game and Dictator Game Offers: A Meta-analysis.” Organizational Behavior and Human Decision Processes 151:61–72.

Levitt, Steven D. and John A. List. 2007. “What Do Laboratory Experiments Measuring Social Preferences Reveal about the Real World?” Journal of Economic Perspectives 21:153–74.

Martin, Michael W. and Jane Sell. 1979. “The Role of the Experiment in the Social Sciences.” Sociological Quarterly 20:581–90.

Morgan, Stephen L. and Christopher Winship. 2015. Counterfactuals and Causal Inference: Methods and Principles for Social Research. 2nd Ed. New York: Cambridge University Press.

Open Science Collaboration. 2015. “Estimating the Reproducibility of Psychological Science.” Science 349:943–51.

Pearl, Judea and Elias Bareinboim. 2014. “External Validity: From Do-calculus to Transportability Across Populations.” Statistical Science 29:579–95.

Peterson, Robert A. 2001. “On the Use of College Students in Social Science Research: Insights from a Second-order Meta-analysis.” Journal of Consumer Research 28:450–61.

Potters, Jan and Frans van Winden. 2000. “Professionals and Students in a Lobbying Experiment: Professional Rules of Conduct and Subject Surrogacy.” Journal of Economic Behavior & Organization 43:499–522.

Rand, David G. 2012. “The Promise of Mechanical Turk: How Online Labor Markets Can Help Theorists Run Behavioral Experiments.” Journal of Theoretical Biology 299:172–79.

Rand, David G., Alexander Peysakhovich, Gordon T. Kraft-Todd, George E. Newman, Owen Wurzbacher, Martin A. Nowak, and Joshua D. Greene. 2014. “Social Heuristics Shape Intuitive Cooperation.” Nature Communications 5:3677.

Rauhut, Heiko and Fabian Winter. 2010. “A Sociological Perspective on Measuring Social Norms by Means of Strategy Method Experiments.” Social Science Research 39:1181–94.

Reips, Ulf-Dietrich. 2002. “Standards for Internet-based Experimenting.” Experimental Psychology 49:243–56.

Roth, Alvin E., Vesna Prasnikar, Masahiro Okuno-Fujiwara, and Shmuel Zamir. 1991. “Bargaining and Market Behavior in Jerusalem, Ljubljana, Pittsburgh, and Tokyo: An Experimental Study.” American Economic Review 81:1068–95.

Rubin, Donald B. 2008. “For Objective Causal Inference, Design Trumps Analysis.” Annals of Applied Statistics 2:808–40.

Schram, Arthur. 2005. “Artificiality: The Tension between Internal and External Validity in Economic Experiments.” Journal of Economic Methodology 12:225–37.

Shadish, William R., Thomas D. Cook, and Donald T. Campbell. 2002. Experimental and Quasi-experimental Designs for Generalized Causal Inference. Boston, MA: Houghton Mifflin.

Thye, Shane R. 2014. “Logical and Philosophical Foundations of Experimental Research in the Social Sciences.” Pp. 53–82 in Laboratory Experiments in the Social Sciences, edited by Webster Jr. and Sell. Burlington, MA: Academic Press.

Webster, Murray and Jane Sell. 2014. “Why Do Experiments?” Pp. 5–21 in Laboratory Experiments in the Social Sciences, edited by Webster Jr. and Sell. Burlington, MA: Academic Press.

Willer, David and Henry A. Walker. 2007. Building Experiments: Testing Social Theory. Stanford, CA: Stanford University Press.

Zelditch, Morris, Jr. 2014. “Laboratory Experiments in Sociology.” Pp. 183–97 in Laboratory Experiments in the Social Sciences, edited by Webster Jr. and Sell. Burlington, MA: Academic Press.
