Abstract
Morality-based interventions designed to promote academic integrity are being used by educational institutions around the world. Although many such approaches have a strong theoretical foundation and are supported by laboratory-based evidence, they often have not been subjected to rigorous empirical evaluation in real-world contexts. In a naturalistic field study (N = 296), we evaluated a recent research-inspired classroom innovation in which students are told, just prior to taking an unproctored exam, that they are trusted to act with integrity. Four university classes were assigned to a proctored exam or one of three types of unproctored exam. Students who took unproctored exams cheated significantly more, which suggests that it may be premature to implement this approach in college classrooms. These findings point to the importance of conducting ecologically valid and well-controlled field studies that translate psychological theory into practice when introducing large-scale educational reforms.
Academic dishonesty is a common problem in schools and universities around the world (Cizek, 1999). It creates an atmosphere that devalues learning, thus undermining the central mission of educational institutions. Extensive research has identified key psychological and social processes that underlie academic integrity and dishonesty (for reviews, see Anderman & Murdock, 2011, and Cizek, 1999). These findings have inspired new approaches to discouraging cheating on exams.
One such approach combines an unproctored exam with an honor code (Koch, 2000). It was pioneered in the United States and has been implemented worldwide (Covey et al., 1989). Proponents of this approach have argued that it can effectively reduce cheating by emphasizing students’ role in upholding academic integrity and by promoting a sense of trust between students and teachers (Mazar et al., 2008; McCabe et al., 2001). Unfortunately, the empirical evidence in support of this approach has been based entirely on students’ self-report data (for a review, see McCabe et al., 2001), which is not always a reliable indicator of academic cheating (e.g., Mazar & Ariely, 2006; McCabe, 1992). As noted by Cizek (1999), students at schools that use morality-based interventions may be especially hesitant to admit to cheating, even on anonymous surveys. Thus, because of a lack of rigorous empirical evaluation, little is known about whether this method can indeed reduce students’ actual cheating behavior in real-world settings. In the present field research, we addressed this important gap in the literature. This is especially relevant during the current COVID-19 pandemic, given that millions of students around the world have suddenly shifted to taking classes that are conducted exclusively online, which means that exams are frequently left unproctored (see Appiah, 2020).
The practice of giving unproctored exams was inspired by the theory of self-concept maintenance, which posits that as a consequence of socialization processes, people tend to internalize the moral value of honesty, and they are highly motivated to view themselves as honest so as to maintain a positive self-concept (e.g., Mazar et al., 2008). According to this view, people tend to compare their own behavior with internalized moral standards and strive to avoid the aversive conclusion that they have fallen short. External reminders of these standards can serve to prime ethical behavior whenever people are tempted by benefits that they think could be obtained by cheating (Grym & Liljander, 2017). The strongest evidence for this theory comes from research in behavioral economics, which has revealed that when moral principles are emphasized, individuals are less likely to cheat on laboratory tasks (e.g., when being paid to solve puzzles; Grym & Liljander, 2017; but see Kristal et al., 2020). However, it is unclear whether these findings would generalize to real-world classroom settings.
In the present work, we took a first step toward examining this issue by testing the assumption that one particular form of unproctored exam, the so-called trust exam, is an effective way to reduce cheating. The central feature of a trust exam is the trust component, in which an instructor prefaces an unproctored exam by stating that he or she trusts the students to uphold academic integrity by refraining from cheating. This approach, which was pioneered in the United States, is based on academic honor codes that have been widely used in many countries. We capitalized on a unique opportunity afforded by the recent implementation of the trust-exam approach at one particular university in China, where it is currently in use for about a quarter of the exams that are given each year.
In the present research, we assessed the effectiveness of the trust-exam approach in comparison with the traditional proctored-exam method that has been in worldwide use since its invention in ancient China in the 7th century C.E. (Miyazaki, 1981). We conducted a field experiment in which students in four different sections of the same undergraduate course took the same exam but under four different conditions. In the traditional-exam condition, the instructor followed the standard practice of staying in the room while the class took the exam. This condition served as a baseline.
Statement of Relevance
Academic cheating is a worldwide problem that is of interest to educators, policy makers, and researchers across many behavioral-science disciplines. Prior laboratory research has suggested promising techniques to nudge students not to cheat, but these techniques have yet to be tested in real-world contexts. Nevertheless, inspired by such research, many universities in the United States and around the world have implemented an educational innovation that involves the instructor telling students, just before they take an unproctored exam, that he or she trusts them to uphold academic integrity by refraining from cheating. We directly tested the effectiveness of this intervention in a real-life exam context and found that three variants of an unproctored-exam format led to significantly more cheating, compared with a traditional proctored exam. Our findings point to the importance of conducting ecologically valid and well-controlled field studies before translating psychological theory into practice, in order to foster effective evidence-based educational reforms.
In the first of three experimental conditions, the collective-punishment trust-exam condition, the instructor told the students that he trusted them not to cheat (the trust component), and then he left the room. This form of the trust-exam implementation also included a collective-punishment component, in which the students were told that the entire class would receive a score of zero on the exam if anyone was caught cheating. This combination of a trust component and a collective-punishment component is the version of the trust exam that is in use at the university where the present research was conducted (for evidence that collective punishment can promote compliance, see Gao et al., 2015, and Heckathorn, 1988).
The other two experimental conditions also included the trust component, but we varied the punishment component to isolate its effect. Specifically, in the individual-punishment trust-exam condition, the instructor told the class that any student who was caught cheating would receive a score of zero on the exam, and there was no reference to collective punishment. The no-punishment trust-exam condition included the trust component only, and there was no threat of punishment of any kind. This condition is the most similar to unproctored-exam approaches that are commonly in use in the United States and elsewhere (Cizek, 1999).
To measure cheating, we worked with the instructors to add five prohibitively difficult course-related target questions to the exam. The target questions were in a fill-in-the-blank format, and their difficulty level was established in a pilot study with a separate sample of students from the same university (for details, see the Method section). Students were able to correctly answer the target questions only if they cheated by surreptitiously accessing an online class portal to check an answer key that was posted at the outset of the exam. We treated each correct response to a target question as evidence that the student had cheated.
On the basis of the existing evidence that inspired the trust-exam intervention (Grym & Liljander, 2017; Mazar et al., 2008; McCabe & Treviño, 1993), we hypothesized that the three trust-exam conditions would lead to significantly less cheating than the traditional-exam format. Regarding the three variants of the trust-exam approach, we expected the two conditions that included a threat of punishment to be less effective in reducing cheating because they would implicitly communicate the message that students cannot be trusted, which is contrary to the spirit of the trust component (Houser et al., 2008). In contrast, we expected students in the no-punishment trust-exam condition to cheat the least of the three trust-exam conditions because this method adhered most closely to the principles that motivated the introduction of the unproctored-exam approach (Cizek, 1999).
Method
Participants
The research was approved by the research ethics committee at the Institutes of Psychological Sciences, Hangzhou Normal University, People’s Republic of China. Participants were 296 first-year undergraduate students (206 women) at a university in eastern China. They were enrolled in four different sections of the same introductory psychology course, which is a required general-education class for first-year undergraduates. The four sections were taught by four different male instructors, and each section was randomly assigned to one of the four conditions: 71 students in the traditional-exam condition (37 women), 81 in the collective-punishment trust-exam condition (64 women), 82 in the individual-punishment trust-exam condition (55 women), and 62 in the no-punishment trust-exam condition (50 women). Because this was a naturalistic study, the sample sizes were determined by the natural size of each class, and this led to an uneven number of participants across the four conditions. Because there were no previous studies that allowed us to calculate effect sizes, we included all of the students in each of the four sections. At this university, more than 90% of students were Han Chinese, most of the students came from a middle-class background, and the ages of first-year students ranged from 18 to 21 years.
Materials
A single exam was prepared for use in all four conditions. This exam was on a topic that the instructors had recently covered in class, and it was similar in form and content to the exams that the students had taken previously. It consisted of 10 fill-in-the-blank questions and two essay questions. Of the 10 fill-in-the-blank questions, five were easy questions on topics that had already been covered by the instructors. The other five target questions were prohibitively difficult, as established by a pilot study (see below), and students could answer them correctly only by surreptitiously logging onto the class portal to view the posted answer key during the exam. Any correct answers to these target questions were taken as evidence that the student had cheated on the exam.
The five target questions were selected on the basis of a pilot study with 64 additional students who were taking the same course but did not participate in this study. These students were given a set of 25 course-related questions to answer and were then asked to rate the difficulty of each question using a 7-point Likert-type scale that ranged from 1 (not at all difficult) to 7 (extremely difficult). Among these 25 questions, only those that every student answered incorrectly and that received an average difficulty rating of 6 or greater were selected for use in the current study. The complete set of target questions appears below; each question is followed by its correct answer.
The level of brain activity among people with depressive symptoms is relatively low, especially at the site of the _____ when the patient is in a resting state. (Answer: frontal lobes)
Two-thirds of people who are diagnosed with _____ are women. (Answer: generalized anxiety disorder)
The number of people who are sent to psychiatric hospitals in the United States each year is approximately _____. (Answer: 2,000,000)
Approximately _____ percent of people who suffer from mental disorders are not dangerous to others. (Answer: 90)
Gender differences in _____ are a significant contributor to gender differences in the prevalence of depression. (Answer: cognitive style)
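The selection rule used in the pilot study can be written as a simple filter. The following Python sketch is illustrative only: the function name, the data layout, and the three example questions are hypothetical and are not the pilot data, and the rule is implemented as described in the text (no correct answers and a mean difficulty rating of at least 6).

```python
from statistics import mean


def select_target_questions(pilot, min_mean_rating=6.0):
    """Return IDs of questions that no pilot student answered correctly
    and whose mean difficulty rating is at least `min_mean_rating`.

    `pilot` maps a question ID to a list of
    (answered_correctly, difficulty_rating) pairs, one per student.
    """
    selected = []
    for question_id, responses in pilot.items():
        nobody_correct = not any(correct for correct, _ in responses)
        mean_rating = mean(rating for _, rating in responses)
        if nobody_correct and mean_rating >= min_mean_rating:
            selected.append(question_id)
    return selected


# Hypothetical pilot data: three questions, three students each.
pilot_data = {
    "q1": [(False, 7), (False, 6), (False, 6)],  # all wrong, mean 6.33: kept
    "q2": [(True, 6), (False, 7), (False, 7)],   # one correct answer: dropped
    "q3": [(False, 5), (False, 6), (False, 6)],  # mean 5.67: dropped
}

print(select_target_questions(pilot_data))  # ['q1']
```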
Procedure
Students were given 15 min to take the exam. At the beginning of the exam, as is the usual practice, each instructor told the students to turn off all of their electronic devices, put away their textbooks and other learning materials, and refrain from accessing any of these items during the exam. The instructor then distributed the exam sheets and informed the students that he had uploaded the answer key to the online class portal (a QQ group, which is currently a widely used social media platform that instructors typically use to post learning materials, assignments, and test results; see https://www.imqq.com/). Specifically, the instructor said, “Given that it will be difficult to cover all the course material before the final exam, we will not discuss the answers to this exam as we usually do in class. However, I have just posted the answer key on our class portal, so when the exam is over you can download the answer key and check the answers by yourself. Remember, you are not allowed to look at the answer key during the exam; you are only allowed to look at it after the exam.”
What the instructor did next varied by condition. In the traditional-exam condition, the instructor was present during the exam and proctored it in the typical manner. No reference was made to any consequences that would occur if a student were caught cheating.
In the three trust-exam conditions, the instructor administered an unproctored exam by first announcing to the class that he trusted them to not cheat. In the collective-punishment trust-exam condition, the instructor then warned the class that there would be a collective punishment if any student were caught cheating (i.e., all students in the class would receive a zero on the test). In the individual-punishment trust-exam condition, the instructor warned the class that there would be an individual punishment if any student were caught cheating (i.e., that particular student would receive a zero on the test). Finally, in the no-punishment trust-exam condition, the instructor did not give any type of warning to the class. In the three trust-exam conditions, the instructor exited the room after giving the instructions and did not return until the end of the allotted 15-min period. After returning to the room, the instructor collected the exam sheets.
Ideally, a control condition would contain the trust reminder as well as the proctoring component. However, we chose to use a traditional-exam format without a trust reminder because it is the most commonly used exam format in Chinese universities and because to the best of our knowledge, no university in China conducts proctored exams that include a trust reminder. This choice allowed us to establish an ecologically valid control condition for comparison.
There were two dependent measures of cheating. The first, cheating occurrence, was a dichotomous measure that classified students’ behavior as either “cheating” or “not cheating” on the basis of whether any of their answers to any of the target questions matched the corresponding answer in the online answer key. For the students who cheated, there was also a count measure called cheating extent, which was the number of target questions that the student cheated on. Possible values for the cheating-extent measure ranged from 1 to 5.
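The two measures can be derived from a student's target-question answers as in the following Python sketch. It is a minimal illustration, not the scoring procedure used in the study: the function names, the question IDs, and the matching rule (case-insensitive comparison after collapsing whitespace) are assumptions. The answer key is the one listed in the Materials section. For convenience the function returns an extent of 0 for students who did not cheat, whereas the text defines cheating extent only for students who cheated.

```python
def normalize(text):
    """Lowercase and collapse whitespace (an assumed matching rule)."""
    return " ".join(text.lower().split())


def score_cheating(responses, answer_key):
    """Return (occurrence, extent) for one student.

    occurrence: 1 if any target answer matches the key, else 0.
    extent: number of target answers that match the key.
    """
    extent = sum(
        1
        for question_id, key_answer in answer_key.items()
        if normalize(responses.get(question_id, "")) == normalize(key_answer)
    )
    return (1 if extent > 0 else 0), extent


# Answer key from the Materials section; the IDs t1..t5 are hypothetical.
answer_key = {
    "t1": "frontal lobes",
    "t2": "generalized anxiety disorder",
    "t3": "2,000,000",
    "t4": "90",
    "t5": "cognitive style",
}

# Hypothetical student who matches the key on two target questions.
student = {"t1": "Frontal lobes", "t2": "", "t3": "500,000", "t4": "90", "t5": "memory"}
print(score_cheating(student, answer_key))  # (1, 2)
```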
Two research assistants independently scored each student’s answers to all questions, including the five target questions that were used to assess cheating. The research assistants were blind to the study hypotheses, between-subjects conditions, and identities of the students. The intercoder reliability was 100%.
The research assistants returned students’ final marks to the instructors without the scores for the target questions to ensure that the instructors would not be able to find out which students had cheated on the exam.
One week later, each class was given a short survey in which the students were asked, “Have you cheated on any exams in the last month?” and “Have you helped any others cheat on an exam in the last month?” The response options for each of these questions were “yes” and “no.” After the students completed the survey, the instructors debriefed them and explained that their responses to the target questions would not be counted toward their scores.
Results
Cheating occurrence
The results of the cheating-occurrence measure for each of the four conditions are shown in Figure 1 and Table 1. We conducted a binary logistic regression analysis on cheating occurrence (0 = no cheating, 1 = cheating). The preliminary analysis with condition, participant gender, and their interaction as the predictors revealed that the gender-by-condition interaction was not significant. Thus, a more parsimonious model was chosen by including only the main effects of gender and condition (Menard, 2001).

Fig. 1. Percentage of participants who cheated in the traditional-exam condition and in each of the three trust-exam conditions. Error bars represent 95% confidence intervals. Asterisks indicate significant differences between conditions (*p < .01, **p < .001).
Table 1. Cheating Rate and Cheating Extent in Each of the Four Conditions
Note: For each condition, the table shows the overall percentage of participants who cheated as well as the breakdown of participants who cheated on one or more of the five questions.
We found this parsimonious model to be significant, χ2(4, N = 296) = 92.57, p < .001, −2 log likelihood = 287.40, Nagelkerke R2 = .37. Inspection of the model revealed that the gender effect was significant, β = 0.67, SE = 0.34, Wald χ2(1) = 3.85, p = .0499, odds ratio (OR) = 1.96, 95% confidence interval (CI) = [1.00, 3.84]. Men were significantly more inclined to cheat than women regardless of condition (for related findings, see Whitley et al., 1999).
The main effect of condition was also significant, Wald χ2(3) = 66.40, p < .001. A priori comparisons with the traditional-exam condition as reference showed that the cheating rate in this condition (31.0%) was significantly lower than the rate in any of the three trust-exam conditions (53.1%, 91.5%, and 88.7%, respectively, for the collective-punishment trust-exam, individual-punishment trust-exam, and no-punishment trust-exam conditions)—traditional exam vs. collective-punishment trust exam: β = 1.13, SE = 0.36, Wald χ2(1) = 9.67, p < .01, OR = 3.10, 95% CI = [1.52, 6.32]; traditional exam vs. individual-punishment trust exam: β = 3.33, SE = 0.49, Wald χ2(1) = 46.87, p < .001, OR = 28.03, 95% CI = [10.80, 72.80]; traditional exam vs. no-punishment trust exam: β = 3.10, SE = 0.50, Wald χ2(1) = 38.44, p < .001, OR = 22.21, 95% CI = [8.34, 59.20]. Thus, compared with the traditional-exam condition, all three variants of the unproctored trust-exam approach significantly increased students’ tendency to cheat rather than reducing it.
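The odds ratios and confidence intervals reported above follow from the coefficients through standard Wald arithmetic: OR = exp(β), 95% CI = exp(β ± 1.96 × SE), and Wald χ2 = (β / SE)². The Python sketch below applies these formulas to the rounded values reported for the comparison between the traditional-exam and collective-punishment trust-exam conditions. The function name is hypothetical, and because the inputs are rounded to two decimals, the output approximates the reported figures (OR = 3.10, 95% CI = [1.52, 6.32], Wald χ2 = 9.67) rather than reproducing them exactly.

```python
import math


def wald_summary(beta, se, z=1.96):
    """Return (odds ratio, CI lower bound, CI upper bound, Wald chi-square)
    for a logistic-regression coefficient and its standard error."""
    odds_ratio = math.exp(beta)
    ci_low = math.exp(beta - z * se)
    ci_high = math.exp(beta + z * se)
    wald_chi2 = (beta / se) ** 2
    return odds_ratio, ci_low, ci_high, wald_chi2


# Rounded coefficient and standard error as reported in the text.
odds_ratio, ci_low, ci_high, wald_chi2 = wald_summary(1.13, 0.36)
print(f"OR = {odds_ratio:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}], "
      f"Wald chi-square = {wald_chi2:.2f}")
```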
Follow-up comparisons among the three trust-exam conditions showed that the cheating rate in the collective-punishment trust-exam condition was significantly lower than in the individual-punishment trust-exam condition, β = 2.20, SE = 0.46, Wald χ2(1) = 23.35, p < .001, OR = 9.04, 95% CI = [3.70, 22.09], and in the no-punishment trust-exam condition, β = 1.97, SE = 0.46, Wald χ2(1) = 18.22, p < .001, OR = 7.17, 95% CI = [2.90, 17.70]. However, the cheating rates in the individual-punishment and no-punishment trust-exam conditions were not significantly different from each other (p = .681). These results suggest that when the exam was unproctored, the collective-punishment trust-exam approach led to significantly less cheating than the other two variants of the trust-exam approach, which involved individual punishment or no punishment at all.
Cheating extent
The results of the cheating-extent measure for each condition are shown in Table 1. As shown in the table, the most striking contrast was between the individual-punishment and no-punishment trust-exam conditions, in which a majority of participants who cheated did so on all five questions (82.6% and 87.2%, respectively), and the traditional-exam condition, in which the majority (77.3%) of participants who cheated did so on only one question.
We then conducted regression analyses to test these observations statistically. Because the overall distribution of the cheating extent was heavily skewed to the right, we used the zero-inflated negative binomial regression model in the Generalized Linear Models module of SPSS to analyze the data. Cheating extent was the predicted variable. The preliminary analysis with condition, participant gender, and their interaction as the predictors revealed that the gender-by-condition interaction was not significant. Thus, a more parsimonious model was chosen by including only the main effects of gender and condition (Menard, 2001).
We found that this parsimonious model was significant, χ2(4) = 18.44, p = .001, suggesting that after analyses accounted for the potential zero inflation, participants’ cheating extent was still significantly predicted by the main predictors. Inspection of the model revealed that the condition effect was significant, χ2(3) = 19.22, p < .001, but the gender effect was not, χ2(1) = 0.03, p = .870.
A priori comparisons with the traditional-exam condition as reference revealed that participants cheated to a greater extent in all three unproctored-trust-exam conditions than in the traditional-exam condition (collective-punishment trust-exam condition vs. traditional-exam condition: β = 0.81, SE = 0.33, Wald χ2(1) = 5.83, p < .05, OR = 2.24, 95% CI = [1.16, 4.30]; individual-punishment trust-exam condition vs. traditional-exam condition: β = 1.22, SE = 0.31, Wald χ2(1) = 15.56, p < .001, OR = 3.39, 95% CI = [1.85, 6.21]; no-punishment trust-exam condition vs. traditional-exam condition: β = 1.25, SE = 0.32, Wald χ2(1) = 14.99, p < .001, OR = 3.48, 95% CI = [1.85, 6.53]).
Post hoc pairwise comparisons among the three trust-exam conditions showed that participants cheated to a lesser extent in the collective-punishment trust-exam condition than in the individual-punishment trust-exam condition, Wald χ2(1) = 3.90, p = .048, and to a marginally lesser extent than in the no-punishment trust-exam condition, Wald χ2(1) = 3.59, p = .058. No significant difference was found between the individual-punishment and no-punishment trust-exam conditions, Wald χ2(1) = 0.02, p = .896.
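The p values attached to these one-degree-of-freedom Wald statistics can be checked with the identity P(χ2 with 1 df > x) = erfc(√(x / 2)), which holds because a chi-square variable with one degree of freedom is a squared standard normal variable. The Python sketch below is a verification aid under that assumption, not part of the original analysis, and the function name is hypothetical. Because the reported statistics are rounded to two decimals, the recomputed p values may differ slightly from the reported ones, most visibly for very small statistics such as 0.02.

```python
import math


def chi2_df1_p_value(statistic):
    """Upper-tail p value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Wald statistics reported for the pairwise comparisons of cheating extent.
for wald in (3.90, 3.59, 0.02):
    print(f"Wald chi-square = {wald:.2f}, p = {chi2_df1_p_value(wald):.3f}")
```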
These results showed that participants in the three trust-exam conditions cheated to a greater extent than those in the traditional-exam condition. Further, the participants in the individual-punishment and no-punishment trust-exam conditions cheated to a greater extent than the participants in the collective-punishment trust-exam condition, either significantly or marginally significantly.
Self-reported cheating
Of the 296 participants who were present for the exam session, 11 were not in attendance on the day that the survey was administered. Of the remaining 285 participants, 24 did not complete the survey and thus were excluded from subsequent data analyses.
Among the remaining 261 participants, only 11 reported having cheated or having helped other people cheat in the past month. Among these participants, 2 (3.0%) were in the traditional-exam condition, 2 (2.7%) were in the collective-punishment trust-exam condition, 3 (4.1%) were in the individual-punishment trust-exam condition, and 4 (8.2%) were in the no-punishment trust-exam condition. Chi-square analysis revealed the condition effect to be nonsignificant, χ2(3) = 2.52, p = .472. Further, the rate of self-reported cheating in each of the four conditions was significantly lower than what would be expected on the basis of the actual cheating rates in these conditions, χ2(1) = 24.12 for the traditional-exam condition, χ2(1) = 74.30 for the collective-punishment trust-exam condition, χ2(1) = 713.44 for the individual-punishment trust-exam condition, and χ2(1) = 317.40 for the no-punishment trust-exam condition, all ps < .001.
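The survey sample size and the overall self-reported cheating rate follow from the counts given above, as this short Python sketch shows. All numbers are taken from the text; the variable names are hypothetical.

```python
# Counts reported in the text.
present_at_exam = 296
absent_on_survey_day = 11
incomplete_surveys = 24

# Participants whose survey responses were analyzed.
analyzed = present_at_exam - absent_on_survey_day - incomplete_surveys

# Number of participants in each condition who admitted to cheating
# or to helping others cheat.
admitted = {
    "traditional": 2,
    "collective_punishment": 2,
    "individual_punishment": 3,
    "no_punishment": 4,
}
total_admitted = sum(admitted.values())
self_report_rate = 100 * total_admitted / analyzed

print(analyzed, total_admitted, round(self_report_rate, 1))  # 261 11 4.2
```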
Discussion
In the present field study, we used a naturalistic methodology to evaluate the effectiveness of a research-inspired classroom intervention that was designed to reduce academic cheating. This intervention is called the trust exam, and it is based on the theory of self-concept maintenance, which characterizes people as wanting to view themselves as moral agents whose behavior is motivated by internalized norms and values (Mazar et al., 2008). This theory has led to widespread morality-based classroom interventions in which the instructor makes an appeal to students’ moral integrity prior to administering an unproctored exam. The present study provides the first real-world experimental evaluation of the effectiveness of the trust-exam approach for reducing students’ actual cheating behavior.
Our primary hypotheses were not supported. First, students cheated significantly more in all three variants of the trust exam than in the traditional-exam condition. The cheating rate averaged 76.9% in the three trust-exam conditions, compared with 31.0% in the traditional-exam condition. These results suggest that the trust-exam approach is not an effective alternative to the traditional proctored-exam format for reducing cheating. The results also raise questions about the effectiveness of appealing to students’ moral self-concepts to combat cheating, especially as more students take exams from their homes during the COVID-19 pandemic.
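The 76.9% figure is a pooled rate across the three trust-exam classes, weighted by class size, rather than a simple mean of the three percentages. The Python sketch below reconstructs it. The class sizes and percentages come from the text, but the per-class counts of students who cheated are an assumption: they are back-calculated by rounding each reported percentage times the class size to the nearest whole student, and the variable names are hypothetical.

```python
# Class sizes and cheating rates (in percent) as reported in the text.
class_sizes = {"traditional": 71, "collective": 81, "individual": 82, "none": 62}
reported_rates = {"traditional": 31.0, "collective": 53.1, "individual": 91.5, "none": 88.7}

# Assumption: recover whole-student counts by rounding rate x class size.
cheaters = {
    condition: round(class_sizes[condition] * reported_rates[condition] / 100)
    for condition in class_sizes
}

trust_conditions = ("collective", "individual", "none")
trust_cheaters = sum(cheaters[c] for c in trust_conditions)
trust_students = sum(class_sizes[c] for c in trust_conditions)
pooled_trust_rate = 100 * trust_cheaters / trust_students

overall_rate = 100 * sum(cheaters.values()) / sum(class_sizes.values())

print(cheaters)
print(round(pooled_trust_rate, 1), round(overall_rate, 1))  # 76.9 65.9
```

Under this assumption, the same counts also give the overall rate of 65.9% across all four conditions.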
We also hypothesized that administering a trust exam without a threat of punishment would be more effective in reducing cheating than either of the two trust-exam variants that contained a threat of punishment. However, we found that the no-punishment trust-exam condition actually produced the second highest cheating rate. We also found that the cheating rate and the cheating extent in the individual-punishment trust-exam condition were not significantly different from those in the no-punishment trust-exam condition.
The collective-punishment threat did seem to have a deterrent effect, producing a significantly lower cheating rate than the other two trust-exam conditions and a lower cheating extent (significantly lower than in the individual-punishment condition and marginally lower than in the no-punishment condition). Even so, this finding does not support the continued use of the standard trust-exam approach because it still led to a significantly higher cheating rate than the traditional-exam approach. Moreover, punishing innocent parties violates basic principles of fairness and justice (Lipnowski, 1993), and threats of collective punishment may have even less of a deterrent effect in societies that place less emphasis on the collective good (see Markus & Kitayama, 2003).
An additional finding was that self-reports of students’ cheating behavior did not correspond to their actual cheating behavior: Averaged across the four conditions, 65.9% of participants cheated on the exam, but only 4.2% admitted to having cheated in the month during which the exam sessions were conducted. This finding raises questions about the validity of using self-report measures to assess students’ real-world cheating behavior (see Bucciol & Montinari, 2019).
The present findings should not be taken to mean that there is no place for morality-based intervention in the classroom. However, they do suggest that a culture of trust and reciprocity cannot be imposed by simply telling students that they are expected to act in accordance with these values. One reason for this is that instructors’ moralistic pronouncements do not necessarily invoke morality in the mind of students. As suggested by Nucci and Powers (2014), whether students view cheating as a moral issue may depend on the nature of the course they are taking. Specifically, when the stakes are high, students may view cheating as an intelligent, self-protective act rather than an immoral one. If students take tests that are disconnected from learning, it could further decrease the likelihood that moral concerns will be elicited. To address these possibilities, researchers could conduct future studies in a way that links individual students’ cheating behavior to their self-reported beliefs about cheating, provided it could be done in an ethical way.
It should be noted that our control condition contained a proctored-exam component without a trust-reminder component, to create a practice-as-usual comparison. The disadvantage of this approach is that it involved varying two dimensions at the same time (i.e., whether the exam was proctored and whether there was a trust reminder), which did not allow us to isolate the effects of these factors. Thus, future research on this topic should include a control condition that has both a proctored-exam component and a trust-reminder component. It is possible that such a combination would lead to even less cheating than was seen in the control condition of the present study.
Additional research will also be needed to further pin down how the trust-exam approach affects academic cheating in different contexts. For example, it is possible that it would be more effective in classes that students are intrinsically interested in, compared with classes that students take merely to fulfill a requirement (Murdock & Anderman, 2006). Other factors are likely to make a difference, such as whether classes are graded on a curve or the extent to which the university has a culture that emphasizes academic integrity (McCabe & Treviño, 1993).
In summary, the present field study showed that a morality-based academic intervention rooted in the theory of self-concept maintenance actually led to a significant increase in cheating. Taken together, these findings raise questions about the effectiveness of morality-based academic-integrity interventions in classroom settings. Further, the present work speaks to broader concerns about bridging the gap between theory and practice, and it points to the importance of conducting ecologically valid and experimentally well-controlled field studies to foster effective evidence-based educational reforms.
Footnotes
Transparency
Action Editor: Paul Jose
Editor: Patricia J. Bauer
Author Contributions
L. Zhao, J. Zheng, and H. Mao are co–first authors. L. Zhao, K. Lee, J. Ye, and H. Chen developed the study concept. X. Yu, J. Ye, H. Chen, J. Zheng, and H. Mao conducted testing and data collection. J. Zheng and H. Mao analyzed and interpreted the data under the supervision of L. Zhao. K. Lee, L. Zhao, G. D. Heyman, B. J. Compton, J. Zheng, and H. Mao drafted the manuscript and provided critical revisions. All the authors approved the final manuscript for submission.
