Abstract
Phishing emails pose a serious threat to individuals and organizations. Users’ ability to identify phishing emails is critical to avoid becoming victims of these attacks. The current study examined the effectiveness of a short online phishing training program designed to help users identify phishing emails. Half of the participants were in the training group, and the other half worked on a control filler task. The training group’s sensitivity (d′) in classifying emails as legitimate or phishing increased by 1.14, whereas the control group’s sensitivity increased by only 0.48. This difference in d′ change was significant, t(38) = 2.05, p = .048. The improvement was likely due to users learning how to check reliable cues and interpret them. Despite a sizeable improvement in detecting phishing emails, the training group correctly classified only about two-thirds of phishing emails. Accordingly, a short training program appears beneficial, but a more comprehensive program would be needed to reduce vulnerability to an acceptable level.
Introduction
Phishing attacks are attempts to fraudulently obtain confidential information from users through social engineering or technical subterfuge (Anti-Phishing Working Group [APWG], 2019). In 2019, phishing attacks resulted in 114,000 victims and $57 million in losses (Internet Crime Complaint Center, 2019), with phishing emails being the most common vector (Parsons et al., 2015). The goal of such attacks is to manipulate users into divulging confidential information by tricking them into believing they are dealing with a legitimate organization or individual when they are not (APWG, 2019; Krombholz et al., 2015; Orgill et al., 2004; Xiong et al., 2017). In a typical phishing attack, a user follows a link in an email message to a site that looks legitimate but is actually a copy designed to trick the user into entering their login credentials so that the attacker can steal them. Phishing can also involve technical subterfuge, such as planting malicious software that steals confidential information when the user opens an attachment. These kinds of attacks are on the rise (APWG, 2019) and pose a serious threat to both individuals and organizations because they can result in many adverse consequences, including financial loss and identity theft (Xiong et al., 2017).
Although some automated tools exist, such as anti-phishing software that detects fraudulent login interfaces (Cao et al., 2008) and machine learning algorithms that detect phishing emails (Fette et al., 2007), these tools cannot protect users against all phishing attacks (Proctor & Chen, 2015; Xiong et al., 2017). Ultimately, the decision of whether an email is safe lies with the user rather than with automated phishing detection tools (Proctor & Chen, 2015). Therefore, users must be able to identify phishing emails to avoid becoming victims of phishing attacks.
Emails can contain cues that indicate whether an email is safe or unsafe, and some cues are more reliable than others. Users who base their judgments of an email’s authenticity on more reliable cues should be better able to distinguish legitimate emails from phishing emails, because reliable cues reflect whether an email is fraudulent more consistently than less reliable cues do. Phishing emails often reveal that they are fraudulent through spelling or grammar errors, a poorly copied logo, or an unprofessional overall layout. However, an email with perfect spelling and grammar is not necessarily safe, just as an email with spelling or grammar errors is not necessarily fraudulent. A more reliable cue is whether the domain name of the sender’s email address appears normal and refers to the correct domain. For example, an email from johndoe@gnail.com comes from the domain gnail.com, which imitates the legitimate, well-known domain gmail.com. This is strongly indicative of a fraudulent email. However, the sender’s email address is not a completely reliable cue, because an email address can be spoofed. Therefore, even if the sender’s email address appears normal and unaltered, it could have been spoofed. A second way in which an email with a correct address could be fraudulent is that the sender’s computer has been hacked and sends phishing emails to the sender’s contacts, including the user, that appear to come from the sender. An early example of this propagation method is the ILOVEYOU virus (Broadhurst, 2017), although it spread malware rather than phishing messages. In brief, an email address that appears altered is a reliable cue that the email is fraudulent, but an email address that appears unaltered and normal does not reliably indicate whether the email is legitimate or fraudulent.
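The asymmetry described above (an altered domain is a strong cue, an unaltered one is inconclusive) can be illustrated with a minimal Python sketch. The function name `classify_sender`, the use of `difflib` string similarity, and the 0.8 threshold are illustrative assumptions, not part of the study or of the Jigsaw quiz.

```python
from difflib import SequenceMatcher


def classify_sender(address: str, expected_domain: str, threshold: float = 0.8) -> str:
    """Hypothetical heuristic for the sender-domain cue.

    A lookalike domain is a strong phishing cue, but a matching domain is
    NOT proof of legitimacy: the address may be spoofed, or the sender's
    account may be compromised.
    """
    domain = address.rsplit("@", 1)[-1].strip().lower()
    expected = expected_domain.strip().lower()
    if domain == expected:
        return "match (inconclusive)"
    # Character-level similarity; the threshold is an arbitrary assumption.
    similarity = SequenceMatcher(None, domain, expected).ratio()
    if similarity >= threshold:
        return "lookalike (likely phishing)"
    return "different domain"


print(classify_sender("johndoe@gnail.com", "gmail.com"))  # lookalike (likely phishing)
print(classify_sender("johndoe@gmail.com", "gmail.com"))  # match (inconclusive)
```

Returning "inconclusive" rather than "safe" for a match mirrors the point that an unaltered address does not establish legitimacy.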
A cue that reliably indicates whether a link in an email is legitimate is the domain name within the uniform resource locator (URL; Xiong et al., 2017). For example, a safe email may contain a link to the URL https://www.capitalone.com/, whose domain is capitalone.com, which may be the website of the user’s bank. However, an unsafe email may contain a link to the URL https://www.capital-one.com/. Although this domain may appear safe, it masquerades as capitalone.com and is therefore unsafe. An unsafe email may also display the correct URL as its link text while actually linking to a fraudulent site.
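The domain check described above can be sketched with Python's standard `urllib.parse.urlsplit`. The helper `link_points_to` is hypothetical, and its suffix comparison is a simplifying assumption: real tools typically consult the Public Suffix List to determine the registrable domain.

```python
from urllib.parse import urlsplit


def link_points_to(href: str, expected_domain: str) -> bool:
    """Return True if the link's host is expected_domain or one of its subdomains.

    Hypothetical helper; a plain suffix check stands in for a proper
    Public Suffix List lookup.
    """
    host = (urlsplit(href).hostname or "").lower().rstrip(".")
    expected = expected_domain.lower()
    return host == expected or host.endswith("." + expected)


print(link_points_to("https://www.capitalone.com/", "capitalone.com"))   # True
print(link_points_to("https://www.capital-one.com/", "capitalone.com"))  # False
```

Because the check runs on the link's actual target (its `href`) rather than its visible text, it also covers the case in which the link text shows the correct URL but the link points elsewhere.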
Xiong et al. (2017) examined whether drawing participants’ attention to reliable cues affects their judgments of the safety (i.e., legitimacy) of a web page. To make reliable cues more salient, participants were explicitly instructed to look at the address bar of the webpage, and the domain name in the URL was highlighted. Xiong et al. found that most users did not know how to use the domain name of the URL to identify phishing web pages; simple instructions and highlighting techniques were not sufficient to enable participants to identify phishing. For this reason, Xiong et al. (2017) suggested that training users to use the domain name (a reliable cue) to identify phishing is essential.
PhishGuru (Kumaraguru et al., 2007a, 2007b) and Anti-Phishing Phil (Sheng et al., 2007) are two notable anti-phishing training tools. In PhishGuru, participants received one (Kumaraguru et al., 2007b) or two (Kumaraguru et al., 2007a) training emails that emulated phishing emails and provided feedback in the form of a comic strip when participants clicked on a link. Specifically, participants were shown a comic strip in which a potential victim is about to fall for a phishing attack, but the PhishGuru character stops the potential victim and advises them to follow specific advice when reading emails: (a) never click on links in email and instead navigate to websites on their own with their browser, (b) call the company, (c) never give out personal information when requested by email, and (d) be on the lookout for suspicious websites. PhishGuru was shown to be an effective method for stopping users from clicking on links in subsequent phishing emails (Kumaraguru et al., 2007a) for at least one week after training (Kumaraguru et al., 2007b). Anti-Phishing Phil (Sheng et al., 2007) engages participants in an online game in which they evaluate URLs to determine whether they are real, and it provides advice on how to distinguish legitimate and fraudulent URLs. After 15 minutes of gameplay, participants were better able to identify fraudulent website URLs than were participants who did not play the game. Even though the training portions of both PhishGuru (one or two training emails) and Anti-Phishing Phil (15 minutes of gameplay) were short, both were considered effective (Kumaraguru et al., 2007a, 2007b; Sheng et al., 2007).
The efficacy of these trainings was, in part, attributed to the application of learning principles such as practicing the task to be learned (i.e., learning-by-doing), receiving immediate feedback, and receiving conceptual knowledge (e.g., mental representation) interspersed with procedural knowledge (e.g., steps to achieve a goal) (Kumaraguru et al., 2010).
An alternative format for an anti-phishing training tool is an online phishing quiz, which Kumaraguru et al. (2010) suggested could be effective. Perrault (2018) found that taking an online phishing quiz increased participants’ perception of the threat phishing emails pose, perception of their own susceptibility, intention to learn more about phishing, and intention to discuss phishing with others. These effects were attributed to participants being initially overconfident in their ability; when they were shown their score on the quiz (M = 64%), they became less confident. Then, after participants read about which cues indicated that the emails were phishing, they regained confidence in their ability to recognize phishing emails. However, Perrault did not measure phishing email identification accuracy before and after the quiz, so it is unclear whether this newfound confidence was justified by increased ability. Similarly, Werner and Courte (2010) found that participants felt more prepared to recognize phishing emails after taking an online phishing quiz, but they did not measure identification accuracy either. Although prior research has shown that online phishing quizzes change participants’ attitudes and behavioral intentions, no prior research has examined whether taking an online phishing quiz actually improves the ability to identify phishing emails.
One of the most promising online phishing quizzes is a short, 8-question quiz by Jigsaw (n.d.). This quiz applies many of the learning science principles used in the PhishGuru and Anti-Phishing Phil training tools (Kumaraguru et al., 2010), including learning-by-doing, providing immediate feedback, and interspersing conceptual and procedural knowledge. Additionally, the Jigsaw quiz focuses on training users to use reliable cues, as suggested by Xiong et al. (2017). The Jigsaw quiz differs from these prior anti-phishing training tools in that it is an online phishing quiz and it teaches users how to determine whether a link leads to a phishing website. It does so by conveying procedural knowledge, such as moving the cursor over a link to preview the URL, rather than simply instructing users to avoid clicking links in all emails. This procedural knowledge is especially critical for phishing training because prior research shows that 80% of users are not familiar with how to preview the URL of a link (Kumaraguru et al., 2007a). Despite this, neither PhishGuru nor Anti-Phishing Phil teaches the procedural knowledge needed to preview a linked URL. The current study tested the effectiveness of the Jigsaw online phishing quiz for training users to identify phishing emails. In particular, we were interested in whether, and to what degree, training users to check and interpret reliable cues would improve their subsequent ability to identify phishing emails.
Method
Participants
Forty students (24 female) from Rice University participated for partial course credit, which was a requirement for their introductory psychology course. Ages ranged from 18 to 23 years (M = 19.25, SD = 1.17). The vast majority (n = 37) reported using Gmail to send and receive emails, the same email environment used to evaluate emails in the current study. Participants reported using the internet for 6.53 hours per day, on average. This research complied with the American Psychological Association Code of Ethics and was approved by the Institutional Review Board at Rice University. Informed consent was obtained from each participant.
Apparatus and Email Stimuli
Participants evaluated emails in the web version of Gmail in the Firefox web browser (version 60.6.1esr, 32-bit) while logged in to an account set up specifically for this study. Both the legitimate and phishing emails came from a variety of senders, including companies (e.g., Apple, Uber, Capital One) and individuals (e.g., people affiliated with the university). For example, an email from Apple notified users of a recent purchase and prompted them to click a link if they had not made the purchase. Participants never saw the same email or sender more than once. The emails were randomly assigned to either the pre-test or the post-test, with the constraint that each test contained eight legitimate and eight phishing emails.
The legitimate emails were created by taking legitimate emails sent to one of the authors, modifying the body of each email slightly, and then sending it to the study-specific email address using an email spoofing website. We made three modifications to the body of the emails. First, we changed any dates and times within the body of the original email so that they corresponded with when the email was sent to the study-specific email address. Second, we changed any mention of the original recipient’s email address to the study-specific email address. Last, we removed any reference to the original recipient’s name in the body of the email because we wanted participants to evaluate these emails as if they had actually received them, and mentioning another person’s name would break the illusion that these emails could be their own. The same procedure was used to create the phishing emails, but they were modified further to make them phishing emails. In particular, we changed the email domain of the sender to something different from the legitimate organization’s email domain, and we changed the links within the body of the email to reference a URL different from the legitimate organization’s website. This method of taking legitimate emails from companies and then modifying them produces high-quality phishing emails because they are identical to the actual emails sent by companies with respect to the company’s visible branding, which is one way scammers try to make their phishing emails seem to come from the purported sender (Drake et al., 2004).
Jigsaw Online Phishing Quiz
The Jigsaw online phishing quiz took participants through eight emails and prompted them to identify each email as phishing or legitimate. An example email is shown in Figure 1. After each response, whether correct or incorrect, feedback was provided about the accuracy of the response and the cues that indicated the email’s legitimacy (e.g., domain of the sender’s email address, domain of the linked URL). The Jigsaw quiz applied many of the learning science principles used in the PhishGuru and Anti-Phishing Phil training tools (Kumaraguru et al., 2010). For instance, it had participants practice the task of identifying emails as legitimate or phishing (i.e., learning-by-doing) and provided immediate feedback on whether they had correctly classified each email. Conceptual knowledge was provided by defining phishing as “an attempt to trick you into giving up your personal information by pretending to be someone you know” (Jigsaw, n.d., start page), and procedural knowledge for how to identify an email as phishing was also given. For instance, the quiz prompted users to “be sure to check out link URLs by hovering or using long presses, and to explore the email addresses” (Jigsaw, n.d., Question 1). Then, after the user had responded, it informed the user that “mousing over this link or using a long press will show you that it goes to [an] insecure imitation domain” using a pop-up box close to the referenced link (Jigsaw, n.d., Question 1), conforming to the learning principle of contiguity (Clark & Mayer, 2016). This procedural knowledge was delivered in a conversational tone, conforming to the learning principle of personalization (Clark & Mayer, 2016).
Over the course of the quiz, it showed several ways in which scammers may try to trick recipients, including imitation email addresses (e.g., no-reply@google.support), imitation linked URLs (e.g., http://drive–google.com/luke.johnson, https://drive.google.com.download-photo.sytez.net/AONh1e0hVP), and redirecting linked URLs (e.g., https://google.com/amp/tinyurl.com/y7u8ewlr).

Figure 1. Screenshot of the First Question of the Jigsaw Online Phishing Quiz.
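To illustrate why two of the imitation and redirecting URLs listed above are misleading, the following Python sketch extracts where they actually lead. The helpers `naive_registrable_domain` and `amp_redirect_target` are hypothetical. The first assumes the registrable domain is the last two hostname labels, which fails for suffixes such as .co.uk. The second assumes a `/amp/` path prefix followed directly by a hostname, matching the example URL rather than any documented format.

```python
from typing import Optional
from urllib.parse import urlsplit


def naive_registrable_domain(url: str) -> str:
    """Return the last two labels of the URL's hostname.

    Simplifying assumption: real tools should use the Public Suffix List.
    """
    host = (urlsplit(url).hostname or "").rstrip(".")
    return ".".join(host.split(".")[-2:])


def amp_redirect_target(url: str) -> Optional[str]:
    """Return a hostname embedded right after an /amp/ path prefix (heuristic)."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) >= 2 and segments[0] == "amp" and "." in segments[1]:
        return segments[1].lower()
    return None


# The hostname begins with "drive.google.com", but the site belongs to sytez.net.
print(naive_registrable_domain("https://drive.google.com.download-photo.sytez.net/AONh1e0hVP"))  # sytez.net
# The link is hosted on google.com, but it forwards to a URL shortener.
print(amp_redirect_target("https://google.com/amp/tinyurl.com/y7u8ewlr"))  # tinyurl.com
```

In both cases the familiar name "google" appears in the URL, yet reading the host from right to left (or following the redirect) reveals a different destination.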
Procedure and Design
Participants were randomly assigned to either the training group or the control group, with the constraint that an equal number were assigned to each group. All participants first evaluated a set of 16 emails (eight legitimate and eight phishing) as a pre-test. After the pre-test, participants in the training group completed the Jigsaw training quiz, and participants in the control group completed a filler task (crossword puzzle) for approximately the time it takes to complete the Jigsaw online phishing quiz (5 minutes). Next, all participants evaluated a different set of 16 emails (eight legitimate and eight phishing) as a post-test. For both the pre-test and post-test, participants were instructed to determine whether each email was authentic or falsified and to verbally report their decision to the experimenter. A participant passed a trial if they correctly classified the email and failed it if they incorrectly classified the email, clicked on a link, or opened an attachment. The latter two behaviors were scored as failures because a user should not engage in these behaviors before determining an email’s authenticity.
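The pass/fail scoring rule above can be expressed as a small Python function. The `Trial` record and its field names are hypothetical conveniences for illustration, not the study's actual data format.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One email evaluation (hypothetical record structure)."""
    is_phishing: bool            # ground truth for the email
    judged_phishing: bool        # participant's verbal classification
    clicked_link: bool = False
    opened_attachment: bool = False


def trial_passed(trial: Trial) -> bool:
    """A trial fails on any link click or attachment opening, even if the verdict is right."""
    if trial.clicked_link or trial.opened_attachment:
        return False
    return trial.judged_phishing == trial.is_phishing


print(trial_passed(Trial(is_phishing=True, judged_phishing=True)))                     # True
print(trial_passed(Trial(is_phishing=True, judged_phishing=True, clicked_link=True)))  # False
```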
Participants were instructed to think aloud as they evaluated emails in the pre- and post-tests (i.e., verbal protocol; Kirwan & Ainsworth, 1992; Nemeth, 2004). For each trial, the experimenter noted participants’ verbal responses and physical behaviors in the Gmail environment. We were specifically interested in whether participants checked the sender’s email address or the URL of links in the emails because these are reliable cues. Accordingly, we considered these behaviors to be good behaviors. We used both verbal report (e.g., reading the email address aloud) and mouse behavior (e.g., hovering over a link so the URL pops up) as indications of these good behaviors.
Dependent Measures
We used signal detection theory to evaluate participants’ performance because it separates users’ sensitivity (i.e., their ability to determine whether an email is phishing) from their response bias (i.e., their tendency to report that an email is phishing or legitimate). This is important because training could lead a user to correctly identify more phishing emails simply by changing their criterion for calling an email phishing (response bias) without changing their sensitivity. Accordingly, we calculated hit rate, false alarm rate, sensitivity (d′), and response bias (c) for each participant. In line with prior research, we treated phishing emails as signal plus noise and legitimate emails as noise (Canfield et al., 2016), so a hit was a phishing email classified as phishing and a false alarm was a legitimate email classified as phishing. This means d′ provided a measure of a user’s ability to identify phishing emails independent of their tendency to report that an email was phishing or legitimate. We used c to measure response bias (Canfield et al., 2016; Stanislaw & Todorov, 1999). In this context, a negative c reflected a tendency to identify an email as phishing when the user was uncertain, whereas a positive c reflected a tendency to identify an email as legitimate when the user was uncertain. To handle cases in which all emails of a type were classified correctly or incorrectly (rates of 1 or 0, for which z scores are undefined), a log-linear correction was applied to every participant’s hit and false alarm rates, whether or not they were extreme (Canfield et al., 2016; Hautus, 1995; Stanislaw & Todorov, 1999). This correction adds 0.5 to the numbers of hits and false alarms and adds 1 to the numbers of phishing and legitimate trials, respectively.
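The signal detection computations described above can be sketched with Python's standard library, using `statistics.NormalDist().inv_cdf` as the z transform. The function name and argument layout are illustrative assumptions; the formulas follow the standard definitions d′ = z(H) − z(F) and c = −[z(H) + z(F)]/2, with the log-linear correction applied to every rate.

```python
from statistics import NormalDist


def sdt_measures(hits: int, n_phishing: int, false_alarms: int, n_legit: int):
    """Return (hit_rate, fa_rate, d_prime, c) with the log-linear correction.

    Phishing emails are the signal trials; legitimate emails are noise trials.
    """
    # Log-linear correction (Hautus, 1995): add 0.5 to each count and 1 to each
    # trial total, so rates of exactly 0 or 1 never occur and z stays finite.
    hit_rate = (hits + 0.5) / (n_phishing + 1)
    fa_rate = (false_alarms + 0.5) / (n_legit + 1)
    z = NormalDist().inv_cdf
    d_prime = z(hit_rate) - z(fa_rate)
    # Negative c = bias toward "phishing"; positive c = bias toward "legitimate".
    c = -(z(hit_rate) + z(fa_rate)) / 2
    return hit_rate, fa_rate, d_prime, c


# A perfect 16-email test (8 phishing, 8 legitimate) still yields finite values.
h, f, d, c = sdt_measures(hits=8, n_phishing=8, false_alarms=0, n_legit=8)
print(round(d, 2), round(c, 2))
```

For the perfect score in the example, the corrected rates are 8.5/9 and 0.5/9, giving d′ of roughly 3.19 and c essentially 0. Without the correction, z(1) and z(0) would be infinite.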
We also calculated a difference score for good behaviors (i.e., checking reliable cues) by subtracting the proportion of emails on which a participant exhibited good behaviors in the pre-test from the corresponding proportion in the post-test. A positive difference score therefore reflects an increase in good behaviors, and a negative score reflects a decrease.
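As a worked example of the good-behavior difference score, the following sketch uses hypothetical counts: a participant who checked a reliable cue on 5 of 16 pre-test emails and on 12 of 16 post-test emails. The function name is an illustrative assumption.

```python
def good_behavior_change(pre_good: int, pre_total: int, post_good: int, post_total: int) -> float:
    """Post-test proportion minus pre-test proportion of emails with good behaviors."""
    return post_good / post_total - pre_good / pre_total


print(good_behavior_change(5, 16, 12, 16))  # 0.4375, an increase in good behaviors
```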
Results
Signal Detection Theory Measures
We decided a priori to make d′ and c the primary signal detection measures. The box plots in Figure 2 and the statistics in Table 1 show that the difference in sensitivity (d′) between pre- and post-tests was considerably greater in the training group (1.14) than in the control group (0.48). The difference in pre-post d′ changes was significant, t(38) = 2.05, p = .048, d = 0.65. Sensitivity increased from pre-test to post-test for both the training group, t(19) = 3.97, p = .001, d = 0.89, and the control group, t(19) = 3.22, p = .005, d = 0.72.

Figure 2. Box Plots of Change in d′ From Pre-Test to Post-Test for Each Group. The plus signs represent the means.
Table 1. Means and Standard Deviations of Dependent Measures Broken Down by Group.
Note. Standard deviations are presented in parentheses. Diff. is the difference between pre-test and post-test.
ᵃMeasured in percent.
The box plots contain one outside value. As we explain in the Discussion, this low score was likely due to a technical error on the participant’s part rather than to an inability to classify phishing emails. If this participant’s data are removed from the analysis, the p value for the comparison of d′ changes decreases from .048 to .011.
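In Tukey's box plot terminology, an outside value lies beyond the inner fences, which sit 1.5 times the interquartile range (IQR) below the first quartile or above the third. The sketch below flags such points. Quartile conventions differ across software, so the `inclusive` method used here is an assumption and may not match the software that drew Figure 2; the input values are hypothetical.

```python
from statistics import quantiles


def outside_values(values, k=1.5):
    """Return points beyond Tukey's inner fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical scores with one unusually low value.
print(outside_values([1, 2, 3, 4, 5, 6, 7, 8, 9, -20]))  # [-20]
```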
Figure 2 and Table 1 also reveal greater standard deviations in the training condition than in the control condition. This possible violation of homogeneity of variance is not of serious concern here because t-tests are robust when the sample sizes are equal. Although the difference in standard deviations could potentially be of some theoretical importance, the difference was not significant, Levene’s F(1, 38) = 2.26, p = .141. Further, with the outside value removed from the analysis, the standard deviation for the training group is reduced from 1.28 to 1.14.
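Levene's test compares groups on the absolute deviations of each score from its group mean, using a one-way ANOVA F statistic on those deviations. The minimal sketch below assumes the original mean-centered version; some packages default to the median-centered Brown-Forsythe variant, and the text does not state which version was used. The function name is hypothetical.

```python
from statistics import mean


def levene_f(*groups):
    """Mean-centered Levene's test; returns (F, df_between, df_within)."""
    # Absolute deviation of each score from its own group mean.
    devs = [[abs(x - mean(g)) for x in g] for g in groups]
    k = len(devs)
    n_total = sum(len(d) for d in devs)
    grand = mean([z for d in devs for z in d])
    between = sum(len(d) * (mean(d) - grand) ** 2 for d in devs)
    within = sum((z - mean(d)) ** 2 for d in devs for z in d)
    f = (between / (k - 1)) / (within / (n_total - k))
    return f, k - 1, n_total - k


print(levene_f([1, 2, 3], [2, 4, 6]))
```

For the toy input, F is 0.8 on (1, 4) degrees of freedom; with two groups of 20, the degrees of freedom would be (1, 38), as reported.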
The criterion for deciding that an email was a phishing email decreased slightly more for the training group than for the control group, but the difference in decreases was not significant, t(38) = -0.55, p = .585, d = 0.17. There was strong evidence for a decrease in the training group, t(19) = -2.46, p = .024, d = 0.55, but only suggestive evidence in the control group, t(19) = -1.80, p = .087, d = 0.40.
Follow-up analyses were conducted on hit rates and false alarm rates. Table 1 shows that the change in hit rate from pre-test to post-test was greater in the training group than in the control group, although this difference was not statistically significant, t(38) = 1.86, p = .070, d = 0.59. The hit rate increased from pre-test to post-test for both the training group, t(19) = 4.71, p < .001, d = 1.05, and the control group, t(19) = 3.60, p = .002, d = 0.81. The false alarm rate decreased from pre-test to post-test slightly more for the training group than for the control group, but this difference was not significant, t(38) = -1.38, p = .176, d = 0.44. The false alarm rate decreased significantly for the training group, t(19) = -2.12, p = .047, d = 0.47, but not for the control group, t(19) = -0.54, p = .592, d = 0.12.
Good Behaviors
Table 1 shows that good behaviors increased more in the training group than in the control group. A 2 × 2 mixed ANOVA on the good behavior difference scores, with group (training, control) as the between-subjects factor and email authenticity (legitimate, phishing) as the within-subjects factor, found this difference to be significant, F(1, 38) = 13.01, p = .001, ηp² = .255. There was also a significant main effect of email authenticity: participants increased the frequency with which they exhibited good behaviors from pre-test to post-test more for legitimate emails than for phishing emails, F(1, 38) = 7.70, p = .009, ηp² = .169. There was no evidence of an interaction between group and email authenticity, F(1, 38) = 0.67, p = .419, ηp² = .017. There was conclusive evidence for an increase in good behaviors in the training group, t(19) = 6.31, p < .001, d = 1.41, and strongly suggestive evidence in the control group, t(19) = 1.97, p = .064, d = 0.44.
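Partial eta squared can be recovered from a reported F statistic via ηp² = (F × df_effect)/(F × df_effect + df_error). The check below, an illustration rather than the authors' code, applies this identity to the three F values reported above; the results agree with the reported ηp² values to within the rounding of F.

```python
def partial_eta_squared(f: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared implied by an F statistic and its degrees of freedom."""
    return (f * df_effect) / (f * df_effect + df_error)


for label, f in [("group", 13.01), ("authenticity", 7.70), ("interaction", 0.67)]:
    print(label, round(partial_eta_squared(f, 1, 38), 3))
```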
Discussion
The training group improved more than the control group at classifying emails as legitimate or phishing from pre-test to post-test. In the training group, there was both an increase in sensitivity to the difference between phishing and legitimate emails and a lowering of the criterion for classifying an email as phishing. The training group exhibited more good behaviors in the post-test than in the pre-test and a larger increase in good behaviors than the control group. Given that the training focused on informing users how to check reliable cues and how to interpret those cues, we attribute these improvements to the content of the training. We also attribute them to the design of the training, which conformed to many learning principles.
The control group also improved from pre-test to post-test, showing an increase in sensitivity to phishing emails. One possible explanation is that the post-test was easier than the pre-test, even though emails were randomly assigned to the two tests. Because the control group did not exhibit significantly more good behaviors, this explanation remains plausible. If it is correct, then part of the training group’s improvement could also be accounted for by the easier post-test. However, the finding that the training group improved more than the control group indicates that the training was effective.
Although users who received training improved at classifying emails as legitimate or phishing, their performance was far from perfect. In the post-test, they correctly classified, on average, only about two-thirds of phishing emails and still classified, on average, over a third of legitimate emails as phishing. We consider two reasons why this occurred. First, the phishing emails in the current study were high-quality phishing emails because we minimized the presence of errors that can reveal an email as phishing. Only one of our phishing emails had an error that was not present in the original email on which it was based; none of the others contained spelling errors, grammar errors, or low-quality graphics that were not present in the original. Some actual phishing emails will contain such errors, and users will be able to use these errors to identify them as phishing. Accordingly, users’ performance in the current study should not be taken as an estimate of how many emails they would correctly identify in the real world.
The second reason why users may have performed poorly is that the training was relatively short (approximately 5 minutes, depending on the user’s pace). Users likely need more training to become highly accurate at classifying emails. The training could be extended by adding more emails to evaluate, although each additional email may improve performance less than the one before. Instead, users may need more guidance at the beginning on how to check reliable cues and how to interpret them. For instance, the outside value in the training group may reflect not an inability to recognize phishing emails but technical difficulties in examining the emails in the post-test. Specifically, this participant frequently long-pressed links to examine the target domain, which led them to inadvertently click the links. The training describes long pressing as a way to preview a link’s URL but does not specify that long pressing is appropriate only on mobile devices. Moreover, while a user is taking the training on a desktop computer, long pressing a link does preview the URL, even though this action does not typically produce a preview elsewhere on a desktop. Thus, training should include more guidance about long pressing. Further, some users may not have understood what hovering meant in this context or where the previewed URL was shown. A short video demonstrating how to hover and where to look for the URL preview may have been helpful. Users may also have had trouble learning how to interpret reliable cues. For instance, when pop-up boxes appeared next to links to teach users how to interpret the linked URLs, the Jigsaw quiz did not permit users to hover over the links to see a preview of the URL. This may have impeded users’ ability to learn from the pop-up boxes at a critical time.
Additionally, users may not have known that the email address, one of the cues the training instructs users to check, can be spoofed. An extended training could convey this and discuss how to check whether it is spoofed.
The results of the current study should be interpreted in light of its limitations. First, the post-test may have been easier than the pre-test, which could have led to the improved performance in the control group. This issue might have been alleviated by counterbalancing which email set served as the pre-test and which as the post-test. Second, our emails did not address users by name, which may be inconsistent with how actual emails typically appear. Indeed, approximately one-third of users mentioned this while evaluating the emails. Our emails omitted a personalized greeting so that the same emails could be shown to all users; however, future studies should consider including one to make the emails more consistent with actual emails. Third, users were sometimes unfamiliar with the organization sending the email, which may have affected their judgments. Indeed, approximately one-third of users said “I’ve never heard of this company” or “I do not know anything about this company” while evaluating the emails, even though we intentionally used familiar organizations (e.g., Netflix and Apple) to increase the familiarity of the emails and to imitate how phishing emails appear in the real world. Future work should consider how familiarity with an organization or individual affects users’ likelihood of engaging with phishing emails.
The current research has important practical implications for cybersecurity. Although some automated tools detect and flag phishing emails, these tools cannot protect users from all phishing attacks. Thus, users’ ability to identify phishing emails is critical to avoid becoming victims of these attacks. Our results support the notion that training users how to check and interpret reliable cues can improve their ability to identify phishing emails. Training methods that focus on reliable cues may therefore decrease the likelihood that users fall victim to a phishing attack and experience financial loss or identity theft. Finally, our results support the use of online phishing quizzes, like the one by Jigsaw, as an effective way to improve users’ ability to identify phishing emails. Given that online phishing quizzes are freely available online (Perrault, 2018) and have substantial benefits, we recommend their use as a first step in training users to avoid phishing attacks. It is important to note, however, that some users will require more extensive and detailed training to consistently spot phishing emails.
Future research could examine the effect of quiz length and the use of multiple training sessions on the detection of phishing emails. Also of interest would be the degree to which training in an experimental setting generalizes to detection accuracy in a real-life setting.
Footnotes
Declaration of Conflicting Interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: One or more of the authors hold stock in the parent company of Jigsaw (Google).
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
