Teaching Decision Making With Serious Games

Abstract

Game-based training may have different characteristics than other forms of instruction. The independent validation of the Intelligence Advanced Research Projects Activity (IARPA) Sirius program evaluated game-based cognitive bias training across several games with a common set of control groups. Control groups included a professionally produced video that taught the same cognitive biases and an unrelated video that did not teach any biases. Knowledge was tested immediately after training and after a delay. This article presents the results from the two phases of the Sirius program. Game-based training showed advantages in teaching bias mitigation skills (procedural knowledge) but had no advantage over video instruction in teaching people to answer explicit questions about biases (declarative knowledge). Overall, training effects persisted over time, and games performed as well as and in some cases better than the video-based instruction for knowledge retention. Our results suggest that serious games can be an effective training tool, particularly for teaching procedural knowledge.

Keywords

serious games, independent evaluation, game intervention, cognitive bias

Serious games use game features such as goal orientation, feedback, and adaptive challenge to present educational content in an immersive environment with the goal of developing skills or knowledge, rather than entertaining (Shute, Ventura, Zapata-Rivera, & Bauer, 2009). Many educators are interested in leveraging the design principles of video games to create more engaging and motivating learning environments (Romero, Usart, & Ott, 2015; Squire, 2007). Despite the widespread interest in serious games, relatively little is known about the advantages they have over more traditional teaching methods. Evaluations that have compared serious games to traditional teaching methods have often had methodological shortcomings, such as a lack of necessary experimental controls (Girard, Ecalle, & Magnan, 2013). As discussed in this special issue, the Intelligence Advanced Research Projects Activity (IARPA) Sirius program provided a unique opportunity to evaluate the pedagogical effectiveness of several differently themed serious games, compared to a noninteractive educational video, through a controlled experiment. In this article, we provide an overview of the testing and evaluation (T&E) approach used in the Sirius program and key findings regarding the types of knowledge that serious games teach best.

The Sirius program tested the efficacy of serious games for teaching about and minimizing cognitive biases. Cognitive biases are systematic errors in judgment and decision making that result when people automatically apply mental shortcuts or rules of thumb in situations where more careful thought may be required (Tversky & Kahneman, 1974; see Soll, Milkman, & Payne, 2015, for a recent review). The Sirius program invited multidisciplinary teams to create serious games that would train people to recognize and mitigate six cognitive biases: fundamental attribution error, bias blind spot, confirmation bias, anchoring, representativeness, and projection. The primary goal of the games in the Sirius program was to improve people’s ability to avoid or mitigate the influence of cognitive biases when making judgments or decisions, a form of procedural knowledge. A secondary goal of the games was to improve the ability to correctly identify a description of a cognitive bias, a form of declarative knowledge. Put another way, the declarative knowledge emphasis tested knowledge of “what is” a bias, and the procedural knowledge emphasis tested knowledge of “how to” avoid a bias (Anderson, 1982).

Each research team in the Sirius program developed games that would teach both declarative and procedural knowledge of each bias. Research in science education has suggested that students who engage in hands-on exercises perform better on problem-solving tests relying on procedural knowledge than students who observe demonstrations (Glasson, 1989). Given that serious games provide interactive experience with the subject matter akin to hands-on exercises, they may show a particular advantage over noninteractive media for teaching procedural knowledge of cognitive biases. The Sirius program tested this possibility as well as serious games’ ability to yield not only short-term training effects but also long-term effects. The extent to which serious games can enact changes in thinking or behavior over the long term is an important question that has yet to be addressed by prior evaluations of serious games (Girard, Ecalle, & Magnan, 2013).

What Are Cognitive Biases?

To illustrate a cognitive bias, consider the following example from Tversky and Kahneman (1974). In a room of 100 people, there are 30 engineers and 70 lawyers. Steve is in this room and on first impression seems shy, withdrawn, orderly, and conscientious—characteristics that are consistent with the stereotypical portrayal of engineers. When estimating the likelihood that Steve is an engineer, people who focus solely on the characteristics of Steve might be committing a type of representativeness bias called base rate neglect, in which people overweight specific case information and ignore prior probabilities. In this case, there are more lawyers than engineers present in the room, so any estimate of the likelihood of Steve being an engineer should take this “base rate” proportion into account. In order to avoid base rate neglect and many other cognitive biases, people must learn to recognize what information they are attending to and identify what additional information might be relevant for that judgment. Many researchers have distinguished between two different types of reasoning—a fast, unconscious, and automatic component (System 1) and a slow, conscious, and deliberative component (System 2; Evans, 2003; Glöckner & Witteman, 2010). The automatic attention to Steve’s characteristics can be considered a product of System 1 thinking, while forcing oneself to attend to other relevant information in the situation is a task for System 2. Because the attention to Steve’s characteristics happens on a fast time scale, in many cases, people may not even be aware of their own thought processes until the biased judgment has already been made. For this reason, interactive features of serious games that offer the ability to practice making judgments and receive feedback could be an effective approach for teaching bias mitigation strategies.
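
The role of the base rate in the Steve example can be made concrete with Bayes' rule. The sketch below is illustrative only. The 30/70 split comes from the example above, but the two likelihood values (how probable Steve's description is for an engineer versus a lawyer) are hypothetical numbers chosen for demonstration; they are not taken from Tversky and Kahneman (1974).

```python
def posterior_engineer(prior_engineer, p_desc_given_engineer, p_desc_given_lawyer):
    """Posterior probability that Steve is an engineer, by Bayes' rule."""
    prior_lawyer = 1.0 - prior_engineer
    evidence = (p_desc_given_engineer * prior_engineer
                + p_desc_given_lawyer * prior_lawyer)
    return p_desc_given_engineer * prior_engineer / evidence


# Base rate from the example: 30 engineers among 100 people.
PRIOR_ENGINEER = 30 / 100

# Hypothetical likelihoods (assumptions, not from the source): the description
# is taken to be three times as likely for an engineer as for a lawyer.
P_DESC_GIVEN_ENGINEER = 0.6
P_DESC_GIVEN_LAWYER = 0.2

# Judgment that combines the description with the base rate.
with_base_rate = posterior_engineer(
    PRIOR_ENGINEER, P_DESC_GIVEN_ENGINEER, P_DESC_GIVEN_LAWYER)

# Base rate neglect modeled as treating both professions as equally common.
ignoring_base_rate = posterior_engineer(
    0.5, P_DESC_GIVEN_ENGINEER, P_DESC_GIVEN_LAWYER)

print(f"with base rate: {with_base_rate:.2f}")
print(f"ignoring base rate: {ignoring_base_rate:.2f}")
```

Under these assumed likelihoods, the estimate that uses the base rate is about 0.56, whereas the estimate that ignores it is 0.75. The gap between the two numbers is the kind of systematic deviation from a normative response that bias elicitation tasks are designed to detect.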

Measuring Cognitive Bias

A key component of the Sirius program was the T&E team, which created a standardized assessment of cognitive bias and used it to evaluate the effectiveness of each team’s game in a controlled experiment. The Assessment of Biases in Cognition (ABC) tests (Gertner, Zaromb, Schneider, Roberts, & Matthews, 2016) were developed for this program by the MITRE Corporation (MITRE) and Educational Testing Service (ETS). The Johns Hopkins University Applied Physics Laboratory conducted the Independent Validation and Verification (IV&V) experiment. A benefit of the T&E component was that it used an independently developed metric of success, allowing comparisons of effectiveness to be made across games. As described in the other articles in this special issue, research teams developed their own metric(s) during game development to assess how well games were performing. However, because these metrics were developed by the same group of people who designed the games, who had full knowledge of the questions that would be tested, the possibility that games could “teach to the test” was an inherent risk. The goal of the Sirius program was not to improve performance on one specific test; rather, the goal was to improve people’s reasoning and decision making skills so that they could be better equipped to avoid cognitive biases in a variety of settings. To this end, MITRE and ETS developed the ABC test to be an independent, standardized assessment of cognitive bias for each phase of the Sirius program (Gertner et al., 2016). This assessment incorporated a variety of procedures for measuring cognitive biases, which were adapted from procedures used in the extensive judgment and decision making literature (Kahneman & Frederick, 2002; Krull et al., 1999; Nickerson, 1998; Pronin, Lin, & Ross, 2002; Ross, Greene, & House, 1977; Strack & Mussweiler, 1997). 
The ABC consisted of two types of items, recognition and discrimination (RD) items and behavioral elicitation (BE) items, which tested declarative and procedural knowledge, respectively. The game development teams did not know which questions would be included on this assessment, thus emphasizing the need to improve participants’ deep knowledge of cognitive biases so that they could apply the concepts to new, unseen problems.

Independent Testing of Serious Games

Another benefit of the T&E component was that, through the IV&V process, an independent organization tested the efficacy of each team’s game. Although the Sirius teams conducted iterative testing during the development of their games, replication by an independent entity offered numerous advantages. First, it better controlled for sample selection, which ruled out the possibility that effects observed by the research teams were attributable to the research participants they selected. For example, it is possible that the samples selected by research teams during iterative testing could have differed in meaningful ways, such as average education level. Additionally, although many teams followed similar procedures during their iterative testing, there is nevertheless the possibility that differences in extraneous factors—such as the number of distractions in the environment, interactions between experimenter and participants, type of compensation offered, amount of time that elapsed between playing the game and being tested, etc.—could have influenced the outcomes of their research studies. The IV&V process controlled for these factors by conducting an experiment in which each participant had an equal chance of being assigned to each media condition and each condition followed an identical procedure. In other words, the IV&V experiment helped ensure that the only thing that differed across conditions was the type of media used in the training.
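
As a rough illustration of the assignment principle described above, the following sketch gives every participant an equal, independent chance of landing in each media condition. It is a minimal example under stated assumptions: the article does not specify the randomization procedure, so simple (unblocked) randomization, the function name, and the seed are all hypothetical choices. The condition labels are the Phase 1 conditions named later in this article.

```python
import random

# Phase 1 media conditions as named in the article.
CONDITIONS = [
    "Cycles", "Heuristica", "Missing", "Enemy of Reason", "Macbeth",
    "related video", "unrelated video",
]


def assign_conditions(participant_ids, conditions=CONDITIONS, seed=None):
    """Assign each participant to one condition with equal probability.

    Simple randomization is an assumption here; the source only states that
    each participant had an equal chance of being assigned to each condition.
    """
    rng = random.Random(seed)  # a seeded generator makes the assignment reproducible
    return {pid: rng.choice(conditions) for pid in participant_ids}


# Hypothetical participant IDs 1..487 (487 participants were recruited in Phase 1).
assignment = assign_conditions(range(1, 488), seed=0)
```

Simple randomization does not guarantee equal group sizes, which is consistent with the unequal condition sizes reported in the article, although attrition and exclusions also contribute to those differences.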

The T&E process involved a second control condition to account for an effect commonly observed in multistage experiments. When participants are tested more than once, such as in a pretest/training/posttest design, the repeated testing can have an effect on performance. In particular, participants often tend to perform better on the posttest than they did on the pretest, even if no intervention was administered (Kim & Willson, 2010). One reason for this testing effect could be that participants become familiar with the material that is being tested. They may also develop strategies for answering and be less anxious than they were the first time, which can positively affect performance. Several studies in cognitive psychology have demonstrated that repeated testing, even with no feedback provided and no studying in between, results in greater long-term knowledge retention than repeated studying (Roediger & Karpicke, 2006). To control for testing effects such as these in an experimental design, it is beneficial to have a control condition in which participants take the repeated tests but do not participate in an intervention. Performance in the intervention conditions can be compared to performance in the nonintervention condition to determine whether the intervention significantly improved performance above and beyond the effects of repeated testing.

In the sections that follow, we describe the procedure and results of the IV&V experiment in both phases of the Sirius program. Two main comparisons were of interest. First, we compared the performance of each serious game to a video that taught related content but lacked interactivity (related video). If serious games have an advantage over noninteractive media for teaching cognitive bias content, we expected that performance in the game conditions would be better than performance in the related video condition. Additionally, we compared the performance of each game to a video that taught content unrelated to cognitive biases (unrelated video). If serious games result in more learning than that which occurs by repeated testing, we would expect that performance in the game conditions would be better than performance in the unrelated video condition. We performed these comparisons for both declarative knowledge questions and procedural knowledge questions on the ABC, immediately after training and after a delay.

Method

Participants and Procedure

In total, 1,219 participants were recruited across the two phases of the Sirius program (487 participants in Phase 1, 732 participants in Phase 2). Recruitment materials for the study did not mention that it was a study about games; they referred only to multimedia in order to prevent biasing participants against the video control conditions. All performers in the IARPA Sirius program were advised to recruit in a similar manner. Participants were recruited from universities and were offered either course credit or gift cards for their participation. The experiment consisted of two in-person sessions and one online session, as shown in Figure 1. At the first session, participants provided demographic information and took the ABC as a pretest. Following that, participants were randomly assigned to a media condition in which they either played a serious game or watched a video. Participants took the ABC a second time immediately after engaging with one of the media. Finally, to measure long-term knowledge retention, participants took the ABC a third time after a delay period (8 weeks in Phase 1, 12 weeks in Phase 2). Three equivalent versions of the ABC were developed, so that no participant took the exact same version more than once, and the presentation order of the tests was counterbalanced. Participants who did not complete the follow-up test were excluded from analyses. Items were embedded in the ABC to measure whether participants paid adequate attention to the test, and participants who failed to answer at least two of these attention check items correctly were excluded from analyses. In Phase 1, 337 participants completed the 8-week follow-up, of whom 301 passed attention checks. In Phase 2, 455 participants completed the follow-up, of whom 325 passed attention checks.
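
One way to counterbalance three test versions across three sessions is to cycle through all six possible orderings. The sketch below shows that approach; the article does not report the exact counterbalancing scheme, so the full-permutation design, the form labels "A", "B", and "C", and the function name are assumptions made for illustration.

```python
from itertools import permutations

FORMS = ("A", "B", "C")  # hypothetical labels for the three equivalent ABC versions
SESSIONS = ("pretest", "posttest", "follow_up")


def counterbalanced_orders(n_participants):
    """Give each participant one ordering of the three forms.

    Cycling through all six permutations uses each ordering about equally
    often and ensures no participant sees the same form twice.
    """
    orders = list(permutations(FORMS))
    return [dict(zip(SESSIONS, orders[i % len(orders)]))
            for i in range(n_participants)]


schedule = counterbalanced_orders(12)
for participant, forms in enumerate(schedule[:3], start=1):
    print(participant, forms)
```

With a participant count that is a multiple of six, each form appears equally often at pretest, posttest, and follow-up, so differences between sessions cannot be attributed to one form being easier than another.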

Figure 1.
Game testing protocol.

Conditions

Phase 1 tested five games: Missing, Heuristica, Cycles, Enemy of Reason, and Macbeth. Two of these games failed to meet Sirius program performance criteria at the end of Phase 1, so Phase 2 testing consisted of only three games: Missing, Heuristica, and Cycles. There were two additional control conditions included in Phase 2—one in which the related video was presented twice (n = 38) and one in which no ABC pretest was conducted before media training (n = 40). Results for these conditions can be found in Kopecky et al. (2016). The length of a single play of each game ranged from 30 to 90 min (Table 1).

Table 1.
Single Play Duration and Sample Size Breakdown for Serious Game Conditions.

Phase	Game	Average Single Play Duration (Minutes)	N (Pre/Post)	N (Follow-Up)	N (Passed Attention Checks)
Phase 1	Cycles	30	57	46	43
Phase 1	Heuristica	90	74	52	51
Phase 1	Missing	90	64	41	38
Phase 1	Enemy of Reason	60	64	39	36
Phase 1	Macbeth	60	76	51	46
Phase 2	Cycles	60	118	84	62
Phase 2	Heuristica	70	123	70	50
Phase 2	Missing	60	119	68	48

Games included a mixture of scenarios and tasks that required participants to make judgments. Specific strategies for avoiding biases were provided throughout each game. Games differed in theme and approach for teaching cognitive bias content. Heuristica was a three-dimensional, first-person science fiction game that featured various opportunities to learn about cognitive biases. Cycles was a two-dimensional Flash-based game consisting of a collection of different puzzles, with an overarching science fiction theme. Macbeth and Missing consisted of overarching mystery narratives with miniature challenges that taught bias content. Missing consisted of episodes of play time, in which participants had to find and use information related to a crime, and after-action reviews that elaborated upon the content taught during play episodes. Enemy of Reason also used a mystery narrative with a more humorous tone, and bias content was taught through miniature challenges and in the main story line. More details on the approaches used in Cycles and Missing can be found in other papers in this special issue and more details on Heuristica can be found in Veinott et al. (2013). Screenshots of Cycles, Heuristica, and Missing can be found in Figures 2, 3, and 4, respectively.

Figure 2.
Screenshots of Cycles. Adapted from Cycles team (University at Albany, Colorado State University, University of Arizona, Syracuse University, Temple University, and 1st Playable Productions). Printed with permission from IARPA under government purpose rights.

Figure 3.
Screenshots of Heuristica. Adapted from Heuristica team (Applied Research Associates, Indiana University, Georgia Tech, Wright State University, Institute of Human Machine Cognition, and Virtual Heroes Division of Applied Research Associates). Printed with permission from IARPA under government purpose rights.

Figure 4.
Screenshots of Missing. Adapted from Missing team (Leidos, Boston University, Carnegie Mellon University, and Creative Technologies Incorporated). Printed with permission from IARPA under government purpose rights.

The control videos were similar across Phases 1 and 2: (1) a related video, provided by IARPA, that taught the same cognitive biases as the games but lacked interactivity (n = 64 in Phase 1, n = 53 in Phase 2) and (2) an unrelated video, which was a clip from This Emotional Life, a program produced by Public Broadcasting Service (PBS) that taught content unrelated to cognitive biases (n = 23 in Phase 1, n = 35 in Phase 2). The related control videos for Phases 1 and 2 were of similar style and length but taught the appropriate content for the different biases in Phases 1 and 2. The two video conditions lasted 30–35 min. The related video consisted of an engaging scientist presenting about cognitive biases through a series of illustrative vignettes and examples (Figure 5).

Figure 5.
Screenshots of related video provided by IARPA. Adapted from 522 Productions. Printed with permission from IARPA under government purpose rights.

Assessment of Biases in Cognition (ABC)

Behavioral Elicitation (BE) Scale

The primary goal of all interventions was to teach people to avoid biased thinking, as this skill has significant implications for everyday reasoning and decision making. The BE Scale of the ABC tested participants’ ability to avoid cognitive biases in a set of judgment tasks. The BE Scale consisted of six subscales, one for each bias: Confirmation Bias (12 items), Fundamental Attribution Error (8 items), Bias Blind Spot (8 items), Anchoring (15–17 items), Representativeness (19 items), and Projection (21 items). The number of anchoring items varied depending on which of the three equivalent versions of the ABC was being used. Items were adapted from the extensive literature on cognitive biases and required participants to make judgments under uncertainty, using only the information provided. To answer these items correctly, participants had to rely on procedural knowledge and metacognitive abilities to (1) inhibit automatic responses, (2) recognize the bias in their reasoning, and/or (3) apply an appropriate rule or strategy to guide their reasoning in an unbiased way.

Recognition and Discrimination (RD) Scale

A secondary goal of the interventions was to improve declarative knowledge of the six biases, as measured by the RD Scale of the ABC. The RD Scale consisted of fewer items than the BE Scale; there were 13 items in the Phase 1 RD Scale and 9 items in the Phase 2 RD Scale. Each item presented the test taker with a description of a scenario in which a person was committing a cognitive bias (Figure 6). Items were multiple choice and participants had to select the bias best described by the scenario. These items were considered tests of declarative knowledge because they tested participants’ factual knowledge of biases. Higher scores on this scale indicated greater RD knowledge.

Figure 6.
Example recognition and discrimination and behavioral elicitation items.

For each item, scoring methods were developed that defined correct or normative responses—that is, judgments that combined the appropriate information in the appropriate way. Biased responses were those that deviated from normative responses in a predictable, systematic way. BE Scale scores were continuous variables in which higher scores represented less bias. Details on the calculation of the RD and BE Scale scores and their reliabilities can be found in Gertner, Zaromb, Schneider, Roberts, and Matthews (2016).

A key component of the T&E design was that game developers knew which biases would be tested on the ABC but had no knowledge of the actual items that would be tested. In this way, the ABC tested participants’ ability to apply cognitive bias knowledge to unseen items. It is worth noting, however, that game developers were familiar with the same body of literature as test designers, and for some biases, a relatively limited number of elicitation methods exist in the literature.

Sirius Performance Criteria

In order to advance to Phase 2 in the Sirius program, games had to demonstrate that they reduced cognitive bias at both the immediate posttest and the delayed test. Games also had to perform significantly better than the related video control condition at the immediate and delayed tests.

Analysis

To measure improvements from pretest to posttest and pretest to follow-up, paired t-tests were performed on the RD and average BE scores. To measure the overall effect of condition on performance, an analysis of covariance (ANCOVA) was conducted on the scores with pretest scores entered as a covariate. Planned comparisons were conducted to detect significant differences between games and control videos. To estimate the strength of the relative effect of games compared to the control videos, Cohen’s d (Cohen, 1992) was calculated on the difference scores from pretest to posttest and pretest to follow-up, where larger d values indicated a stronger effect of games. The primary comparison of interest was the amount of improvement demonstrated by serious games compared to the related video; a second comparison of interest was that of serious games to the unrelated video.
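
The effect size step described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes the common pooled-standard-deviation form of Cohen's d for two independent groups, applied to pretest-to-posttest difference scores, and all scores shown are hypothetical values invented for the example. The function names are likewise hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def gain_scores(pre, post):
    """Per-participant difference scores (later test minus pretest)."""
    return [after - before for before, after in zip(pre, post)]


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled SD (an assumption)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical BE scores, invented for illustration; not data from the study.
game_pre, game_post = [50, 55, 60, 45, 52], [58, 57, 66, 45, 56]
video_pre, video_post = [51, 54, 58, 47, 53], [56, 53, 61, 48, 55]

# Positive d means the game group improved more than the video group.
d = cohens_d(gain_scores(game_pre, game_post),
             gain_scores(video_pre, video_post))
print(f"d = {d:.2f}")
```

For these made-up scores, d is about 0.73. Computing d on difference scores rather than raw posttest scores expresses how much more one condition improved than another, which matches the comparison of interest between games and control videos.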

Results

RD Scale

Results from the RD Scale in Phases 1 and 2 are presented in Figures 7 and 8. A clear pattern in most conditions across both phases was that performance at posttest was significantly higher than performance at pretest (ts > 4.85, ps < .001), indicating that participants were learning from the media. Performance at follow-up tended to be lower than performance at posttest, indicating some decay. In most cases, performance at follow-up did not decline back to the level that it was at pretest, suggesting that participants were still experiencing a benefit from training 8 to 12 weeks after the intervention. The only condition in which this pattern did not hold was the unrelated video condition. In the Phase 1 unrelated video condition, neither posttest nor follow-up performance was significantly different from pretest (ps > .10). In the Phase 2 unrelated video condition, posttest performance and follow-up performance were significantly higher than pretest performance (ts > 2.33, ps < .05), but the sizes of the training effects were substantially smaller than in the other conditions. This suggests there was some beneficial effect of repeated testing in Phase 2, but it was smaller than the effects of training with educational media.

Figure 7.
Phase 1 pre/post/delay scores for recognition and discrimination.

Figure 8.
Phase 2 pre/post/delay scores for recognition and discrimination.

Games versus control conditions

One-way ANCOVAs on the RD posttest scores revealed a significant overall effect of condition in both Phase 1, F(6, 286) = 18.24, p < .001, and Phase 2, F(4, 243) = 23.43, p < .001 (see Figures 7 and 8 for means). Planned comparisons revealed that all of the games in Phases 1 and 2 performed significantly better on RD than the unrelated video at posttest (ts > 4.07, ps < .001 for both phases; ds = 0.60–0.98 for Phase 1, ds = 0.79–1.56 for Phase 2), suggesting that they provided more training than what was acquired through repeated testing. There was also a significant overall effect of condition at follow-up in Phase 1, F(6, 283) = 9.68, p < .001, and Phase 2, F(4, 243) = 4.21, p < .01. Planned comparisons showed that, at follow-up, all of the Phase 1 games performed better than the unrelated video on RD (ts > 3.01, ps < .01, ds = 0.42–0.74), but none of the Phase 2 games did (ts < 1.54, ps > .10, ds = −0.25–0.18).

Another comparison of interest was the performance of serious games compared to the related video. None of the Phase 1 or Phase 2 games beat the related video at posttest or follow-up on RD. In fact, Missing and Macbeth performed significantly worse than the related video in Phase 1 (ts < −2.98, ps < .01) and Missing performed significantly worse than the related video in Phase 2, t(243) = −2.67, p < .01. This pattern of results suggests that none of the serious games was more effective than the related video at teaching declarative knowledge.

BE Scale

Paired t-tests on the BE Scale showed that across phases, bias mitigation performance tended to increase significantly from pretest to posttest in all conditions except the unrelated video condition (ts > 2.99, ps < .01; see Figures 9 and 10 for means). In Phase 1, follow-up performance did not decline significantly from posttest performance in the games and related video conditions, suggesting strong knowledge retention. In Phase 2, follow-up performance decreased slightly for these conditions but, like performance on the RD Scale, did not return to pretraining levels.

Figure 9.
Phase 1 pre/post/delay scores for behavioral elicitation.

Figure 10.
Phase 2 pre/post/delay scores for behavioral elicitation.

Games versus control conditions

In Phases 1 and 2, there was a significant overall effect of condition on posttest BE scores, F(6, 287) = 10.30, p < .001, and F(4, 243) = 17.19, p < .001, respectively. At follow-up, the overall effect of condition on BE scores was significant in Phase 1, F(6, 287) = 5.53, p < .001, but only marginal in Phase 2, F(4, 243) = 2.00, p = .09. In Phase 1, all game conditions produced significantly better bias mitigation performance than the unrelated video at posttest (ts > 2.47, ps < .05, ds = 0.44–1.18), and all games except Macbeth beat the unrelated video at follow-up as well (ts > 2.48, ps < .05, ds = 0.40–0.57). In Phase 2, Cycles, Missing, and Heuristica all produced better bias mitigation than the unrelated video at posttest (ts > 3.61, ps < .001, ds = 0.95–1.56), and Cycles continued to perform better than the unrelated video at follow-up, t(243) = 2.79, p < .01, d = 0.60.

The comparison between serious games and the related video condition on the BE Scale revealed a different pattern from the RD Scale. In Phase 1, Enemy of Reason and Missing produced better bias mitigation performance than the related video at posttest (ts > 2.00, ps < .05, ds = 0.37 and 0.54, respectively), and Cycles produced marginally better performance, t(287) = 1.96, p = .05, d = 0.46; the same conditions plus Heuristica performed better than the related video at follow-up (ts > 2.72, ps < .01, ds = 0.49–0.68). In Phase 2, Cycles and Missing each exceeded the related video condition at posttest (ts > 2.77, ps < .01, ds = 0.71 and 0.50, respectively), but none of the serious game conditions exceeded the related video at follow-up (ts < 1.15, ps > .20). This pattern of results suggests that some of the serious games were more effective than the related video at teaching procedural knowledge.

Discussion

Despite the interest in serious games as educational tools, there has been relatively little empirical work examining the benefits of serious games over more traditional methods that teach the same content but lack game-like elements. Moreover, among the evaluations that have conducted randomized, controlled experiments, the evidence for the effectiveness of serious games over control conditions has been somewhat mixed (Girard, Ecalle, & Magnan, 2013). Through the Sirius program, several teams of researchers and game designers developed serious games that were tested by an independent organization in a controlled experiment and demonstrated that serious games can offer unique advantages over noninteractive media.

Results of the IV&V of the Sirius program demonstrated that, in both phases of the program, serious games often had an advantage over video-based instruction for teaching procedural knowledge about cognitive biases and performed similarly to video-based instruction for teaching declarative knowledge. Serious games and video-based instruction both resulted in training effects that persisted over time, with some games (such as the Phase 1 versions of Enemy of Reason, Missing, Cycles, and Heuristica) outperforming video-based instruction in knowledge retention. These results suggest that the advantage of serious games is especially clear for teaching concepts that may be difficult to learn passively, such as the procedural knowledge portion (BE Scale) of the ABC. Our results suggest that the more effective way to learn “how to” knowledge in order to mitigate cognitive biases is to practice making judgments and receive feedback on those judgments, as opposed to observing other people make judgments through vignettes or stories (as was the case in the related video condition). However, when the goal of training was to understand and remember a collection of facts, in this case recognition and discrimination of biases (RD Scale), our results across both phases indicated that serious games do not have a particular advantage over noninteractive media. An analogy can be made to learning how to drive. To pass a driver skills test, which tests procedural knowledge, the most effective way to prepare is to actually practice driving rather than read a manual about driving or watch someone else drive. For the written, multiple-choice driving exam, practice behind the wheel probably would not hurt performance but may not yield any meaningful advantage over studying a manual.
Glasson (1989) found a similar result in physical science classes; specifically, students who engaged in hands-on activities during physical science classes did better on a problem-solving test that tested procedural knowledge than students who observed teacher demonstrations. Hands-on activities did not have an advantage over teacher demonstrations for a test of factual knowledge.

An alternative explanation for the finding that serious games improved immediate performance on the BE Scale more so than control videos could be that people spent more time playing the games than watching the videos. Time on task is an important variable to consider when evaluating the effects of game-based learning since it tends to be associated with learning (Tobias, Fletcher, & Wind, 2014). If time on task were the only variable driving the effects in the present study, however, we would have expected serious games to outperform the related video on both the RD and BE Scales, since the games were significantly longer in duration. Instead, the related video condition performed slightly better than the game conditions on the RD Scale despite having less content coverage than the games. We also would have expected the longer games to outperform the shorter games, which was not the case in either phase of the program. For example, in Phase 1, the average play duration for Cycles was significantly shorter than the other four games, yet Cycles emerged as one of the top performers. In Phase 2, Heuristica was longer than Missing and Cycles but did not perform as well as them on the declarative and procedural knowledge tests. Moreover, Kopecky et al. (2016) found that, in many cases, watching the related video twice did not result in significantly greater learning gains than a single viewing. Finally, several studies conducted during development of the games failed to find significant differences in learning outcomes between 30-min and 60- to 75-min versions of the games (Dunbar et al., 2014; Veinott et al., 2013). Taken together, these findings suggest that the duration of the intervention is not sufficient to explain the observed differences in learning outcomes. Our results are better explained by the fact that the games and lecture video focused on imparting different types of knowledge.

Identifying the specific features of games that make them more effective for learning procedural knowledge than alternative media was of interest in the Sirius program, and other articles in this special issue discuss studies conducted by the research teams to evaluate the effectiveness of different game elements, such as repetition, narrative depth, and reward structure. In particular, Bush (2017) and Strzalkowski and Symborski (2017) discuss the beneficial effects that repeated play had on learning outcomes. An important takeaway from the Sirius program was that, although serious games can be effective pedagogical tools, they are not automatically so. The performance differences that were observed between games in Phases 1 and 2 demonstrate that, although all games included interactive practice, they did not all perform equally well on the ABC. Although the opportunity for interactive practice was a salient difference between serious games and the related video, it is likely that the most successful games incorporated other features that differentiated them from other games and the related video. For example, some games incorporated mnemonic devices or taught specific mitigation strategies that the related video and other games did not. To better understand the factors that make serious games effective pedagogical tools, future research evaluating serious games in comparison to other instructional methods should further examine the relative contribution of different game features to learning outcomes.

Phase	Game	Average Single Play Duration (Minutes)	N		
Phase 1	Cycles	30	57	46	43
Phase 1	Heuristica	90	74	52	51
Phase 1	Missing	90	64	41	38
Phase 1	Enemy of Reason	60	64	39	36
Phase 1	Macbeth	60	76	51	46
Phase 2	Cycles	60	118	84	62
Phase 2	Heuristica	70	123	70	50
Phase 2	Missing	60	119	68	48

Footnotes

Authors’ Note

The research reported in this paper was completed while Franklin Zaromb was employed at Educational Testing Service. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, the U.S. Government, or the authors’ host institutions.

Acknowledgments

This work was accomplished in support of the IARPA Sirius Program, Broad Agency Announcement (BAA) number IARPA-BAA-11-03.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was accomplished in support of the Intelligence Advanced Research Projects Activity (IARPA) Sirius Program, BAA number IARPA-BAA-11-03.

References

Anderson J. R. (1982). Acquisition of cognitive skill. Psychological Review, 89, 369–406.

Bush R. M. (2017). Serious play: An introduction to the Sirius Research Program. Games and Culture, 12, 227–232.

Cohen (1992). A power primer. Psychological Bulletin, 112, 155–159. doi:10.1037/0033-2909.112.1.155

Dunbar N. E., Jensen M. L., Miller C. H., Bessarabova, Straub S. K., Wilson S. N., … Scheutzler (2014). Mitigating cognitive bias through the use of serious games: Effects of feedback. In Spagnolli, Chittaro, Gamberini (Eds.), Persuasive Technology: 9th International Conference (pp. 92–105). Berlin, Germany: Springer.

Evans J. S. B. T. (2003). In two minds: Dual-process accounts of reasoning. Trends in Cognitive Sciences, 7, 454–459. doi:10.1016/j.tics.2003.08.012

Gertner, Zaromb, Schneider, Roberts R. D., Matthews (2016). The assessment of biases in cognition: Development and evaluation of an assessment instrument for the measurement of cognitive bias (MITRE Technical Report MTR160163). McLean, VA: The MITRE Corporation. Retrieved from https://www.mitre.org/publications/technical-papers/the-assessment-of-biases-in-cognition

Girard, Ecalle, Magnan (2013). Serious games as new educational tools: How effective are they? A meta-analysis of recent studies. Journal of Computer Assisted Learning, 29, 207–219. doi:10.1111/j.1365-2729.2012.00489.x

Glasson G. E. (1989). The effects of hands-on and teacher demonstration laboratory methods on science achievement in relation to reasoning ability and prior knowledge. Journal of Research in Science Teaching, 26, 121–131. doi:10.1002/tea.3660260204

Glöckner, Witteman (2010). Beyond dual-process models: A categorisation of processes underlying intuitive judgement and decision making. Thinking & Reasoning, 16, 1–25. doi:10.1080/13546780903395748

Kahneman, Frederick (2002). Representativeness revisited: Attribute substitution in intuitive judgment. In Gilovich, Griffin, Kahneman (Eds.), Heuristics and biases: The psychology of intuitive judgment (pp. 49–81). New York, NY: Cambridge University Press.

Kopecky, Rhodes R. E., Bos, McKneely, Goforth, Perrone, … Gertner (2016). Evaluation of a serious game intervention for reducing cognitive biases. Manuscript submitted for publication.

Kim E. S., Willson V. L. (2010). Evaluating pretest effects in pre-post studies. Educational and Psychological Measurement, 70, 744–759.

Krull D. S., Loy M. H., Lin, Wang C. F., Chen, Zhao (1999). The fundamental fundamental attribution error: Correspondence bias in individualist and collectivist cultures. Personality and Social Psychology Bulletin, 25, 1208–1219.

Nickerson R. S. (1998). Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology, 2, 175–220.

Pronin, Lin D. Y., Ross (2002). The bias blind spot: Perceptions of bias in self versus others. Personality and Social Psychology Bulletin, 28, 369–381.

Roediger H. L., Karpicke J. D. (2006). Test-enhanced learning: Taking memory tests improves long-term retention. Psychological Science, 17, 249–255. doi:10.1111/j.1467-9280.2006.01693.x

Romero, Usart, Ott (2015). Can serious games contribute to developing and sustaining 21st Century Skills? Games and Culture, 10, 148–177. doi:10.1177/1555412014548919

Ross, Greene, House (1977). The false-consensus effect: An egocentric bias in social perception and attribution process. Journal of Experimental Social Psychology, 13, 279–301.

Shute V. J., Ventura, Bauer, Zapata-Rivera (2009). Melding the power of serious games and embedded assessment to monitor and foster learning: Flow and grow. In Ritterfeld, Cody, Vorderer (Eds.), Serious Games: Mechanisms and effects (pp. 295–321). New York, NY: Routledge.

Soll J. B., Milkman K. L., Payne J. W. (2015). A user’s guide to debiasing. In Keren (Eds.), Wiley-Blackwell handbook of judgment and decision making (pp. 924–951). Chichester, UK: John Wiley & Sons, Ltd. doi:10.1002/9781118468333.ch33

Squire K. D. (2007). Games, learning, and society: Building a field. Educational Technology, 4, 51–54.

Strack, Mussweiler (1997). Explaining the enigmatic anchoring effect: Mechanisms of selective accessibility. Journal of Personality and Social Psychology, 73, 437–446.

Strzalkowski, Symborski (2017). Lessons learned about serious game design and development. Games and Culture, 12, 292–298.

Tobias, Fletcher J. D., Wind A. P. (2014). Game-based learning. In Spector J. M., Merrill D. M., Elen, Bishop M. J. (Eds.), Handbook of research on educational communications and technology (pp. 485–504). New York, NY: Springer.

Tversky, Kahneman (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124–1131. doi:10.1126/science.185.4157.1124

Veinott E. S., Leonard, Papautsky E. L., Perelman, Stankovic, Lorince, Hoffman R. R. (2013, September). The effect of camera perspective and session duration on training decision making in a serious video game. Paper in the proceedings of the IEEE Games Innovation Conference (IGIC), Vancouver, BC. Retrieved from http://ieeexplore.ieee.org/document/6659170/