Abstract
Although a number of researchers consider serious games effective teaching/learning tools, the literature is fragmented when it comes to the factors that shape the users’ experience and views. We currently lack a comprehensive tool that simultaneously examines their effectiveness while contrasting user views. The study is an attempt to fill this gap. It reports the development and validation of a scale which initially included seventy-two items belonging to thirteen factors. A total of 542 university students played two serious games and the aforementioned questionnaire was administered to them. The exploratory and confirmatory factor analyses revealed that twelve factors and fifty-three items should be retained. The final version of the Serious Games Evaluation Scale demonstrated satisfactory reliability and validity. The factorial structure of the questionnaire and its implications for research and practice are also discussed.
Introduction
One of the earliest, as well as one of the most widely used, definitions of serious games (SGs) states that they are deliberately educational games/applications; their recreational aspects are minimal or even absent (Abt, 1970). While many consider SGs effective for promoting learning (Connolly, Boyle, MacArthur, Hainey & Boyle, 2012; de Freitas, 2018; Erhel & Jamet, 2019; Lamb, Annetta, Firestone, & Etopio, 2018) in diverse educational scenarios and domains (Girard, Ecalle, & Magnan, 2013), a number of interconnected issues and concerns have not been adequately addressed to date. SGs’ impact on knowledge acquisition and how this knowledge is transferred to real-life conditions remain rather unclear (Blumberg, Almonte, Anthony & Hashimoto, 2013). The design and development of SGs is a multifaceted interdisciplinary process involving experts from many fields. A typical team is made up of software engineers, usability experts, media designers, and education experts. In addition, other experts may join forces during development, depending on the type of game, the supporting environment, and the teaching/learning requirements of the system. The end result of this collaborative effort requires robust evaluation methods in order to assess each and every aspect of SGs. Yet, this task is difficult, given that there are many different genres of SGs covering an equally large number of learning subjects. Hence, a specific evaluation method for one SG cannot be easily generalized to all SGs (Ravyse, Blignaut, Leendertz, & Woolner, 2017). This methodological deficiency led researchers to believe that we lack an established methodology for measuring their effectiveness (e.g., Serrano-Laguna, Manero, Freire, & Fernández-Manjón, 2018). Indeed, research examining many salient factors that render SGs effective is rather scarce (Ravyse et al., 2017).
What is more, a number of factors suffer from definitional problems (e.g., immersion and presence) or there is no common consensus of what sub-features they encompass, causing some confusion on how they can be evaluated (Fokides, Atsikpasi, Kaimara, & Deliyannis, 2019).
Our approach raises the importance of user evaluation. We believe that it is important to identify how users view SGs as they can pinpoint the positive and negative system aspects, enabling designers to identify the important characteristics by analyzing the essence of the users’ experience. Then again, there is no clear definition of the term “user experience”, as there are conceptualization as well as measurement problems (Buck, Khan, Fagan & Coman, 2018; Koeffel, Hochleitner, Leitner, Haller, Geven, & Tscheligi, 2010). For the industry, the “user experience” is viewed as a synonym of usability and user-centered design, while researchers primarily focus on subjective constructs such as emotions, feelings, and sensations (Buck et al., 2018). Even though this field has become a major concern for both researchers and practitioners in human-computer interaction (Lallemand, Gronier, & Koenig, 2015), its role in determining the success of educational applications and their acceptance by educators and learners is not fully recognized (O’Brien, 2016).
The study at hand presents a comprehensive tool that can measure as many of the factors that come into play as possible, in an attempt to fill the research gap described above. Thirteen factors and seventy-two items were included in an initial draft of a scale and a project was implemented in order to gather data that would allow the examination of its factorial structure and validity. The reasoning for selecting these factors, the research methodology, and the results of the exploratory and confirmatory factor analyses of the scale are discussed in the coming sections.
Background-research objective
As the learning outcomes are what interests most stakeholders, these were the focus of the majority of SGs assessments. In addition, an array of other factors was also examined, scattered across a large number of studies. For example, usability and engagement were considered by some (e.g., All, Castellar, & Van Looy, 2015). For others, engagement and motivation were the most significant factors (e.g., Huang, Huang, & Tschopp, 2010). Khan and Webster (2017) examined narration as a contributing factor. In an effort to holistically examine SGs, Steiner, Hollins, Kluijfhout, Dascalu, Nussbaumer, Albert, and Westera (2015) proposed the inclusion of learning effectiveness, usability, and enjoyment. Gameplay, challenge, fun, feedback, interaction, scenario, immersion, learning-game integration, and game design were emphasized by many (e.g., Marsh, 2011; Muratet, Viallet, Torguet, & Jessel, 2009). Calderón and Ruiz (2015) identified a total of eighteen features that are important, namely: game design and aesthetics, user’s satisfaction, performance, usability-ease of use-playability-learnability, engagement, usefulness, understandability, motivation, educational aspects, learning outcomes, user’s behavior-attitude-emotions, efficacy, social impact, enjoyment, acceptance, and interface. SGs encapsulate both leisure and “serious” purposes. Consequently, the users’ experience and, in turn, the users’ views for a given SG are influenced by its pedagogical as well as its gaming aspects.
The above studies used different factors, different genres of SGs were examined, and the learning subjects or the learning settings were also diverse. What is more, the literature regarding SGs’ assessment can be classified as rather fragmented. Further investigation revealed that there is no common consensus on the definition of some factors. For example, the terms “presence” and “immersion” were used interchangeably and were even examined using identical or similarly worded questions. The same also holds true for “usability” and “ease of use.” Evidently, more research is needed in order to establish which SGs’ features are important in shaping their learning effectiveness and the users’ views (Hersh & Leporini, 2018). Given that, what the study sought to accomplish was the development of a tool that would allow the simultaneous examination of a broad range of SG’s factors responsible for shaping the users’ views. Toward this end, by probing more into the relevant literature, thirteen factors commonly used for assessing the users’ views and experiences were brought to light:
The scale’s development
The next stage involved the resolution of the measuring issues that arise for each factor. Without a doubt, a number of scales measuring different aspects of users’ views do exist, each with its own strengths. Then again, the literature review revealed that for certain SGs’ genres, some scales measured a limited number of aspects of the users’ views, or contained ambiguous questions. Others did not follow optimal practices for scale development. Having the above deficiencies in mind, the development of the study’s scale attempted to follow the most widely accepted steps of scale development and validation, with the first being the item pool generation. More than twenty questionnaires measuring important constructs were reviewed, together with a number of popular questionnaires freely available in the human-computer interaction domain (e.g., the System Usability Scale). The following inclusion and exclusion criteria were applied: (a) a questionnaire had to have been tested and validated in studies concerning SGs (or similar applications in case there were no relevant questionnaires), (b) when multiple questions were used for examining a factor, only the questions with high loadings were selected, and (c) when a factor was examined using a single question, this question was considered for inclusion only if it loaded exceptionally high on its respective factor.
These sources provided an extensive pool containing more than 400 items. An iterative series of modifications and refinements followed. Redundant and similarly phrased items were removed, poorly worded items were either removed or rephrased, and items that were deemed not to contribute to the assessment of users’ views were removed as well. Having a variety of questions and balancing their number (i.e., to avoid the overrepresentation of a factor) was also a consideration. As a result, seventy-two items were retained for the expert review phase in which the content validity was assessed (Worthington & Whittaker, 2006). The pool of questions was translated into Greek by two groups. Each group consisted of one psychologist and one computer science professional with experience in SGs and questionnaire design and development (both proficient in the English language). The resulting versions were then back-translated into English and reviewed by a third group of experts. A unified version was obtained through a consensus meeting for assessing the semantic adaptation. Thus, the initial version of the Serious Games Evaluation Scale (SGES) was formulated, having seventy-two items which were supposed to measure thirteen factors. Table 1 presents SGES’s factors, the number of items in each factor, and the initial source of these items. All items were presented on a 5-point Likert-type scale, worded “Strongly Agree,” “Agree,” “Neutral,” “Disagree,” and “Strongly Disagree.”
The items’ sources
As already stated, the purpose of the study was to develop a scale for measuring the users’ views/experiences when playing SGs. Having developed the initial scale, the next step was to confirm its factorial structure and to validate it. For collecting the necessary data, a project was designed and implemented which lasted from mid-January to mid-March 2018.
Materials
There are many different SGs genres and their quality is dissimilar as well. On the other hand, the study’s objective was not to evaluate a specific SG but the development of a scale able to measure the users’ views (either good or bad). In this respect, the game’s quality and type were irrelevant. Following this line of thinking, two games developed by Triseum (
Participants
The sample consisted of students studying at the Department of Audio and Visual Arts, Ionian University, Corfu, and at the Department of Primary Education, University of the Aegean, Rhodes, both in Greece. It has to be stressed that the selection of these target groups was a deliberate decision. Arté Mecenas’ learning subject is history and arts and Variant: Limits’ is calculus. In this respect, one option was to select students studying art history or advanced maths. Another option was to select participants regardless of their field of studies, in line with the reasoning that led to the selection of the two SGs. Considering both options, it was decided that the sample should simulate an audience which is not entirely focused on either of the two SGs but, at the same time, has an interest in playing both. The purpose was to achieve a balanced sample by avoiding overmotivated/unmotivated and interested/uninterested participants, without compromising the fact that SGs are addressed, by default, to explicit groups of users. As the curriculum of both groups includes courses related to arts and maths, but these courses are not as specific as the learning subjects of the two games, these groups were considered ideal for the study’s purposes.
A total of 570 students enrolled, in exchange for course credit or for the opportunity to fulfill a course’s requirements. They were recruited through a research announcement posted on the Facebook groups these two departments maintain, addressed to anyone interested in participating in the project. There was no initial qualification for research participation.
Procedure
The participating students were gathered in the Universities’ computer labs. They were informed that they were going to play an SG (or two if they were interested in doing so) and complete a questionnaire. They were also informed that the study was conducted on a voluntary basis, that their anonymity was guaranteed, and that completion of the questionnaire was taken as an implicit expression of consent to participation. Their only duty was to play the game(s) for a minimum of two hours (each) and/or complete at least two levels. The introductory/tutorial levels (for familiarizing the players with the interface/controls) did not count as playing the game(s) per se. Immediately after playing the game(s), participants were provided with the questionnaire’s link, as it was available only online. It has to be noted that although each lab could accommodate around thirty students, it was decided that only ten would be present at a time, so that participants would feel more relaxed and have some privacy.
Data screening-descriptive statistics
The questionnaires were checked for partial and unengaged responses. The number of the valid ones left after this screening was 542. Thus, the final sample consisted of an equal number of students, approximately 23 years old (
Results analysis
The initial data set was randomly split into two halves, as both Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA) were to be conducted. The EFA was essential for establishing the underlying dimensions between the variables and the latent constructs since the SGES was based on translated and adapted versions of items from multiple sources. For assessing the structure of the seventy-two items in the initial version of SGES, the data were entered into SPSS 25 and principal axis factor analysis (PAF) with oblique rotation was selected. That is because PAF is more suitable for non-normally distributed data (Costello & Osborne, 2005) and takes into account the covariation between variables (Kline, 2005). Oblique rotation is better suited for research involving human behaviors as it produces more accurate results (Costello & Osborne, 2005). Specifically, the promax rotation (kappa = 4) was used as suggested by others (Matsunaga, 2010).
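The random split described above can be illustrated with a minimal Python sketch. It is not the procedure the authors ran (the analyses were carried out in SPSS); the function name `split_halves`, the fixed seed, and the use of consecutive integers as case identifiers are assumptions made purely for illustration.

```python
import random


def split_halves(case_ids, seed=0):
    """Randomly split case identifiers into two halves (one for EFA, one for CFA)."""
    rng = random.Random(seed)      # fixed seed (assumed) keeps the split reproducible
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


# 542 valid questionnaires, identified here by hypothetical integer ids.
efa_ids, cfa_ids = split_halves(range(542))
```

Each case ends up in exactly one half, so the structure suggested by the EFA is later tested on data that played no part in deriving it.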
Item removal was deemed necessary for improving the clarity of the data structure. Several criteria were applied for item(s) deletion: (a) communalities coefficients below .50, (b) factor loadings below
The EFA was then re-run for a final time, with the fifty-three retained items. The data were well suited for factorial analysis because: (a) the sample size (
All items loaded high on their respective factors (>.60) while each factor averaged above the .70 recommended level (Hair et al., 2010) (Table 3). Cross-loadings between the retained items were not an issue. Moreover, there were no correlations between the factors greater than .70. The total variance explained by the twelve factors was 80.51%. The internal consistency was very good as assessed using Cronbach’s alpha (DeVellis, 2016), ranging between .88 and .95 for the constructs, while the overall score was .963 (Table 3, last row).
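For readers unfamiliar with Cronbach’s alpha, the following Python sketch shows the standard formula, alpha = k/(k-1) × (1 − sum of item variances / variance of the total score), where k is the number of items in a construct. The response values are hypothetical and are not taken from the study’s data; the function name is likewise an assumption.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for one construct.

    rows: one list per respondent, holding that respondent's numeric
    scores on the k items of the construct.
    """
    k = len(rows[0])
    item_columns = list(zip(*rows))                       # one tuple of scores per item
    item_var_sum = sum(variance(col) for col in item_columns)
    total_var = variance([sum(row) for row in rows])      # variance of the summed score
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical answers of five respondents to a three-item construct.
example_rows = [[5, 4, 5], [4, 4, 3], [2, 3, 2], [1, 2, 1], [3, 3, 4]]
alpha = cronbach_alpha(example_rows)
```

Values above .70 are conventionally read as acceptable internal consistency, which is the threshold the study refers to.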
Parallel analysis
Item loadings
Following the EFA, the factorial structure was inputted into AMOS 25 for performing CFA using the remaining half of the sample. The internal consistency of the scale and its constructs was re-assessed using Cronbach’s alpha. It was found that the overall score was
For model fit assessment, the literature recommends the use of three or more fit indices. For the Comparative Fit Index (CFI), values exceeding .95 indicate excellent fit (Hu & Bentler, 1999). For the Root Mean Square Error of Approximation (RMSEA), values less than .06 also indicate a very good model fit (Hu & Bentler, 1999). For the Standardized Root Mean Square Residual (SRMR), values close to zero indicate a perfect fit, while values less than .08 are used as a cut-off point, indicating excellent fit (Hu & Bentler, 1999; McDonald & Ho, 2002). Finally, experts recommend not relying on the chi-square test statistic when the sample size exceeds 200 cases, as it tends to indicate statistically significant differences (Hu & Bentler, 1999). Instead, they recommend the use of the minimum discrepancy divided by its degrees of freedom (
Results for the measurement model
Notes. –: This value was fixed at 1.00 for model identification purposes; SE: standardized estimate.

Results of the CFA.
The hypothesized twelve-factor model was compared against three alternative models in terms of overall model fit. The eleven-factor model merged enjoyment and feedback, while the ten-factor model merged enjoyment, narrative, and feedback. A one-factor model was also used as a baseline. All models had the same number of cases and observed variables or items. On the basis of the results, as presented in Table 5, it is evident that the twelve-factor solution had the best overall fit indices.
SGES’s convergent validity was checked by measuring the average variance extracted (AVE) (Table 6). The AVE in all but one case was above the .70 level as suggested by Hu and Bentler (1999). Although the AVE of Pre fell slightly below the .70 threshold, given that all the other indices were more than satisfactory, it was considered an acceptable deviation from the recommended values. Moreover, reliability was also evident, as all critical ratios were above .70 (Hancock, 2001). The presence of discriminant validity was evaluated by comparing the square root of the AVE for any given factor with the correlations between this factor and all other factors (Hu & Bentler, 1999). It was found that the variance shared between a factor and any other factor was less than the variance that the construct shared with its measures. Thus, discriminant validity was satisfactory in all cases.
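The two checks described above can be sketched in a few lines of Python. The AVE of a factor is the mean of its squared standardized loadings, and the discriminant validity check compares the square root of each factor’s AVE with that factor’s correlations with the other factors. The loadings and the correlation below are hypothetical values chosen for illustration, not the study’s results, and the function names are assumptions.

```python
from math import sqrt


def ave(loadings):
    """Average variance extracted: mean of the squared standardized loadings."""
    return sum(l * l for l in loadings) / len(loadings)


def discriminant_ok(ave_by_factor, correlations):
    """True if, for every pair of factors, the square root of each factor's
    AVE exceeds the absolute correlation between the two factors.

    correlations: dict mapping (factor_a, factor_b) to their correlation.
    """
    return all(
        sqrt(ave_by_factor[a]) > abs(r) and sqrt(ave_by_factor[b]) > abs(r)
        for (a, b), r in correlations.items()
    )


# Hypothetical standardized loadings for two factors and their correlation.
loadings = {"Pre": [0.80, 0.82, 0.81], "Enj": [0.90, 0.88, 0.86]}
ave_by_factor = {name: ave(values) for name, values in loadings.items()}
valid = discriminant_ok(ave_by_factor, {("Pre", "Enj"): 0.55})
```

In this illustration the square roots of both AVEs are well above the .55 correlation, so the check passes for the pair.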
Model fit measures’ comparisons
Finally, both the EFA and the CFA were re-run, but this time the data were split according to the SG that was used. The purpose was to test whether the data coming from each SG separately had an impact on the scale’s factorial structure, validity, and reliability. The results indicated that the factorial structure remained unchanged in both SGs. In addition, the model fit indices were excellent for both games (SG1:
Convergent and discriminant validity
Notes. CR: Critical ratio; AVE: Average Variance Extracted; diagonal in bold: square root of AVE extracted from observed variables; off-diagonal: correlations between constructs.
In conclusion, the results of the EFA and CFA established the instrument’s factorial structure and demonstrated that it had satisfactory validity and reliability. The final version of the SGES with the fifty-three retained items and their respective constructs is presented in the Appendix.
The literature proposes several assessment methods for serious games. According to Ifenthaler, Eseryel, and Ge (2012) there are three distinct types of evaluation: (a) game scoring (e.g., targets achieved, obstacles or time needed to complete a task or an iteration), (b) external assessment (e.g., interviews, tests or surveys), and (c) embedded or internal assessment (e.g., learner’s behavior such as clickstreams or log files). Faizan, Löffler, Heininger, Utesch, and Krcmar (2019) in their literature review classified the evaluation methods according to three phases: (a) pre-game, (b) in-game, and (c) post-game, in order to assess their learning effectiveness. In general, researchers collect data through questionnaires regarding respondents’ opinions, attitudes, feelings, and perceptions on a particular matter. Questionnaires are the most frequently used evaluation tool during the pre- and post-game phases. However, the in-game assessment is also important because this type of evaluation provides instant feedback about the learning process. Such investigations traditionally include counting the number of mistakes, checkpoints, in-game performance tracking, storyboarding, game experience questionnaires administered while playing, monitoring students’ progress, think-aloud protocols, self-reports, interviews, and unobtrusive observation. Along with observation, useful tools also include video recordings, detection of facial electromyography activity, measuring the effect of emotion on multiple vocal cues, and physiological activity measures (e.g., heart rate, blood volume pressure, and skin conductance). Currently, a variety of wearable sensors is used for such measurements (Rebelo, Noriega, Duarte, & Soares, 2012). The use of these tools requires the subjects’ consent, in line with the researchers’ ethical commitments.
Others take the investigation further by combining interaction parameters, facial expression recognition, and/or audio analysis with machine learning techniques (Bartlett, Hager, Ekman, & Sejnowski, 1999; Pham, Kim, Lu, Jung, & Won, 2019; Wu & Lin, 2018; Zeng, Pantic, Roisman, & Huang, 2009). The main advantage of these nonverbal methods is that measurements are more objective compared with subjective data derived from questionnaires and self-reports. Post-game evaluation includes questionnaires that measure learners’ knowledge after the game. Although all the above methods are valid (and valuable), each has its own field of use depending on the researchers’ needs. Consequently, in our case, a post-game questionnaire was chosen as an assessment method, because the objective of our work was to formulate a scale to holistically assess serious games.
Indeed, both the industry and researchers are in need of a comprehensive scale which is psychometrically validated and suitable for evaluation purposes. Toward this end, a new scale called SGES was developed based on a rigorous multistage system of scale development and validation. In this pursuit, a sufficient number of resources (e.g., existing scales and general-purpose scales) concerning SGs’ evaluation were screened in order to generate the initial item pool. The item pool underwent a series of iterative phases of modifications and expert reviews before being pilot-tested. Once refined, the scale was administered to a large sample of 542 university students, who evaluated two SGs. For data analyses, EFA and CFA were performed in order to uncover the underlying factors and to validate the scale.
Out of the initial seventy-two items and thirteen factors, nineteen items and one factor (perceived playability) had to be dropped during the EFA. The retained factors were: presence, enjoyment, perceived learning effectiveness, perceived narratives’ adequacy, perceived realism, perceived feedback’s adequacy, perceived audiovisual adequacy, perceived relevance to personal interests, perceived goals’ clarity, perceived ease of use, perceived adequacy of the learning material, and motivation. A number of reasons might have contributed to this outcome. All questions were originally in English; some items might have been poorly translated into Greek or participants might have found their meaning ambiguous. Indeed, in the final scale, there is an item which was supposed to measure presence (Pre6: “There were times when the virtual objects seemed to be as real as the real ones”), but it proved to be the question with the third strongest loading on perceived realism. The most significant reason for dropping items was the rules that were applied for item and factor retention. In general, Hair et al.’s (2010) recommendation for high item loadings (>.60) and high factor averages (>.70) was followed. Perceived playability’s items proved to be the most problematic ones; they either loaded low on perceived ease of use or there were significant cross-loadings with the above factor as well as with other factors (e.g., perceived feedback’s adequacy). On the basis of the results, it seems that participants viewed ease of use, usability, and playability as a single concept and merged them in one factor under the name “perceived ease of use”.
Even so, the final scale does not contain factors represented by fewer than three items. Had this been the case, these factors might have been considered unstable (Costello & Osborne, 2005; Raubenheimer, 2004). Actually, only four factors are measured using three items (motivation, perceived relevance to personal interests, perceived goals’ clarity, and perceived feedback’s adequacy), while all the other factors are measured with at least four. Yet, the total variance explained by the fifty-three items was 80.51%, which is more than satisfactory (Hair et al., 2010). Moreover, SGES’s reliability and internal consistency, as a whole and per construct, were well above the .70 threshold (
The same holds true for SGES’s convergent and discriminant validity as no problems were noted during the CFA. Indeed, results obtained from the CFA demonstrated that the model fit indices were exceptionally good and that the same applied for its discriminant validity. One minor deviation from the recommended threshold for convergent validity was noted in one factor, namely presence (Pre = .655, recommended value > .70), but it was considered acceptable as it does not affect the scale’s overall convergent validity.
Taken together, the above findings suggest that SGES is a quite robust scale that is also short in terms of how many items it has relative to how many factors it measures. The estimated time needed for completing it is between ten and fifteen minutes.
Implications for research and practice
Past research provided evidence that SGs can be effective learning tools (Connolly et al., 2012; de Freitas, 2018; Erhel & Jamet, 2019; Lamb et al., 2018) and that they can be useful in a wide range of learning subjects and educational scenarios (Girard et al., 2013). While this holds true, until now, there was no common consensus on how to measure the users’ views for these applications. Though studies scrutinizing the impact of certain factors on the learning outcomes do exist, each analyzed either one (e.g., Khan & Webster, 2017) or a different set of factors (e.g., All et al., 2015; Marsh, 2011; Steiner et al., 2015) and each used different instruments for validating these factors. Very few studies implied that a larger number of factors should be considered when evaluating SGs but did not suggest (or validate) an instrument for examining them (e.g., Calderón & Ruiz, 2015). On the other hand, in this study, by reviewing the relevant literature, twelve subjective factors were located and a scale for measuring them was developed and tested. The scale’s substantial internal consistency, stability, and validity, are indicators that, indeed, these factors shape the user’s views when playing SGs. Thus, the study’s contribution to research is that it proposes an instrument for measuring multiple factors. This, in turn, can lead to the establishment of a much-desired common methodology for measuring SGs effectiveness (Serrano-Laguna et al., 2018). In addition, as all the factors included in SGES are subjective ones, it can provide a better understanding of what shapes users’ experiences when playing SGs. As presented in a preceding section, while the term “user experience” is ill-defined (Buck et al., 2018; Koeffel et al., 2010), it is acknowledged that it is an important aspect of human-computer interaction (Lallemand et al., 2015) and plays an important role in the success of educational applications.
Additionally, there are two reasons that render the scale a flexible tool. First, it can be used for assessing a variety of SGs. That is because two quite different SGs were used and the analysis provided evidence that, in both cases, the attributes of the scale remained unchanged. In this respect, SGES can be considered as a first step in overcoming a major problem highlighted by others, namely, the generalizability of the evaluation methods (e.g., Ravyse et al., 2017). Second, it can be assumed, within reasonable limits, that the scale has a modular structure, meaning that factors can be excluded without altering the instrument’s validity. The strong items’ loadings on their respective factors and their minimal cross-correlations justify this assumption. As a result, the scale can be used in a variety of situations, depending on the researchers’ needs, allowing them to assess different affordances, individually or together. For example, if an SG under evaluation has no narrative components, researchers (or practitioners for that matter) can remove the perceived narratives’ adequacy component from the scale.
The practical applications of the SGES derive from its robust structure and satisfactory attributes. It is suitable for a wide variety of target groups and SGs, as long as the games examined possess the aspects that are being measured and players are capable of using them. Indeed, two different SGs were used and the sample to which it was administered consisted of university students with mixed computer and game-playing competences (ranging from novice to expert level, see section “Data screening-descriptive statistics”). Thus, experts from the game industry can average (or sum) the scores in each factor and/or obtain a composite score indicating the overall users’ views and experiences. These scores can then be used for comparing different SGs (either of the same or different genre). Alternatively, they can do the same for comparing different versions of the same SG and determine if the latest version is perceived to be an improvement compared to the previous ones. Educators can also benefit from the use of SGES in a similar fashion. By administering it to their students, they can determine which aspects of an SG contributed either positively or negatively to the learning outcomes. Also, within specific and focused application fields, by having students use different systems, educators can make the same determination for each system, enabling them to combine features and create superior systems before initiating their own developmental process.
Although developers and educators can independently benefit from the use of SGES, our view is that this process has to be implemented in a combined fashion. A single run of the evaluation process reveals to both developers and educators the strong and weak parts of the SG. Thus, the decomposition that the method inherently offers can help speed up the iterative or recursive lifecycle of the environment, as both parties can identify and remedy system deficiencies (interaction design, usability, aesthetics) and content deficiencies together.
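The scoring approach outlined above (averaging the items of each factor and deriving a composite score) could be implemented along the following lines. This is a sketch under stated assumptions: the 1 to 5 numeric coding of the response labels, the item identifiers (e.g., `Mot1`), the two-factor item map, and the choice of the mean of factor means as the composite score are all hypothetical, since the study does not prescribe a scoring standard.

```python
from statistics import mean

# Assumed numeric coding of the five response options used by the scale.
LIKERT = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly Agree": 5,
}


def factor_scores(responses, item_map):
    """Average the coded answers of one respondent within each factor.

    responses: {item_id: response label}; item_map: {factor: [item_ids]}.
    """
    return {
        factor: mean([LIKERT[responses[item]] for item in items])
        for factor, items in item_map.items()
    }


def composite_score(scores):
    """Composite score, taken here (by assumption) as the mean of the factor means."""
    return mean(list(scores.values()))


# Hypothetical item identifiers and answers from a single respondent.
item_map = {"Motivation": ["Mot1", "Mot2", "Mot3"], "Enjoyment": ["Enj1", "Enj2"]}
responses = {
    "Mot1": "Agree", "Mot2": "Strongly Agree", "Mot3": "Neutral",
    "Enj1": "Agree", "Enj2": "Disagree",
}
scores = factor_scores(responses, item_map)
overall = composite_score(scores)
```

Because each factor is scored independently, a factor that does not apply to a given SG can simply be left out of `item_map`, which mirrors the modular use of the scale discussed earlier.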
Conclusion
The study has several limitations that need to be addressed. The trustworthiness of participants’ responses in questionnaires is always a concern. University students coming from two quite different disciplines of study took part in the study; the sample might not be representative of the intended audience of the SGs that were used. Participants were asked to play the games for two hours. This length of time might not be sufficient for users to form a comprehensive view of the SGs. Moreover, the study was based on the assumption that SGs’ quality and genre are not important as it sought to develop a scale for examining how users view them. On the other hand, it is unknown how participants might have reacted if different SGs were used. Since the study was conducted in a controlled environment with other participants present, it is possible that their views were affected; if the SGs were played in a more relaxed environment (e.g., at home) the results might have been different. Another limitation is the scale’s newness. As it was just developed, there is no information on SGES’s scoring standard.
The above limitations may serve as directions for future research. Since SGES is a new scale, further validations are definitely needed that will provide evidence for its validity and reliability. Different target groups (e.g., in terms of age, studies, and level of education), as well as a larger variety of SGs, will demonstrate if and how its structure is affected. Additional factors that shape the learning and the gaming experience can be considered. A research path the authors are already planning to explore is to compare the subjective factors the scale measures with data that measure actual learning (e.g., through knowledge acquisition tests). This will provide evidence on how (and to what extent) the factors interact with each other and what impact they have on the learning outcomes.
Finally, an important factor that needs to be further examined is the different playing patterns/intentions and the resulting game adaptations that have to occur in order to render SGs useful for each user. For example, users may play a game without prior knowledge of the subject matter, while others may play it after having studied the theory and related texts. Under certain circumstances, gamers may use hints and tips to progress faster within the game (a common practice across players), while others may not use this feature. In future revisions of SGES, questions related to the above can be considered for inclusion. Moreover, it would be interesting to examine whether it is possible to integrate the scale into an SG allowing the latter to dynamically adjust, leading to a personalized and improved gamers’ experience.
In sum, despite the above limitations, the study contributes to the relevant literature by providing a tool that simultaneously assesses many factors that shape the users’ views when playing SGs. Thus, the study’s results might prove useful to educators, researchers, and developers in planning their lessons, in understanding the interactions between these factors, and for designing even more effective SGs.
Appendix
