Abstract
Social-emotional learning (SEL) is gaining increasing attention in education policy and practice due to growing evidence that related constructs are strongly predictive of long-term academic achievement and attainment. However, the work of educators to support SEL is hampered by a lack of available, unbiased measures of related competencies. In this study we conducted a literature review to investigate whether assessment metadata (typically data relevant to how students behave on a test or survey) can provide information on SEL constructs. Implications of this new source of SEL data for practice, policy, and research are discussed.
Social-emotional learning (SEL) is an old concept gaining new traction in education practice and policy. Psychological constructs associated with SEL often fall into broad categories like interpersonal, intrapersonal, and deep cognitive competencies (Soland, Hamilton, & Stecher, 2013), and include relatively new concepts like grit (Duckworth & Quinn, 2009) and growth mindset (Dweck, 2006). Research increasingly shows that such constructs predict long-term outcomes like high school graduation and earnings (Almlund, Duckworth, Heckman, & Kautz, 2011; Belfield et al., 2015; Dweck, Walton, & Cohen, 2011; Heckman & Vytlacil, 2001). As a result, SEL is being incorporated into education accountability and policy. For example, a consortium of California districts serving over 1.5 million students used SEL measures in its accountability system under a No Child Left Behind waiver (West, Buckley, Krachman, & Bookman, 2018). At the federal level, the Every Student Succeeds Act (ESSA) of 2015 requires states to include a nonacademic indicator in their accountability plans. Though most states have opted to rely on administrative data like attendance rates, the law nonetheless signals a desire to hold schools accountable for outcomes beyond achievement that are related to SEL.
Despite the newfound emphasis on SEL, there are often insufficient measures to assess student progress and evaluate program impacts on SEL-related constructs. The biggest reason for the shortage is that most measures take the form of surveys, which can suffer from self-report bias, contextual variability, respondent disengagement, and other factors that undermine inferences educators wish to make based on them (Anderson, Thier, & Pitts, 2017; Duckworth & Yeager, 2015; Piedmont, McCrae, Riemann, & Angleitner, 2000). Self-report bias is especially problematic and results when students are unwilling to report negative behaviors or attitudes, or have trouble accurately appraising themselves on the construct (Duckworth & Yeager, 2015). Although new measures like performance-based assessments (requiring students to directly demonstrate the competency in question) and administrative data on behaviors like low grades are available alternatives, questions of cultural relevance (Gregory & Fergus, 2017) and logistical, financial, and political barriers to administering the assessments remain (Duckworth & Yeager, 2015; Soland et al., 2013).
Though this measurement shortage affects a wide swath of SEL-related constructs, the issue is pronounced for intrapersonal mindsets and competencies, which often involve how students perceive themselves academically and are therefore hard to measure beyond self-report instruments. These intrapersonal constructs have been shown to be vital to high school graduation and college readiness (Dweck et al., 2011; Farrington et al., 2013; West et al., 2017). For example, when students start to lose faith in their academic abilities (low self-efficacy), their motivation to undertake tasks usually dwindles (Bandura, 1997). This lack of motivation can manifest itself in behaviors associated with low self-management, like course failures and chronic absenteeism, and in decreased willingness to work hard (conscientiousness) and persist on long-term tasks (grit) (Dweck et al., 2011; Farrington et al., 2013). Altogether, these behaviors are often early warning signs that a student may drop out and is disengaging from school more generally (Farrington et al., 2013; Soland, Jensen, Keys, Wolk, & Bi, 2019).
In this study, we consider a potential new source of data on intrapersonal SEL competencies: metadata captured when students take a test, survey, or other measure. Specifically, after conducting a systematic review of the literature that initially encompassed 3,129 articles, we discuss 15 empirical studies examining whether metadata captured when students take an assessment can provide information relevant to intrapersonal SEL constructs. Although these assessments are often not meant to measure SEL at all, taking an assessment like a math achievement test requires not only knowledge of the academic content, but also belief in one’s mastery of the content (self-efficacy) and the conscientiousness and self-management to stay focused on the task at hand (Kyllonen & Kell, 2018; Wise, 2015). These metadata (defined in Table 1) include whether students (a) spend extra time on difficult items, (b) rapidly guess on items, (c) do not provide an answer to a question, or (d) select responses idiosyncratically.
Table 1. Definition of Assessment Metadata Used as Proxies for SEL Constructs
Note. SEL = social-emotional learning.
Findings from our review suggest that these metadata are generally correlated with other measures of relevant SEL constructs (concurrent validity evidence) and with educational outcomes like finishing high school (predictive validity evidence). Although not all intrapersonal SEL constructs reviewed were associated with these metadata, research generally suggests that four constructs, which we define in Table 2, are correlated with assessment metadata: self-efficacy, conscientiousness, self-management, and grit. Accordingly, most of the studies we reviewed suggest that such metadata may be useful as a safeguard against self-report bias on SEL surveys of those four constructs, or when no SEL data are available. We discuss what the literature says on associations among these constructs, academic disengagement behaviors, and assessment metadata, including cases in which there does not seem to be a strong association between metadata and a given SEL construct.
Table 2. Definition of Constructs Shown to Have an Association With Metadata
As our review helped reveal, for certain SEL constructs, these assessment behaviors are like the measurement equivalent of natural experiments in economics: they are not meant to be direct assessments (just as natural experiments are not designed to randomize study participants), yet variation in outcomes from these naturally occurring phenomena can provide meaningful information as if they were intended for those purposes. By synthesizing and reviewing the 15 studies considered herein, we investigate whether a compelling validity argument for uses of these data as proxies for intrapersonal SEL constructs can be made. We begin by considering problems with current SEL measures before reviewing the articles on assessment metadata and, ultimately, discussing metadata’s benefits and limitations related to SEL.
Moving Beyond Self-Report Measures
Obtaining robust measures of SEL competencies (including the ones emphasized in this review) is not easy. Researchers and evaluators mostly rely on data from self-reported psychometric scales, where students are asked to answer a series of Likert-type items. Although many surveys are supported by validity evidence for their intended uses (Soland et al., 2013), they often have major shortcomings (Duckworth & Yeager, 2015). For example, survey scores can be affected by social desirability bias, which occurs when students respond to items in a manner that is viewed favorably by others but does not reflect their actual position on the latent variable (Krosnick, Narayan, & Smith, 1996). Survey scores can also suffer from reference group bias when students have different implicit standards for what it takes to excel on a given construct, an issue that can skew comparisons across groups (West et al., 2018). Finally, when using self-reported measures to capture SEL competencies, some students respond carelessly, which typically provides biased information (Curran et al., 2015).
One way researchers are attempting to move beyond self-report measures is by quantifying direct behaviors related to the SEL construct of interest. In some cases, educators use administrative data on behaviors related to SEL. In the early warning systems literature, which is devoted to identifying and supporting students who are at risk of dropping out of school, behavioral indicators of disengagement are prominent (Allensworth, 2013; Balfanz, Herzog, & Mac Iver, 2007). As Farrington and colleagues (2012) point out, “academic behaviors are the visible, outward signs that a student is engaged and putting forth effort to learn. Because they are observable behaviors, they are also relatively easy to describe, monitor, and measure” (p. 8). For example, students who are chronically absent, fail courses, and are suspended often are much more likely to drop out (Allensworth, 2013; Balfanz et al., 2007). These major disengagement behaviors typically begin with much milder behaviors like coming to class unprepared and struggling to complete independent work (Farrington et al., 2012). Such small behaviors often stem from low self-efficacy and are manifestations of low self-management, conscientiousness, and grit, all of which are associated with dropping out (Bandura, 1997; Briesch & Chafouleas, 2009; Duckworth & Quinn, 2009; Zamarro, Nichols, Duckworth, & D’Mello, 2018).
One potential problem with gaining data on SEL by observing real behaviors that occur during schooling is that such behaviors can often be related to a host of factors that have nothing to do with a particular social-emotional construct. For example, although students who fail courses may have low academic self-efficacy, they may also receive very low grades due to personal or situational issues unrelated to self-efficacy. Further, these personal, situational, and social-emotional factors are often interrelated: For instance, students’ personal circumstances could lead to poor engagement in school at an early age, which in turn contributes to a belief that they cannot do well in school, which is then reinforced by consistently low grades.
Researchers have responded to these limitations by developing direct, performance-based assessments of social-emotional competencies (Miller & Linn, 2000). By constraining the conditions in which data are collected, the hope is to help standardize results and potentially remove irrelevant sources of variance. Such measures assess these competencies by having students directly perform tasks that relate to the construct of interest, which helps avoid self-report bias and may provide more authentic assessments of multifaceted constructs like creativity (Soland et al., 2013).
There are many examples of performance assessments being used to measure constructs related to SEL. A classic example is Mischel, Ebbesen, and Zeiss’s (1972) famous “Marshmallow Test,” which was designed to measure self-regulatory skills that are highly related to constructs like self-management. More recently, Galla et al. (2014) developed an Academic Diligence Task, a performance assessment that further standardizes the initial Marshmallow Test.
One problem with such assessments is that construct-irrelevant variance can still be an issue if contextual factors influence results (Shavelson, Baxter, & Pine, 1991). For example, new studies suggest that the Marshmallow Test, as well as certain types of SEL measures more generally, may capture differences in socioeconomic status rather than actual differences on the construct of interest (Watts, Duncan, & Quan, 2018). This result reflects a broader concern across types of measures that particular behaviors associated with SEL, whether measured directly or via self-report, are based on a particular cultural frame of reference (Gregory & Fergus, 2017). Research, for instance, suggests that Black and White students may exhibit the same disruptive behaviors, but Black students are more likely to be suspended for them (Anderson & Ritter, 2018; Skiba et al., 2011). As another example unrelated to racial bias, a science performance task might produce biased results for a student if, say, a pipette breaks. To overcome such challenges, computer technology is being used to make contextual factors more standard (Soland et al., 2013). For instance, the Programme for International Student Assessment (PISA) now offers a test of collaborative problem solving during which the student directly collaborates with an avatar, a simulated person with known problem-solving and teamwork capacities.
Despite advances in performance assessments that can help avoid self-report bias and standardize conditions in ways that can reduce other sources of construct-irrelevant bias, these measures still have limitations. First, tasks are generally very costly and difficult to collect in large samples, although new technologies are making this easier (Soland et al., 2013). Second, it is not always clear that artificial tasks completed in highly constrained settings are generalizable to other contexts (Bardsley, 2008; Duckworth & Yeager, 2015; Falk & Heckman, 2009). Finally, existing performance tasks can be difficult to implement multiple times, as participants might gain familiarity after having performed the task once, upwardly biasing subsequent scores (Bardsley, 2008; Duckworth & Yeager, 2015; Falk & Heckman, 2009). Given these challenges, the pace at which performance tasks are developed is slow, and their adoption among educators may be even slower, further contributing to the shortage in available SEL measures (Duckworth & Yeager, 2015; Soland et al., 2013).
Emerging Evidence on the Relationship Between Metadata and Social-Emotional Competencies
Until recently, virtually no research considered the use of test and survey behavioral metadata to gain information on students’ SEL needs. Using metadata is essentially a hybrid of observing student behaviors in school and measuring student behaviors in a controlled environment akin to those in performance tasks. Although the metadata are captured during the administration of assessments that occur during the course of schooling, the conditions are often more consistent than during regular classroom instruction due to standardized protocols surrounding testing. Though most assessments are not designed to capture behaviors related to SEL (e.g., skipping items on a survey), related metadata are often available. We review the literature examining potential links between metadata and SEL constructs.
Literature Search Procedures
Identifying relevant SEL competencies and metadata
We began our search with the recognition that SEL constructs are wide-ranging and not easily classified. We drew upon Farrington and colleagues’ (2012) review of the research on connections between SEL constructs and academic disengagement to focus our search. We ultimately limited ourselves to the two intrapersonal categories of SEL competencies that Farrington and colleagues (2012) identified as most proximal to academic disengagement behaviors: academic perseverance and mindsets. We searched for all constructs identified by Farrington and colleagues (2012) in those two categories, as well as any related construct in other SEL literature reviews (Duckworth & Yeager, 2015; Dweck et al., 2011; Soland et al., 2013). Those specific constructs are provided in Appendix Table A1, which we describe below.
Similarly, we defined “metadata” as any of the four types of data listed in Table 1. Metadata were limited to these particular types based on prior research suggesting their association with assessment disengagement is supported by the most comprehensive validity evidence for such a use (Curran, 2015; Wise, 2015). For instance, although test decline (decreasing performance over the course of a test) can be related to disengagement (Borghans & Schils, 2012), it is also associated with test fatigue and factors irrelevant to the construct of engagement. Therefore, there was not a strong research base supporting its use as a proxy for engagement.
Search procedure
Our systematic review unfolded in three phases that mirror those used by other scholars (Kraft, Blazar, & Hogan, 2017). First, we searched for relevant terms using the online databases Academic Search Premier, EconLit, Ed Abstracts, ERIC, Google Scholar, ProQuest, and PsycINFO. For each search, we paired at least one relevant SEL term with at least one relevant assessment engagement term. Counts of database hits by combined SEL-engagement search terms can be found in Appendix Tables A1 and A2. Second, we reviewed references from other relevant literature reviews, including reviews of assessment engagement (Curran, 2015; Wise, 2015) and SEL measurement (Duckworth & Yeager, 2015; Dweck et al., 2011; Farrington et al., 2012; Soland et al., 2013). Third, we contacted authors with expertise in test/survey disengagement and measuring SEL to try to ensure our search did not miss any major studies in the field. Altogether, this process generated 3,129 studies for review.
Inclusion/exclusion criteria
We then used several criteria for inclusion/exclusion to limit the initial list to the studies we ultimately discuss. First and perhaps most obviously, we only included studies that examined the relationship between some outcome related to SEL and assessment metadata. Some search term combinations yielded a large number of hits but very few relevant articles. For example, searching for “self-efficacy” and “survey nonresponse” produced nearly 600 articles, but most of them related to problems of low survey response on measures of self-efficacy (e.g., “Development of the Alcohol and Other Drug Self-Efficacy Scale”, Kranz, 2003), not to the relationship between survey nonresponse behavior and self-efficacy.
As an important caveat to the first criterion, we are primarily interested in concurrent validity evidence that supports use of test engagement as a proxy for SEL competencies (or similarly disconfirming evidence). That is, we only included studies that examined the relationship between assessment engagement metadata and another measure of SEL, not studies that considered test engagement to be a standalone social-emotional outcome of interest. For example, DeAngelis (2018) studied the relationship between private schooling and student effort on PISA assessments. Although such a study is certainly relevant to questions of assessment disengagement, the study did not primarily compare the association between engagement on the PISA to a different SEL measure. Therefore, the study was not included in our review.
Second, we limited the review to studies completed on or before December 31, 2018. Third, although we focused on peer-reviewed research, we did not exclude non-peer-reviewed working papers because, given this is a young line of research, several emerging studies with direct relevance to this already small field have been published at research institutions or through various university working paper series. Findings from such studies should be treated with appropriate skepticism.
Ultimately, the 15 studies satisfying our inclusion/exclusion criteria are provided in Tables 3 and 4. Table 3 shows results from the studies we discuss related to achievement test metadata, including the authors, data sources, types of assessment metadata, and findings. Table 4 shows the same but for survey metadata.
Table 3. Summary of Findings on the Relationships Among Achievement Test Metadata, SEL, and Academic Engagement
Table 4. Summary of Findings on the Relationships Among Survey Metadata, SEL, Academic Engagement, and Later-Life Outcomes
Evidence on Metadata From Achievement Tests
Although achievement test metadata have been used as a measurement tool for decades, those data are typically used to address measurement problems on those tests, not to provide information on social-emotional competencies. The most comprehensive work on achievement test metadata was developed by Wise and Kong (2005) and reviewed by Wise (2015). Wise and his colleagues showed that student engagement on achievement tests can be measured by identifying responses to items that are provided so rapidly that the content of those items could not have been understood (DeMars, 2007; Rios, Liu, & Bridgeman, 2014; Wise & Kong, 2005). For example, if a student responds to an item with a lengthy reading passage in under 10 seconds, one can be fairly certain the student did not engage with that item. This behavior is often referred to as “rapid guessing” because students who respond rapidly enough get items correct at a rate no better than chance (DeMars, 2007; Kong, Wise, & Bhola, 2007; Wise & Kong, 2005). Rapid guessing is largely uncorrelated with academic ability, meaning that this behavior is not just occurring because students do not understand the content (Wise, 2015).
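To make the idea concrete, the following is a minimal sketch of how rapid guessing might be flagged from item response times. It assumes a per-item time threshold is already available; the `ItemResponse` structure, the item names, and the threshold values are hypothetical illustrations, not the procedure used by Wise and Kong (2005) or any other study reviewed here.

```python
from dataclasses import dataclass


@dataclass
class ItemResponse:
    item_id: str
    seconds: float    # time the student spent on the item
    threshold: float  # hypothetical per-item rapid-guess cutoff, in seconds


def is_rapid_guess(response: ItemResponse) -> bool:
    """Flag a response given faster than the item's cutoff."""
    return response.seconds < response.threshold


def rapid_guess_rate(responses: list[ItemResponse]) -> float:
    """Proportion of a student's responses flagged as rapid guesses."""
    if not responses:
        return 0.0
    flagged = sum(is_rapid_guess(r) for r in responses)
    return flagged / len(responses)


# Illustrative data only: a long reading passage gets a 10-second cutoff,
# mirroring the example in the text; the shorter-item cutoff is assumed.
responses = [
    ItemResponse("reading_passage_1", 6.0, 10.0),
    ItemResponse("vocab_1", 14.0, 3.0),
    ItemResponse("reading_passage_2", 95.0, 10.0),
    ItemResponse("vocab_2", 2.0, 3.0),
]
print(rapid_guess_rate(responses))
```

Allowing the cutoff to vary by item reflects the intuition in the text that an item with a lengthy reading passage needs more time to be understood than a short one; how such cutoffs should be set is left open here.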
Emerging research shows that the amount of time students spend on achievement test items—and whether students rapidly guessed—is related to more than test engagement. Although most of these studies involve tests with minimal stakes for students, there is initial evidence that rapid guessing also occurs on tests used for state accountability purposes (Soland, 2019). Work conducted by Soland, Jensen, Keys, Wolk, and Bi (2019) showed that rates of rapid guessing are related to social-emotional competencies, and to broader disengagement from school. As reported in Table 3, partial correlations between rapid guessing rates and self-management scores were 0.26, and the same correlations for self-efficacy were 0.12 (both significant at the 0.01 level). In terms of academic disengagement, which often stems from factors like low self-management and self-efficacy (Farrington et al., 2012), Table 3 shows that students who rapidly guessed on 10% or more of the items on a given test had lower GPAs and attendance, as well as higher rates of suspensions and detentions. In tandem, these findings suggest that students who rapidly guess are often disengaging from not only the test, but from school more generally, and may be at risk of dropping out.
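For readers unfamiliar with partial correlations, the sketch below computes a first-order partial correlation, that is, the correlation between two variables after removing the linear influence of a single control variable. The choice of control (a hypothetical prior achievement score), the variable names, and the data are assumptions for illustration only; the studies summarized in Table 3 may have used different or additional covariates.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def partial_correlation(x: list[float], y: list[float], z: list[float]) -> float:
    """Correlation between x and y, controlling for a single variable z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical values for five students (not data from any reviewed study).
rapid_guess_rates = [0.00, 0.05, 0.10, 0.20, 0.30]
self_management = [4.5, 4.0, 3.5, 3.0, 2.0]
prior_achievement = [210.0, 190.0, 205.0, 185.0, 200.0]

print(partial_correlation(rapid_guess_rates, self_management, prior_achievement))
```

When the control variable is unrelated to both measures, the partial correlation equals the ordinary correlation; the two diverge to the extent that the control accounts for their shared variance.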
Soland (2019) also found that rapid guessing can provide information on how students respond to academic challenge. Students in his sample were 18 times as likely to rapidly guess on difficult items if they were in the bottom quartile of self-efficacy scores compared to students in the top quartile. Similarly, students spent 1.5 times longer on very difficult items if their self-efficacy scores were in the top quartile rather than the bottom. Thus, achievement tests may yield data on how students respond to challenging tasks by capturing duration data that help quantify whether students persist on especially difficult items.
Related work indicates that measures of test engagement and item durations are also related to conscientiousness. Barry and Finney (2016) used latent growth models to show that item durations, as well as responses to surveys asking students about their test engagement, were associated with conscientiousness. In earlier work, Barry and colleagues (2010) also showed that item durations are correlated with scores on constructs from the Big 5 personality characteristics, especially conscientiousness. Thus, the research of Barry and colleagues (2010, 2016) and Soland and colleagues (2019) provides consistent evidence that engagement on an achievement test is correlated with self-efficacy, conscientiousness, and self-management.
Beyond response times, research also considers behaviors on achievement tests like skipping questions on a test, especially when there is no guessing penalty. Hernández and Hershaff (2014) found that skipping questions on state standardized tests was associated with lower probabilities of high school graduation and college enrollment among students in Michigan. Their statistical model included controls for prior achievement.
Despite these emerging findings, there is also evidence that test metadata are not equally strongly associated with all SEL constructs. For example, Soland and colleagues (2019) also found that rapid guessing had low, nonsignificant correlations with other related constructs like growth mindset. Similarly, Soland (2019) found that the association between giving up on difficult items and SEL scores was much lower when the latter construct was academic motivation rather than self-efficacy (though the correlation was still statistically significant). Finally, Barry and colleagues (2010, 2016) found that item durations were less strongly correlated with other Big 5 measures like neuroticism than with conscientiousness. Thus, results on the association between test engagement metadata and SEL competencies appear fairly dependent on the SEL competency under consideration. Perhaps unsurprisingly, most significant findings involved SEL constructs like self-efficacy and self-management, both of which have been shown to predict persistence on academic tasks, especially difficult ones (e.g., Bandura, 1997).
Evidence on Metadata From Surveys
Tests of academic achievement are not the only assessments that students take in school. Increasingly, students are also given surveys to, for example, assess school climate, evaluate their teachers, or disclose personal information about themselves. Like achievement tests, surveys require more than basic literacy skills and cognitive ability to complete them. They also require that students engage and exert effort to respond to each item (Curran, 2015; Meade & Craig, 2012). According to Curran (2015), disengaged responding has been documented at rates ranging from 5% to 50% of collected surveys, depending on the context and detection method. In some cases, disengaged responding manifests itself when students skip survey items even when they have the requisite knowledge and understanding of the question to respond (Hitt et al., 2016). In other cases, students simply provide careless or inconsistent answers, such as when they repeatedly use only one response category on a Likert scale or select the same scale response category on two items measuring oppositional constructs, e.g., confidence in math and self-doubt in math (Curran, 2015; Hitt, 2016; Zamarro, Cheng et al., 2018).
These two behaviors, which we will call “item nonresponse” and “careless answering,” respectively, can be quantified and provide evidence on how engaged a student is on the survey. Item nonresponse rate is defined as the percentage of items skipped by a student out of the total number of items the student was supposed to answer on a survey (Hitt et al., 2016). Careless answering captures the prevalence of inconsistent answering on a survey for a student. One should note that inconsistent answering can also occur for intentional reasons like mischievous responding (Robinson-Cimpian, 2014), but we do not discuss those intentional occurrences here. Technical details for constructing this measure are described in Hitt (2016) and Zamarro, Cheng et al. (2018). Intuitively, responses to items that are a part of a scale designed to measure a single construct should be correlated with each other. The careless answering measure captures the extent to which the responses are uncorrelated, as in the case where a student always selects the first answer option even when doing so is logically inconsistent given the content of the survey.
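The two survey measures can be illustrated with a short sketch. The item nonresponse rate follows the definition given above. The inconsistency score, by contrast, is a deliberately simplified stand-in for careless answering: it is not the measure constructed in Hitt (2016) or Zamarro, Cheng et al. (2018), and the function names, the 5-point scale, and the use of reverse-coded items are assumptions made for illustration.

```python
from statistics import mean
from typing import Optional


def item_nonresponse_rate(responses: list[Optional[int]]) -> float:
    """Share of assigned items left unanswered (None marks a skipped item)."""
    if not responses:
        return 0.0
    return sum(r is None for r in responses) / len(responses)


def inconsistency_score(
    responses: list[Optional[int]],
    reverse_coded: set[int],
    scale_max: int = 5,
) -> float:
    """Mean absolute gap between each answered item and the mean of the
    student's other answers on the same scale, after reverse coding."""
    aligned = [
        (scale_max + 1 - r) if i in reverse_coded else r
        for i, r in enumerate(responses)
        if r is not None
    ]
    if len(aligned) < 2:
        return 0.0
    total = sum(aligned)
    gaps = [abs(v - (total - v) / (len(aligned) - 1)) for v in aligned]
    return mean(gaps)


# Hypothetical four-item scale in which the third item (index 2) is
# negatively worded, e.g., self-doubt in math on a confidence scale.
attentive = [4, 5, 2, 4]
straight_liner = [1, 1, 1, 1]  # always picks the first answer option

print(item_nonresponse_rate([3, None, 4, None]))
print(inconsistency_score(attentive, reverse_coded={2}))
print(inconsistency_score(straight_liner, reverse_coded={2}))
```

In this toy example, the student who always selects the first option scores as more inconsistent than the attentive student because the negatively worded item exposes the contradiction, which is the intuition described in the text.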
Both of these measures have been shown to generate information about SEL competencies related to conscientiousness and grit. Table 4 summarizes the research evidence. Students with higher item nonresponse rates or careless answering scores self-report lower levels of grit and conscientiousness (Cheng & Zamarro, 2018; Hedengren & Stratmann, 2012; Zamarro, Nichols et al., 2018). Partial correlations between these self-reported measures and measures of survey engagement are about 0.20. Although the correlations are not high, they are comparable in magnitude to correlations among SEL survey scores in other studies (Farrington et al., 2012; Soland et al., 2013; West et al., 2018). This relationship between careless answering and constructs like self-management and grit has also been observed in adulthood through a representative internet panel of American adults (Zamarro, Cheng et al., 2018).
As with measures of disengagement from achievement tests, survey item nonresponse rates and careless answering have also been found to be associated with outcomes like educational attainment, employment, and earnings, even after controlling for cognitive ability and demographic background characteristics (Hedengren & Stratmann, 2012; Zamarro, Nichols et al., 2018). Further, these correlations are not merely contemporaneous. Prior item nonresponse rates and careless answering from middle and high school have been found to predict these downstream life outcomes (Hitt, 2016; Hitt et al., 2016).
Despite the generally significant relationships between survey engagement metadata and relevant SEL concepts described above, we found three papers that showed limited associations. Zamarro, Cheng et al. (2018) found only small, though statistically significant, associations between survey item nonresponse and self-reported conscientiousness and grit in a nationally representative internet panel of American adults. The authors argued that this could have been an artifact of survey design, as item nonresponse is actively discouraged in this internet panel. Marcus and Schütz (2005) examined the relationship between item nonresponse and conscientiousness in a convenience sample of owners of personal Web pages but failed to find any significant association. Finally, Kassenboehmer and Schurer (2018), using data on adults from the Household, Income, and Labour Dynamics in Australia Survey, also found small associations between item nonresponse and self-reported conscientiousness. One should note, however, that these papers with the weakest evidence involved adult samples.
Potential Uses and Limitations of Assessment Engagement Metadata to Generate Data on Social-Emotional Competencies
Before concluding, we enumerate potential uses of assessment metadata to support SEL policy, practice, and research supported by findings from the studies reviewed. Such application is not without important limitations and cautionary notes, which we discuss as well.
Potential Uses of Assessment Engagement Metadata
In order to promote SEL, educators need to be able to measure related competencies. Without such data, educational stakeholders cannot tell if students’ SEL competencies are improving and whether programs to promote those competencies are working. Thus, establishing the relationship between assessment engagement metadata and the four intrapersonal constructs we examine has several practical benefits. We discuss them below.
First, such metadata can be used to help validate student scores from surveys (or other measures) of SEL competencies. Students are often unaware that computer-based assessments capture metadata like response times and proportions of omitted responses. Therefore, not only is self-report bias avoided, but there may also be a lower likelihood that students will behave differently due to awareness of the behavior being quantified. A measure with these properties can prove useful for scrutinizing self-report measures. For example, if a student reports high self-management or conscientiousness but rapidly guesses frequently on an achievement test or omits responses on a survey, then educators might worry about self-report bias. One of the most novel facets of this multiple-measures approach is that metadata from a survey can serve as a check against self-report bias on that same survey (Duckworth & Yeager, 2015). Thus, this approach is inexpensive because it does not require administration of another measure, which means districts can get additional SEL data from the assessment regimens they already have in place.
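As a rough illustration of this kind of cross-check, the sketch below lists students whose self-reported self-management is high even though their test metadata show frequent rapid guessing. The field names and the self-report cutoff are hypothetical; the 10% rapid-guessing cutoff simply echoes the threshold reported in Table 3 and is not a validated decision rule.

```python
def flag_possible_self_report_bias(
    students: list[dict],
    min_self_report: float = 4.0,     # assumed cutoff on a 1-5 survey scale
    rapid_guess_cutoff: float = 0.10, # share of items rapidly guessed
) -> list[str]:
    """Return IDs of students whose high self-reported self-management
    conflicts with a high rapid-guessing rate."""
    return [
        s["id"]
        for s in students
        if s["self_management"] >= min_self_report
        and s["rapid_guess_rate"] >= rapid_guess_cutoff
    ]


# Illustrative records only.
students = [
    {"id": "A", "self_management": 4.6, "rapid_guess_rate": 0.02},
    {"id": "B", "self_management": 4.4, "rapid_guess_rate": 0.25},
    {"id": "C", "self_management": 2.1, "rapid_guess_rate": 0.30},
]
print(flag_possible_self_report_bias(students))
```

A flag of this kind would at most prompt a closer look at a student's survey responses; it does not by itself establish that the self-report is biased.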
Second, assessment engagement metadata may also be useful to administrators and teachers by supplementing datasets that do not have SEL scores. That is, practitioners could benefit by gaining a proxy for certain SEL constructs if a district or school does not offer a survey (Soland et al., 2019; Zamarro, Hitt, & Mendez, 2016). Even in the event a school system does measure SEL through a survey or other instrument, those measures are often administered no more than yearly. Such districts could gain additional SEL data between survey administrations by relying on metadata from other assessments. There may be similar benefits for researchers: Many large publicly available datasets do not include scores from SEL measures despite the fact that social-emotional data might support useful research with the dataset (Cheng & Hitt, 2018; Cheng & Zamarro, 2018; Zamarro et al., 2016).
Third, assessment behavior metadata could provide early warning indicators that a student is at risk of academic disengagement. Low academic engagement is associated with reduced educational attainment, including failing to complete high school (Farrington et al., 2012). The early warning literature typically highlights other behaviors like suspensions or absenteeism when trying to identify disengaged students (Allensworth, 2013). Research shows that assessment disengagement behaviors are similar to behaviors in the early warning indicator literature suggesting academic disengagement (Hitt et al., 2016; Soland et al., 2019). Therefore, associated metadata may provide another behavioral early warning indicator of whether a student is academically disengaged and potentially at risk of dropping out. Given how often students are tested in schools currently, these metadata are captured quite frequently, which could also increase their value.
Finally, metadata may be helpful when trying to avoid some of the cultural biases articulated by Gregory and Fergus (2017). For example, students are often unaware metadata are being collected during an assessment, potentially reducing the likelihood that students are reading certain cultural cues and performing to them while the data are produced. We discuss this issue more in the section on future directions for this line of research.
Limitations of Assessment Engagement Metadata
Despite these potential benefits, there are key limitations to using assessment metadata in this fashion. We discuss four of them. First, positive associations between metadata and SEL were found for some, but not all, SEL constructs. The constructs that were significantly correlated with metadata (Table 2) generally related to student attitudes and perceptions about their own academic performance. Yet, even related intrapersonal constructs like growth mindset and instrumental motivation were less strongly associated with metadata. Therefore, results seem dependent on the particular SEL construct in question.
Second, much more validity evidence would need to be collected to argue that these behavioral indicators are actually measures of constructs like self-management, and even then there might be too many confounding factors. As one example, response time metadata can be impacted by context, including constraints that schools or districts place on tests (e.g., when in the day they are administered), which could change behaviors in ways irrelevant to the construct of interest (Wise, 2015). As another, as with performance assessments, the highly standardized nature of the activity may limit generalizability of those behaviors to broader settings.
Beyond contextual factors, there are also related constructs that are correlated with the SEL constructs we mention. Thus, one cannot be sure that the metadata are actually measuring the SEL constructs of interest versus another one. For instance, some of the research discussed in our study examines the connection between metadata and conscientiousness. However, conscientiousness is correlated with obedience (Damian, Negru-Subtirica, Pop, & Baban, 2016) and survey effort measures exhibit some correlation with other traits like neuroticism (Cheng, Zamarro, & Orriens, 2018; Hedengren & Stratmann, 2012; Zamarro, Cheng et al., 2018). One could imagine that the metadata are actually measuring those other constructs, not the four intrapersonal ones in Table 2. Although the emerging literature on metadata and SEL is promising, until more validity evidence is collected that enables better discernment between alternative hypotheses for what metadata measure, one might be safer thinking of those metadata as crude proxies for SEL competencies like conscientiousness rather than as valid measures.
Third, there may not be straightforward ways to reconcile discrepant results from surveys and metadata. For instance, a student may report low self-management yet rapidly guess infrequently, if at all. More still needs to be learned about cross-classification rates between measures. Put differently, assessment disengagement behavior may be insufficient on its own to establish issues with self-management or conscientiousness. At best, one would imagine that such metadata could be part of a multiple-measures approach to assessing SEL.
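To make the idea of cross-classification concrete, the sketch below tabulates how often two measures place the same students in the same category. It assumes, purely for illustration, that each measure has already been reduced to a low versus not-low classification; the function names and example data are hypothetical.

```python
from collections import Counter


def cross_classify(survey_low, metadata_low):
    """Count students in each cell of a 2x2 agreement table.

    survey_low and metadata_low are equal-length lists of booleans:
    True means the measure classifies the student as low on the construct.
    """
    if len(survey_low) != len(metadata_low):
        raise ValueError("inputs must have the same length")
    return Counter(zip(survey_low, metadata_low))


def agreement_rate(table):
    """Share of students classified the same way by both measures."""
    total = sum(table.values())
    agree = table[(True, True)] + table[(False, False)]
    return agree / total if total else float("nan")


# Hypothetical classifications for five students.
survey_low = [True, True, False, False, False]
metadata_low = [True, False, False, False, True]

table = cross_classify(survey_low, metadata_low)
print(table[(True, False)])   # students low on the survey but not on metadata
print(agreement_rate(table))  # share of students on whom the measures agree
```

The off-diagonal cells, such as students who report low self-management yet rarely guess rapidly, are the discrepant cases that currently lack a clear interpretation.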
Fourth, assessment disengagement metadata are not especially helpful in an accountability context because, like survey scores, they can be easily gamed. As Campbell’s Law suggests, the more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures, a finding that has played out in educational testing under various accountability regimens (Nichols & Berliner, 2007). Even if educators and students did not know exactly how assessment metadata were being used, a general awareness of test metadata being used for accountability could incentivize perverse activities. For instance, educators might coach their students to spend long amounts of time on items or even bubble in items their students left blank (Jones, 2011). Although such responses to the inclusion of metadata in accountability systems may not occur, there is a strong argument to be made that assessment engagement metadata should be used primarily for low-stakes purposes among educators, policymakers, and researchers. As discussed in the previous section, these uses might include validating individual survey scores, supplying data on SEL when none are otherwise available to inform practice and school improvement, and serving as early warning indicators that a student may be academically disengaged.
Conclusion and Future Research
There is increasing evidence that to succeed in life students need to leave school with more than knowledge of academic subject matter (Dweck et al., 2011). We reviewed a growing body of research investigating whether metadata captured when students take achievement tests or surveys can provide insight into intrapersonal SEL constructs related to academic disengagement. Although not true of all SEL constructs reviewed, we generally found that metadata related to assessment engagement can provide useful data on self-efficacy, self-management, conscientiousness, and grit (Hitt, 2016; Hitt et al., 2016; Soland, 2019; Soland et al., 2019; Zamarro, Cheng et al., 2018; Zamarro et al., 2016; Zamarro, Nichols et al., 2018). Studies suggest these assessment engagement metadata may be beneficial as a check against self-report bias on SEL surveys of those four constructs, to supplement SEL data used in practice when available data are sparse or nonexistent, and to serve as early warning indicators that a student may have begun to disengage academically. This literature review suggests there is likely a validity argument to be made for using metadata as a crude proxy for certain SEL constructs, which may be useful to educators as they try to foster SEL. These assessment engagement behaviors are somewhat like a measurement version of natural experiments in economics, which are not designed to randomize students but allow for related inferences anyway.
However, there is still validity evidence that needs to be accrued before relying on such metadata as proxies for associated constructs. For example, we primarily reviewed concurrent validity evidence related to uses of metadata as proxies for SEL competencies. Most studies we reviewed investigated associations between assessment metadata and scores from surveys of various SEL constructs. As pointed out previously, these surveys can suffer from many sources of bias (e.g., Duckworth & Yeager, 2015), which the studies we reviewed do not necessarily address. Therefore, correlations among metadata and SEL survey scores may have been inflated or deflated by biases in those survey scores. For example, in related literature, Kraft (2019) and Gershenson (2016) found that teachers who improve test scores do not necessarily improve self-reported measures of SEL and other related outcomes. Cheng and Zamarro (2018) similarly found that measures of teacher survey effort capture important dimensions of teacher quality through their correlation with classroom-observation measures and principal ratings but did not find significant associations with measures of teacher quality through contributions to standardized test scores. Both results could be due to the self-report nature of the SEL measures used. Given these findings, more work is needed to replicate results using non-self-report SEL measures.
Another piece of validity evidence that should be collected is whether metadata have the potential to reduce cultural biases outlined by the likes of Gregory and Fergus (2017). On one hand, the fact that students are often unaware the metadata are being captured could help with issues related to stereotype threat (Steele & Aronson, 1995). On the other, one might also imagine that these biases would remain if the achievement tests or surveys the metadata come from are based on particular cultural frames of reference. Such a finding has some precedent, given research on stereotype threat (Steele & Aronson, 1995) and cultural and linguistic bias on certain achievement tests (Abedi, 2002). Further, as Gregory and Fergus (2017) point out, measures like metadata do not account for ecological conditions, like the quality of instruction, classroom management strategies, and other contextual factors that may contribute to how hard students are willing to try on an assessment. Initial research already suggests that rates of test disengagement vary considerably by race and biological sex, which may be due to the factors described by Gregory and Fergus (2017), and likely means that achievement gaps based on observed test scores can be biased by differential test engagement (Soland, 2018a, 2018b). Research should investigate these issues as they relate to metadata as proxies for SEL constructs.
Beyond these issues of cultural, racial, and linguistic bias, each potential use of metadata to support SEL in a schooling context should be supported with additional validity evidence. For example, research should further explore how well assessment engagement metadata perform as early warning indicators of dropout relative to more established indicators like chronic absenteeism. Studies might also consider whether inferences about student progress and program effectiveness related to SEL are consistent when metadata are used versus self-report surveys. Such validation work would likely benefit from being conducted in concert with educators who use SEL data to support their practice on a regular basis.
Ultimately, our review of the literature reveals a field still in its infancy. Although there is a growing body of research providing evidence that assessment engagement metadata are correlated with scores from SEL surveys, that concurrent validity evidence still relies on nontrivial assumptions about exactly what those metadata are measuring and should be supported with other types of validity evidence related to their specific intended uses. In the meantime, however, educators, policymakers, and researchers may wish to incorporate such metadata into multiple-measures approaches to monitor the SEL needs of students, especially when more established measures of associated constructs are not available.
Appendix
Hits by Combined Search Term for General SEL Terms
| Engagement Search Term (rows) by SEL Search Term (columns) | Social-Emotional* | Socio-Emotional* | Intrapersonal | Big 5 |
|---|---|---|---|---|
| Assessment metadata | 2 | 0 | 1 | 0 |
| Test metadata | 2 | 0 | 0 | 0 |
| Survey metadata | 1 | 0 | 9 | 1 |
| Item response time | 12 | 4 | 6 | 2 |
| Survey nonresponse | 125 | 81 | 99 | 1,260 |
| Assessment engagement | 74 | 24 | 42 | 1 |
| Test engagement | 38 | 4 | 25 | 1 |
| Survey engagement | 18 | 2 | 29 | 2 |
Note. We also searched for “metadata” synonyms like “paradata.”
