Discovery With Models

Abstract

In recent years, an increasing number of analyses in learning analytics and educational data mining (EDM) have adopted a “discovery with models” approach, where an existing model is used as a key component in a new EDM or analytics analysis. This article presents a theoretical discussion on the emergence of discovery with models, its potential to enhance research on learning and learners, and key lessons learned in how discovery with models can be conducted validly and effectively. We illustrate these issues through discussion of a case study where discovery with models was used to investigate a form of disengaged behavior (i.e., carelessness) in the context of middle school computer-based science inquiry. This behavior was acknowledged as a problem in education as early as the 1920s. With the increasing use of high-stakes testing, the cost of student carelessness can be higher. For instance, within computer-based learning environments, careless errors can result in reduced educational effectiveness, with students continuing to receive material they have already mastered. Despite the importance of this problem, it has received minimal research attention, in part because of difficulties in operationalizing carelessness as a construct. Building from theory on carelessness and a Bayesian framework for knowledge modeling, we use machine-learned detectors to predict carelessness within authentic use of a computer-based learning environment. We then use a discovery with models approach to link these validated carelessness measures to survey data to study the correlations between the prevalence of carelessness and student goal orientation.

Keywords

discovery with models learning analytics educational data mining carelessness science inquiry goal orientation

The recent explosion of student interaction data derived from educational software has resulted in a proliferation of models developed through educational data mining (Baker & Yacef, 2009; Romero & Ventura, 2007, 2010) and learning analytics (Siemens & Long, 2011) techniques. These relatively new fields provide analytic techniques to operationalize a range of learner behaviors, attributes, and states during the learning process. In recent years, these models have proven to be useful for providing formative data to instructors and for the design and implementation of automated learning interventions (cf. Arnold, 2010; Arroyo, Woolf, Cooper, Burleson, & Muldner, 2011; Militello & Heffernan, 2009). Increasingly, the establishment of these models has also become a powerful tool for making scientific discoveries about learning and learners, through studying the contexts in which a learning-related behavior occurs or construct emerges, and through examining its relationships with other constructs. Baker and Yacef (2009) termed the use of an existing EDM or analytics model as a component in a new EDM or analytics analysis “discovery with models” and noted its increasing prevalence in EDM research. Discovery with models has been used to study a range of constructs, including help-seeking strategies (Aleven, Mclaren, Roll, & Koedinger, 2006), off-task behavior (Baker, 2007b; Baker & Gowda, 2010), patterns of usage of online resources (Jeong & Biswas, 2008; Kinnebrew, Biswas, & Sulcer, 2010), and social networks (Dawson, Macfadyen, Lockyer, & Mazzochi-Jones, 2011).

In this article, we illustrate discovery with models methods drawing on examples involving two constructs: gaming the system (Baker, 2007b; Baker, Corbett, & Koedinger, 2004; Baker & Gowda, 2010; Baker, Walonoski, et al., 2008; Muldner, Burleson, Van de Sande, & VanLehn, 2011) and carelessness (cf. Georges, 1929; McClure, 1929). Gaming the system is a student’s behavior aiming at succeeding in an educational task by systematically taking advantage of properties and regularities in the system used to complete that task, rather than by thinking through the material (e.g., by systematically guessing answers to obtain the correct response or repeatedly requesting hints until the software presents the answer; Baker et al., 2004). The second construct, carelessness, refers to incorrect answers given by a student on material that the student should be able to answer correctly (Newman, 1977). Careless errors are possibly the result of answering impulsively or with insufficient effort (cf. Rodríguez-Fornells & Maydeu-Olivares, 2000).

The application of discovery with models involves two main phases. First, a model of a construct is developed using machine learning or knowledge engineering techniques and is then validated, as discussed below. Second, this validated model is applied to data and used as a component in another analysis: for example, for identifying outliers through model predictions; examining which variables best predict the modeled construct; finding relationships between the construct and other variables using correlations, predictions, associations rules, causal relationships, or other methods; or studying the contexts where the construct occurs, including its prevalence across domains, systems, or populations. For example, in Baker, Walonoski, et al. (2008), data from a validated machine-learned model of gaming the system were applied to data from an online learning system and used in a correlational analysis, in combination with survey data, to determine which of a range of motivational and attitudinal variables were associated with gaming.

One essential question to pose prior to a discovery with model analysis is whether the model adopted is valid, both overall and for the specific situation in which it is being used. Ideally, a model should be validated using an approach such as cross validation, where the model is repeatedly trained on one portion of the data and tested on a different portion, with model predictions compared to appropriate external measures, for example, assessments made by humans with acceptably high interrater reliability, such as field observations of student behavior for gaming the system (cf. Baker et al., 2004; Baker, Corbett, Roll, & Koedinger, 2008), or future student performance. Validating using measures external to the model itself increases the confidence that the model is genuinely assessing what it was intended to measure. By testing generalizability to unseen data, one can ensure that a model will be usable in new contexts. When testing a model’s generalizability using an approach such as cross validation, it is important to divide the data into training and test folds at the appropriate level. For example, to ensure that a model can be validly applied to new students, the model should be cross-validated at the student level; for example, each student appears in either the training or the test set at a specific time. Even after validating in this fashion, validity should be reconsidered if the model is used for a substantially different population or context than was used when developing the model.

An alternative approach is to use a simpler knowledge-engineered definition, rationally deriving a function or rule that is then applied to the data. In this case, the model can be inferred to have face validity. However, knowledge-engineered models often produce different results than do machine learning-based models, for example, in the case of gaming the system. Research studying whether student or content is a better predictor of gaming the system identified different results, depending on which model was applied (cf. Baker, 2007a; Gong, Beck, Heffernan, & Forbes-Summers, 2010; Muldner et al., 2011). Of course, knowledge-engineered and machine-learned models do not always disagree; for instance, both types of models have shown similar relationships between gaming and learning or motivation–attitude (Aleven et al., 2006; Baker, Walonoski, et al., 2008; Cocea, Hershkovitz, & Baker, 2009). When differences emerge, it is often because knowledge-engineered models and machine-learned models do not quite predict the same constructs: There is a trade-off between the knowledge-engineered models’ clarity and the machine learning models’ precise prediction, which is based on finding complex and unexpected relationships.

Discovery with models has several advantages as a method. First, it allows for scientific investigation to be made using definitions that can be discussed and inspected, so long as models used in this fashion are published. Second, discovery with models leaves clear data trails that can be reinspected later; for example, Muldner and colleagues (2011) disagreed with Baker’s (2007a) definition of gaming the system and ran their alternate model on the same data set to compare the two models’ predictions in detail. Third, some constructs that are difficult for humans to label by hand (such as latent student knowledge) can be modeled and then studied. Fourth, discovery with models enables analyses to occur at scales or for conditions that are otherwise infeasible. Collecting human labels of constructs such as the ones above can be very time-consuming; but once a model is developed, it can be quickly applied at a wide scale on additional data. For example, Baker and Gowda (2010) analyzed gaming the system and off-task behaviors across an entire year of data from three schools, representing approximately 1 million data points.

The potential of discovery with models methods for studying learning and engagement in computer-based learning can be seen in the context of researching gaming the system (Baker et al., 2004), where a range of research that would be close to intractable to conduct through other methods has been possible for this construct. Models for gaming the system have been developed by multiple groups using both machine learning (Baker, Corbett, Roll, et al., 2008; Baker et al., 2004; Baker, Mitrovic, & Mathews, 2010; Beal, Qu, & Lee, 2008; Walonoski & Heffernan, 2006) and knowledge engineering approaches (Aleven et al., 2006; Beck, 2005; Johns & Woolf, 2006; Muldner et al., 2011). These models have then been used to study how the prevalence of gaming changes during the year (Beck, 2005), what the relationships are between a range of motivational and attitudinal variables and gaming (Baker, Walonoski, et al., 2008; Beal et al., 2008), the mechanisms through which gaming affects learning (Cocea et al., 2009), whether gaming is more determined by the student or the content (Baker, 2007a; Gong et al., 2010; Muldner et al., 2011), and the degree to which students in different settings game the system to different degrees (Baker & Gowda, 2010).

A Case Study on Discovery With Models: Carelessness and Student Motivation

In this article, we demonstrate the potential of discovery with models for contributing to theory on learning through its use in researching carelessness. Despite accounts arguing for its prevalence and importance for almost a century (cf. Georges, 1929; McClure, 1929), carelessness has been lightly researched, in part because of the complexity of operationalizing it. Generally, carelessness is described as giving the wrong answer despite having the needed skills for answering correctly. Better understanding the motivations that lead students to engage in careless behavior may lead to interventions that can reduce carelessness’s frequency or mitigate its negative effects.

Prior research has inferred carelessness based on an individual’s performance on repeated assessments (Clements, 1982; Newman, 1977), observations and interviews (Georges, 1929), and self-report measures, or teachers’ judgments, that treat carelessness as a stable learner characteristic (Hurlock & McDonald, 1934; Maydeu-Olivares & D’Zurilla, 1996; McClure, 1929; Rodríguez-Fornells & Maydeu-Olivares, 2000). In the behavioral assessment of carelessness used in Clements (1982) and Newman (1977), the student is administered the same item multiple times. This method is time-consuming and requires a special assessment outside of the regular learning task. It also may not be representative of carelessness occurring in other contexts: Students might get bored of answering the same item, potentially inducing carelessness that would not have occurred otherwise.

Treating carelessness as a stable individual difference, besides reducing opportunities to understand the role of context in carelessness, is prone to demand effects (participants may perceive that there is an expectation of them to behave in a certain fashion) and self-presentation effects (answers given are influenced by research respondents’ desire to present themselves positively).

However, recent work in data mining and machine learning has made it possible to study carelessness within specific computer-based learning tasks, without modifying the learning task, using automated carelessness detectors (cf. San Pedro, Baker, & Rodrigo, 2011) and discovery with models methods. We utilize this approach to study the relationships between carelessness and students’ motivation, in particular their self-identified learning goals. In particular, we study the correlations between the prevalence of carelessness and student goals, and how student carelessness changed over time.

These issues are studied in the context of students engaging in scientific inquiry in a computer-based learning environment, Science Assistments (www.scienceassistments.org; Sao Pedro, Baker, Gobert, Montalvo, & Nakama, in press). In this context, carelessness is defined as failing to demonstrate a certain inquiry-related skill (specifically, control for variables strategy and hypothesis testing) despite knowing the skill (as measured by a Bayesian model of student knowledge). Carelessness is modeled by a machine-learned decision tree, which is used for discovering relationships between carelessness and student attributes (goals, beliefs, and motivation).

Method

Participants

Participants were 148 eighth grade students, 12 to 14 years old, from a public middle school in suburban Massachusetts. This school is majority White, with about 30% of the students from low-income families and 30% on a free or reduced-price lunch program. The town where the school is located has a median income much higher than the U.S. median, and fewer than 2% of the families in the town are below poverty level (compared to about 10% in the United States). School achievement in science is slightly lower than the state average, with 35% of the students categorized as proficient or higher on the state standardized exam, the MCAS (state average = 40%). Students belonged to one of six class sections and had one of two science teachers. Students had no previous experience using microworlds with Science Assistments.

Materials and Design

Phase Change Microworld

The study was conducted in the context of students’ scientific inquiry within a physical science “microworld,” a virtual laboratory or simulation in which the student can run guided scientific experiments (cf. Sao Pedro et al., in press). In this study, we used a phase change microworld, where students explore how a particular substance, such as water, changes phases from solid to liquid to gas as it is heated. Each task in the microworld requires students to conduct experiments to determine if a particular independent variable (container size, heat level, substance amount, and cover status) affects various outcomes (melting point, boiling point, time to melt, and time to boil). For a given independent variable, students demonstrated proficiency by hypothesizing, collecting data, reasoning with tables and graphs, analyzing data, and communicating their findings. For this microworld, automated detectors of two key scientific inquiry skills—designing a controlled experiment and testing the stated hypothesis—were developed and validated (Sao Pedro et al., in press).

Motivation and Goal Orientation Measures

Students completed the Patterns of Adaptive Learning Scales (PALS) survey (Midgley et al., 2000) in class as baseline data collected at the beginning of the school year, 3 months prior to using the phase change microworld. We analyze data from two scales capturing students’ goal orientation and their beliefs and attitudes toward learning. The first is the Personal Achievement Goal Orientation (PALS1), including learning goal orientation, the goal of developing skill or learning (five items), performance–approach goal orientation, the goal of demonstrating competence (five items), and performance–avoid goal orientation, the goal of avoiding demonstrating incompetence (four items). The second scale measures Academic-Related Perceptions, Beliefs, and Strategies (PALS4), including academic efficacy (five items), avoiding novelty (five items), disruptive behavior (five items), self-presentation of low achievement, the desire to prevent peers from knowing how well the student is performing (seven items), and skepticism about the relevance of school for future success (six items). Each item has a 5-point Likert-type scale, and each of the above eight subscales was calculated using the mean of each student’s answers to the items for that subscale.

Procedure

After receiving a short introduction on using the tutor and on the activity, all students were engaged in the phase change learning activities over two class periods, about 70 minutes in total. During this time, Science Assistments logged all students’ interactions within the learning environment. In the following sections, we show how we used these fine-grained interaction data to construct and validate machine-learned detectors of carelessness, based on existing detectors of systematic data collection behavior (Sao Pedro, Baker, Montalvo, Nakama, & Gobert, 2010; Sao Pedro et al., in press). Next, we analyze carelessness and its manifestation over consecutive activities, with regard to differences between students in motivation and goal orientation.

Carelessness Model Development

The process of building an automated detector of carelessness involves several steps. In the final step, a detector of carelessness is developed using machine learning. To reach this step, we compute training labels for carelessness, developed based on (a) information on student correctness over time (e.g., whether the student succeeds or fails to demonstrate inquiry skills) and (b) information from an automated assessment of student knowledge, called Bayesian knowledge tracing (BKT; see Corbett & Anderson, 1995). These labels of correctness–incorrectness are developed by aggregating students’ actions in Science Assistments into “clips” (discussed next), which are labeled by machine-learned models of inquiry skill(s). These machine-learned models of inquiry skill(s) are derived from hand-coded assessments of correct or incorrect application of science inquiry skills. In the following sections, we discuss each of these steps.

Moving From Fine-Grained Log Files to Coarse-Grained Clips

As students used Science Assistments, each individual’s actions within the microworld were logged at a fine-grained level, with a focus on student actions while experimenting. Each action’s type, current and previous values (where applicable—e.g., an independent variable’s value before and after the action), and time stamp were recorded. In all, 27,257 relevant actions were logged, from 148 students. These data served as the basis for generating representations of student behavior that could be coded by humans in terms of inquiry skill, and then for generating machine-learned models of these inquiry skills.

Student behavior was segmented into a set of 1,503 clips, each of which consisted of all student experimentation between the time the student entered and left the experimentation phase of the microworld. (Students could have multiple clips for a specific activity if they chose to exit and reenter experimentation, either to create new hypotheses or collect more data.)

These clips were formatted as “text replays” (cf. Baker, Corbett, & Wagner, 2006) and labeled by two human coders based on whether a student demonstrated scientific inquiry skill during the scope of that clip (a binary variable). For example, one indication of skill in designing controlled experiments is the use of the control of variables strategy, a method for creating experiments in which the value of a single variable is changed between consecutive steps (Chen & Klahr, 1999). However, other methods that generate unconfounded comparisons were also treated as evidence of skill in designing controlled experiments, for example, modifying three variables and then changing two variables. The coders separately coded a second skill, whether the student tested his or her stated hypothesis (Collins & Stevens, 1991). The coders trained together on a set of clips, discussing their definitions of the construct based on specific examples and past literature (cf. Chen & Klahr, 1999; Collins & Stevens, 1991). They then coded clips separately to establish interrater reliability. The corpus of hand-coded clips contained exactly one randomly selected clip from each problem each student encountered, resulting in 571 clips; the two coders each coded the same subset of clips and attained acceptable interrater reliability (κ = .69 for designing controlled experiments, κ = 1.00 for testing stated hypothesis).

The labels were then used as a basis for developing machine-learned models of these two inquiry skills (Sao Pedro et al., 2010; Sao Pedro et al., in press). The detectors produced were able to distinguish each of these behaviors 85% of the time (i.e., area under the curve [AUC] = 0.85 for each skill). Afterward, a BKT model (Corbett & Anderson, 1995) was created to predict a student’s latent knowledge of these skills. The BKT models were able to predict future correctness on these skills more than 70% of the time (AUC = 0.74, 0.79).

Feature Distillation

Each clip had a set of 73 features extracted for the machine-learning process (cf. Sao Pedro et al., 2010), including the numbers of different types of actions that occurred during the clip (including the number of complete and student-interrupted trials and the number of variable changes made while designing each experiment), the timing of each action (including the average time per variable change and the maximum time the student spent studying the simulation), and the probability that the student knew the skill involved with the relevant problem set before his or her first attempt on action N, P(Ln-1), calculated using BKT (Corbett & Anderson, 1995).

Carelessness Detector

Our operational measure of carelessness is based on a detector of whether an incorrect answer was a result of a lack of knowledge or not, based on features of the student response, previously termed contextual slip (cf. Baker, Corbett, & Aleven, 2008). This model was previously used to estimate carelessness in intelligent tutors for mathematics (Baker & Gowda, 2010; San Pedro et al., 2011). The term slip is used within multiple student modeling approaches, in both the Bayesian modeling literature and the psychometrics literature, to denote when the student makes an error despite knowing the requisite skills for correct performance (Corbett & Anderson, 1995; De La Torre & Douglas, 2004; Junker & Sijtsma, 2001; Morgan, 1979). This definition is essentially the same definition used for carelessness by Newman (1977). Although slipping may not be a perfect indicator of carelessness (as a student could still obtain the correct answer despite being careless, if Maydeu-Olivares’s definition is used rather than Newman’s), there are relatively few explanations for student errors on well-known skills, thereby making slipping a sound indicator of carelessness, if not a complete one. Slips can also imply a poor knowledge model (an issue studied by De La Torre & Douglas, 2004). We circumvent this possibility by studying slipping within a learning system that has a validated knowledge model to accurately predict future correctness 74% of the time (cf. Sao Pedro et al., in press), a level of correctness within the range of the current state of the art (cf. Pardos, Baker, Gowda, & Heffernan, 2011). Within our approach, training labels are computed by taking the probability that the student knew the skill prior to the attempt (computed using BKT) and integrating this probability with data on future correctness on the next two problem solving attempts, using derivatives of Bayes’s theorem (the full mathematics for this process is outlined in Baker, Corbett, & Aleven, 2008). Next, machine learning is used to develop detectors that predict carelessness without using future data.

A detector predicting carelessness was developed using the REPTree (Reduced Error Pruning Tree) algorithm within the Weka extension package (Witten, Frank, & Hall, 2011) in RapidMiner 5.0 (Mierswa, Wurst, Klinkenberg, Scholz, & Euler, 2006). REPTree is a fast algorithm for constructing a decision tree for predicting the value of a numerical variable based on a set of other variables. Sixfold cross validation was conducted at the student level (i.e., the detector is trained on five groups of students and tested on the sixth group) to assess whether the detector would remain accurate for new groups of students. An alternate algorithm, linear regression, was found to have systematic patterns in the model residuals for subsets of the data, which suggested that a nonlinear model might be more effective. Indeed, REPTree outperformed linear regression, achieving a better cross-validated correlation (r = .62 for REPTree, r = .56 for linear regression) and root mean square error (RMSE = .16 for REPTree, RMSE = .18 for linear regression).

The resulting REPTree had a size of 13 (including both interior nodes—decision points—and leaves—linear regression models) and a total depth of 6. Overall, the model used three variables:

P(Ln-1) the probability the student knew the inquiry skill before the action

Cumulated count of conducting two consecutive trials that vary in terms of only one variable

Sum of timing for variable changes while constructing hypotheses

A prediction about carelessness was made for each clip, and then each student’s predicted carelessness was computed by taking the average values over all of that student’s clips. Overall, the mean carelessness probability across the studied clips (N = 1,282) was 0.08 (SD = 0.14), which means that 8% of the errors across clips were detected as careless errors. When averaged across students, the mean carelessness was 0.12 (SD = 0.16), which indicates that, on average, 12% of students’ errors were detected as careless errors. This value represents considerably lower levels of carelessness than were seen in mathematics intelligence tutors or mathematics tests, in which carelessness levels were between 12% and 50% (Baker & Gowda, 2010; Casey, 1979; Clarkson, 1983; Clements, 1980; Newman, 1977; San Pedro et al., 2011). Further exploration is needed to understand this difference.

Findings

The Relationship Between Carelessness and Motivational Measures

To examine the relationships between careless and motivation, we calculated the correlations between carelessness and the eight subscales of the PALS discussed above; the results are summarized in Table 1. Two of the subscales were marginally significantly correlated with carelessness:

Carelessness was marginally negatively correlated with disruptive behavior, r = –.15, F(1, 128) = 3.02, p = .08; that is, more careless students tended to report less tendency to engage in disruptive behavior.

Carelessness was marginally negatively correlated with self-presentation of low achievement, r = –.16, F(1, 128) = 3.45, p = .07; that is, those with higher scores on carelessness tended to have lower scores on self-presentation of low achievement.

Table 1.

Correlation Coefficients Between Carelessness and Patterns of Adaptive Learning Scales (PALS) Measures (N = 130).

PALS measure	r
Learning goal orientation	.08
Performance–approach goal orientation	.04
Performance–avoid goal orientation	.001
Academic efficacy	.14
Avoiding novelty	−.13
Disruptive behavior	−.15^a
Self-presentation of low achievement	−.16^b
Skepticism about school relevance for future success	−.12

p = .08.

p = .07.

Both disruptive behavior and self-presentation of low achievement have been previously shown to be related to low achievement (Ketter, 2006). Hence, one possible interpretation of the negative correlation between these constructs and carelessness is that these constructs lead to students making incorrect answers resulting from insufficient inquiry skill, instead of carelessness. We present an analysis of this possibility later in the article.

As seen in Table 1, none of the three goal orientations assessed by PALS were statistically significantly correlated with carelessness: r = .08 for learning goals, F(1, 128) = 0.68, p = .36; r = .04 for performance–approach goals, F(1, 128) = 0.20, p = .65; and r = .001 for performance–avoidance goals, F(1, 128) = 0.00, p = .995. However, it may be that a straightforward linear correlation does not capture the relationships between goal orientation and carelessness. Thus, to explore this possibility, in the following section we examine whether there are more complex relationships between these constructs. Such complex relationships might be expressed when subsets of variables are considered (as opposed to single-variable-based calculations).

Relationships Between Carelessness and PALS-Based Clusters

We can investigate the relationship between motivation and carelessness by looking at whether combinations of student motivations can predict carelessness. To analyze this, we conducted analyses to search for clusters of students who were similar in terms of responses on multiple motivational and goal measures and studied whether there were differences in carelessness between these groups. This was done using cluster analysis (Everitt, Landau, Leese, & Stahl, 2011), where data points (in this case students) are assigned to data subsets (called clusters) based on intergroup similarities.

Based on a log-likelihood distance measure and the Bayesian information criterion (BIC; Raftery, 1995), using two clusters provided the optimal explanation of the data (BIC = 705.6). However, using three clusters led to only minor degradation in the BIC value (BIC = 711.9). The set of three clusters led to more interesting separations between aspects of the PALS (using two clusters essentially merged Clusters 2 and 3); therefore, we analyze the three-cluster model, as shown in Table 2.

Table 2.

Center Values (means) of Each Construct for the Clusters.

Variable	M (SD)			Difference between groups F(2, 118)	η² (between-group effect size)
Variable	Cluster 1	Cluster 2	Cluster 3	Difference between groups F(2, 118)	η² (between-group effect size)
Learning goal orientation	4.66 (0.40)	4.38 (0.64)	2.07 (0.87)	46.49**	.44
Performance–approach goal orientation	1.69 (0.57)	3.20 (1.04)	2.40 (0.82)	36.14**	.38
Performance–avoid goal orientation	1.86 (0.72)	3.78 (0.67)	3.62 (0.68)	92.55**	.61
Academic efficacy	4.41 (0.49)	4.22 (0.55)	3.65 (1.06)	13.13**	.18
Avoiding novelty	1.96 (0.60)	2.58 (1.00)	3.02 (1.21)	21.76**	.27
Disruptive behavior	1.54 (0.68)	1.61 (0.68)	2.07 (1.01)	27.57**	.32
Self-presentation of low achievement	1.33 (0.31)	1.59 (0.60)	3.43 (1.00)	8.93**	.13
Skepticism about the relevance of school for future success	1.57 (0.49)	1.92 (0.82)	2.07 (0.87)	39.17**	.40
N	35	66	20
Carelessness	0.16 (0.22)	0.12 (0.13)	0.05 (0.05)

Shaded cells represent the cluster with the highest value for the current variable.

p < .01.

Cluster 1 (n = 35) includes students who have high values for both learning goal orientation and academic efficacy and can be referred to as the learning goal orientation cluster. Cluster 2 (n = 66) includes students with (relatively) high values for both performance–approach and performance–avoid goal orientations; these two types of goal orientation are often possessed by the same students (Bong, 2001; Darnon, Harackiewicz, Butera, Mugny, & Quiamzade, 2007; Elliot & Church, 1997; Middleton & Midgley, 1997; Ross, Shannon, Salisbury-Glennon, & Guarino, 2002). Though Cluster 2 also has high values for learning goals, its values for performance goals are substantially higher than the values of the other clusters. We refer to Cluster 2 as the performance goal orientation cluster. Finally, Cluster 3 (n = 20) includes students with high values for avoiding novelty, disruptive behavior, self-presentation of low achievement, and skepticism about the relevance of school for future success. Cluster 3’s values for performance–avoidance goals are also almost as high as Cluster 2’s values, but the values for learning goals and performance–approach goals are substantially lower. As such, we refer to this cluster as having a lower tendency of either goal orientation.

The proportion of carelessness was marginally significantly different across the three clusters, F(2, 118) = 3.05, p = .051. The mean carelessness in Cluster 3 (5%) was significantly lower than the mean carelessness in Cluster 2 (12%), t(82.3) = 3.45, p < .01, and in Cluster 1 (16%), t(38.9) = 2.8, p < .01. No significant differences were found between Clusters 1 and 2, t(47.1) = 1.0, p = .32. In all cases, the F of Levene’s test of equality of variances was significant at p < .05, hence equal variances were not assumed.

Taken together, our results suggest that students with strong goal orientation toward learning (Cluster 1) or performance (Cluster 2) were on average twice as careless as those demonstrating a lower tendency to either type of goal orientation (Cluster 3)—a surprising finding. One possible interpretation for this finding is that lower carelessness in students without learning or performance goals is a result of these students having lower inquiry skills in general. By our operational definition, students completely lacking in a skill cannot demonstrate carelessness. Similarly, Clements (1982) found that carelessness was associated with better knowledge.

To address this possibility, we can compare students’ inquiry skills over consecutive activities in each condition, measured as the mean values of P(L_n-1)—averaged over consecutive activities for each student—over the different clusters. P(L_n-1) was in fact significantly lower in Cluster 3 (M = 12%, SD = 10%) than in Cluster 1 (M = 24%, SD = 27%), t(47.1) = 2.4, p < .05, or Cluster 2 (M = 22%, SD = 22%), t(71.3) = 2.7, p < .01. Hence, these results imply that the differences in carelessness between clusters may be a result of the observed differences in student inquiry skills. Further evidence for this can be seen in the changes in carelessness over time in the three clusters. The subset of 106 students who completed the first three activities was taken, and carelessness was plotted over time. As shown in Figure 1, carelessness increases considerably over the course of the three activities for Cluster 1 and Cluster 2 but does not increase over time for Cluster 3. Students in Cluster 3 are not developing the degree of inquiry skill that makes careless errors feasible.

Figure 1.

Students’ mean carelessness values over consecutive activities by clusters.

Discussion

In this article, we have presented a case study in discovery with models: studying the motivations associated with carelessness in computer-based science inquiry. In discovery with models, a model is developed and applied to data and then used as a component in other analyses, typically to discover aspects of the construct in the model. In this case, we developed and applied a machine-learned detector of student carelessness to log files and correlated its outputs to motivational questionnaires to discover the relationships between motivational measures and carelessness.

A key benefit of discovery with models is being able to study behavioral constructs in a nondisruptive fashion that is scalable, longitudinal, and fine-grained. In this specific case, using the automated detector allowed us to study carelessness in a more naturalistic and scalable fashion than was possible through interviews and repeated assessment of the same items (cf. Clements, 1982; Newman, 1977) and in a finer-grained fashion than was possible through questionnaire measures of carelessness (Maydeu-Olivares & D’Zurilla, 1996; Rodríguez-Fornells & Maydeu-Olivares, 2000). Carelessness is a difficult construct to label by hand but can be inferred—as it is here—by the pattern of student correctness and the probability that student errors were not a result of a lack of knowledge. In this article, both the carelessness model and the student knowledge model used in analyses were explicitly validated in the data set where they were applied, an essential step for validly using a model in an analysis of this nature. Models can be applied more widely as well, if attention is paid to demonstrating generalizability for the type of transfer being conducted. Our carelessness model is based on the concept and mathematics of BKT, a widely used algorithm for modeling student knowledge, which can be reconstructed in any learning environment in which students repeatedly interact with problems that have correct answers (or behaviors) that can be automatically detected. Hence, it is feasible to extend the work presented here to a variety of topics, domains, and populations.

Our results (the “discovery” in discovery with models) indicate that certain types of students are linked with carelessness but simultaneously suggest a substantial role for student knowledge in carelessness. We find that student learning orientation can be described by three clusters: students characterized by high levels of learning goal orientation and academic efficacy, students characterized by high levels of both performance–approach and performance–avoid goals, and students lower in both types of goals. Students in the clusters characterized by learning or performance goals have (on average) double the probability of carelessness as students in the third cluster. One potential interpretation is that students with higher mastery or performance goals succeed in learning and correspondingly become more confident, as suggested by Clements (1982), and that this confidence leads to carelessness, despite their identified goal orientations. Our results suggest that students in these two clusters achieve higher learning, particularly over time, and that this factor may drive the differences in carelessness. It is also worth noting that academic efficacy, also measured by PALS (Midgley et al., 2000), is higher in these two clusters.

It is surprising that carelessness was substantially less common in our data set than in previous research (cf. Baker & Gowda, 2010; Clements, 1982; San Pedro et al., 2011). It may be that students solving problems in mathematics—the domain used by Baker (Baker & Gowda, 2010; San Pedro et al., 2011) and by Clements (1982)—tend to be more careless than students in science inquiry because of differences between these two domains. By its very nature, scientific inquiry is exploratory and involves active self-direction in searching for knowledge (Anderson, 2002; Haury, 1993; Novak, 1964), potentially reducing carelessness. Alternatively, some aspect of the science microworld or its novelty may have reduced the amount of carelessness exhibited.

The relationships found here between goal orientation and carelessness worth are comparing to those found in previous research on other disengaged behaviors, much of it also conducted with a discovery with models approach. Neither gaming the system nor off-task behaviors were found to be correlated with performance and learning goals (Baker, 2007b; Baker, Walonoski, et al., 2008). Hence, our results provide somewhat of a contrast to these previous results. One difference between carelessness and these behaviors may be the situations in which carelessness occurs: Although gaming often occurs when a student has not yet learned a skill (Baker et al., 2004) and off-task behavior is distributed throughout the learning experience, careless errors occur after a skill has been learned. Similarly, other measures of disengagement appear to have different relationships to goal orientation than seen here for carelessness. Luo, Paris, Hogan, and Luo (2011) found that students with performance–approach goals report paying more attention to work in class and homework than students with neither learning goals nor performance–approach goals (e.g., the students corresponding to the third cluster). These findings are seemingly inconsistent with the findings in this article; one possible explanation is that the different results stem from using a very different measurement approach (self-report questionnaires) or the difference in domain. Overall, these results demonstrate that carelessness and learning orientation have surprising relationships, which merit further investigation. Understanding the factors that lead successful students to become careless may support the design of interventions to assist these students in maintaining high performance even when they have learned the material.

More broadly, this article demonstrates the potential of discovery with models analyses for conducting analyses of learning behaviors that are difficult to study with more traditional methods. In the current article, the learning behavior examined is carelessness, the model used to study it is a machine-learned decision tree, and our “discovery” is evidence on the relationships between a student’s carelessness and his or her goals, beliefs, and attitudes. By using discovery with models in combination with other methods, we may be able to shed light on questions that have thus far been intractable. As such, we view discovery with models as a key learning analytics method for 21st-century learning science.

Footnotes

Acknowledgements

We would like to thank Matthew Bachmann for his help in data coding.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by National Science Foundation Grant DRL-1008649, “Empirical Research: Emerging Research: Using Automated Detectors to Examine the Relationships Between Learner Attributes and Behaviors During Inquiry in Science Microworlds,” awarded to Janice Gobert and Ryan Baker. Any opinions are those of the authors and do not necessarily reflect those of the funding agency.

Author Biographies

Arnon Hershkovitz is a postdoctoral research associate at Teachers College, Columbia University, and an affiliated postdoctoral research fellow with the Pittsburgh Science of Learning Center. Previously, he was a postdoctoral research fellow in the Learning Sciences and Technologies Program at Worcester Polytechnic Institute. He earned his PhD in science education from the School of Education at Tel Aviv University; he earned his MA in applied mathematics and his BA (cum laude) in mathematics and computer science from the Technion–Israel Institute of Technology. His current research is mostly focused on using data mining techniques for studying various aspects of students’ engagement and affect while using computer-based learning systems.

Ryan Shaun Joazeiro de Baker is the Julius and Rosa Sachs Distinguished Lecturer at Teachers College, Columbia University. He earned his PhD in human-computer interaction from Carnegie Mellon University’s School of Computer Science and was a postdoctoral fellow in the Learning Sciences at the University of Nottingham. He earned his bachelor’s degree (ScB) in computer science from Brown University. He has been assistant professor of psychology and the learning sciences at Worcester Polytechnic Institute. He previously served as the first technical director of the Pittsburgh Science of Learning Center DataShop, the largest public repository for data on the interaction between learners and educational software. He is currently serving as the founding president of the International Educational Data Mining Society and as associate editor of the Journal of Educational Data Mining.

Janice Gobert is an associate professor of learning sciences and psychology in the Department of Social Sciences and Policy Studies and the Department of Computer Science at Worcester Polytechnic Institute. She is the co-director of the Learning Sciences and Technologies Program. Her PhD from the University of Toronto (1994) is in cognitive science; her master’s is from McGill University, also in cognitive science. Her specialty is in technology-based, with visualizations and simulations in scientific domains; her research areas are intelligent tutoring systems for science, skill acquisition, performance assessment via log files, learning with visualizations, learner characteristics, and epistemology. Before coming to WPI, she was an assistant professor at Western Michigan University, a visiting scholar at Harvard University’s School of Education, and senior research scientist at the Concord Consortium (). Until December 2007, she served as North American editor for the International Journal of Science Education. She has been principal investigator on many projects that address technology-based science learning and assessment.

Michael Wixon is finishing a master’s degree in learning sciences and technologies at Worcester Polytechnic Institute, advised by Janice Gobert and Ryan Baker. His interests center on disengagement and affect. He has collaborated on some detectors of affect and a detector of without thinking fastidiously–type behaviors, a form of disengaged behavior.

Michael Sao Pedro is CTO of Apprendis LLC and is completing a PhD in the learning sciences interdisciplinary program at Worcester Polytechnic Institute. Prior to these positions, he was employed as a senior software engineer for BAE Systems, Advanced Information Technologies (AIT), formerly ALPHATECH, Inc. His current research focuses on leveraging data mining techniques to assess and track students’ scientific inquiry skills as they experiment with computer-based simulations. More generally, he is interested in skill assessment and prediction within intelligent tutoring systems for ill-defined domains and understanding how new educational technology integrates within classrooms and curricula.

References

Aleven

Mclaren

Roll

Koedinger

(2006). Toward meta-cognitive tutoring: A model of help seeking with a cognitive tutor. International Journal of Artificial Intelligence in Education, 16(2), 101-128.

Anderson

R. D.

(2002). Reforming science teaching: What research says about inquiry. Journal of Science Teacher Education, 13(1), 1-12.

Arnold

K. E.

(2010). Applying academic analytics. Educause Quarterly, 33(1). Retrieved from http://www.educause.edu/ero/article/signals-applying-academic-analytics

Arroyo

Woolf

B. P.

Cooper

Burleson

Muldner

(2011). The impact of animated pedagogical agents on girls’ and boys’ emotions, attitudes, behaviors, and learning. In Proceedings of the 11th IEEE Conference on Advanced Learning Technologies (pp. 506-510). New York, NY: Institute of Electrical and Electronics Engineers.

Baker

R. S. J. d.

(2007a). Is gaming the system state-or-trait? Educational data mining through the multi-contextual application of a validated behavioral model. In Complete On-Line Proceedings of the Workshop on Data Mining for User Modeling at the 11th International Conference on User Modeling 2007 (pp. 76-80). Boston, MA: User Modeling Inc.

Baker

R. S. J. d.

(2007b). Modeling and understanding students’ off-task behavior in intelligent tutoring systems. In Proceedings of ACM CHI 2007: Computer-Human Interaction (pp. 1059-1068). New York, NY: Association for Computing Machinery.

Baker

R. S. J. d.

Corbett

A. T.

Aleven

(2008). More accurate student modeling through contextual estimation of slip and guess probabilities in Bayesian knowledge tracing. In Proceedings of the 9th International Conference on Intelligent Tutoring Systems (pp. 406-415). New York, NY: Springer.

Baker

R. S. J. d.

Corbett

A. T.

Koedinger

K. R.

(2004). Detecting student misuse of intelligent tutoring systems. In Proceedings of the 7th International Conference on Intelligent Tutoring Systems (pp. 531-540). New York, NY: Springer.

Baker

R. S. J. d.

Corbett

A. T.

Roll

Koedinger

K. R.

(2008). Developing a generalizable detector of when students game the system. User Modeling and User-Adapted Interaction, 18(3), 287-314.

10.

Baker

R. S. J. d.

Corbett

A. T.

Wagner

A. Z.

(2006). Human classification of low-fidelity replays of student actions. In Proceedings of the Educational Data Mining Workshop at the 8th International Conference on Intelligent Tutoring Systems (pp. 29-36). New York, NY: Springer.

11.

Baker

R. S. J. d.

Gowda

S. M.

(2010). An analysis of the differences in the frequency of students’ disengagement in urban, rural, and suburban high schools. In Proceedings of the 3rd International Conference on Educational Data Mining (pp. 11-20). New York, NY: International Educational Data Mining Society.

12.

Baker

R. S. J. d.

Mitrovic

Mathews

(2010). Detecting gaming the system in constraint-based tutors. In Proceedings of the 18th Annual Conference on User Modeling, Adaptation, and Personalization (pp. 267-278). New York, NY: Springer.

13.

Baker

R. S. J. d.

Walonoski

Heffernan

Roll

Corbett

Koedinger

(2008). Why students engage in “gaming the system” behavior in interactive learning environments. Journal of Interactive Learning Research, 19(2), 185-224.

14.

Baker

R. S. J. d.

Yacef

(2009). The state of educational data mining in 2009: A review and future visions. Journal of Educational Data Mining, 1(1), 3-17.

15.

Beal

C. R.

Lee

(2008). Mathematics motivation and achievement as predictors of high school students’ guessing and help-seeking with instructional software. Journal of Computer Assisted Learning, 24(6), 507-514.

16.

Beck

J. E.

(2005). Engagement tracing: Using response times to model student disengagement. In Proceedings of the International Conference on Artificial Intelligence and Education (AIED2005) (pp. 88-95). Amsterdam, Netherlands: IOS Press.

17.

Bong

(2001). Between-and within-domain relations of academic motivation among middle and high school students: Self-efficacy, task value, and achievement goals. Journal of Educational Psychology, 93(1), 23-34.

18.

Casey

(1979). An analysis of errors made by junior secondary pupils on written mathematical tasks (Unpublished master’s thesis). Monash University, Melbourne, Australia.

19.

Chen

Klahr

(1999). All other things being equal: Acquisition and transfer of the control of variables strategy. Child Development, 70(5), 1098-1120.

20.

Clarkson

(1983). Types of errors made by Papua New Guinean students. Educational Studies in Mathematics, 14(4), 355-367.

21.

Clements

(1980). Analyzing children’s errors on written mathematical tasks. Educational Studies in Mathematics, 11(1), 1-21.

22.

Clements

(1982). Careless errors made by sixth-grade children on written mathematical tasks. Journal for Research in Mathematics Education, 13(2), 136-144.

23.

Cocea

Hershkovitz

Baker

R. S. J.

(2009). The impact of off-task and gaming behaviors on learning: Immediate or aggregate? In Proceedings of the 14th International Conference on Artificial Intelligence in Education (pp. 507-514). Amsterdam, Netherlands: IOS Press.

24.

Collins

Stevens

A. L.

(1991). A cognitive theory of inquiry teaching. In Goodyear

(Ed.), Teaching knowledge and intelligent tutoring (pp. 203-230). Norwood, NJ: Ablex.

25.

Corbett

A. T.

Anderson

J. R.

(1995). Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4), 253-278.

26.

Darnon

Harackiewicz

J. M.

Butera

Mugny

Quiamzade

(2007). Performance-approach and performance-avoidance goals: When uncertainty makes a difference. Personality and Social Psychology Bulletin, 33(6), 813-827.

27.

Dawson

Macfadyen

Lockyer

Mazzochi-Jones

(2011). Using social network metrics to assess the effectiveness of broad-based admission practices. Australasian Journal of Educational Technology, 27(1), 16-27.

28.

De La Torre

Douglas

J. A.

(2004). Higher-order latent trait models for cognitive diagnosis. Psychometrika, 69(3), 333-353.

29.

Elliot

A. J.

Church

M. A.

(1997). A hierarchical model of approach and avoidance achievement motivation. Journal of Personality and Social Psychology, 72, 218-232.

30.

Everitt

B. S.

Landau

Leese

Stahl

(2011). Cluster analysis (5th ed.). West Sussex, UK: John Wiley.

31.

Georges

(1929). The nature of difficulties encountered in reading mathematics. School Review, 37(3), 217-226.

32.

Gong

Beck

Heffernan

N. T.

Forbes-Summers

(2010). The impact of gaming (?) on learning at the fine-grained level. In Proceedings of the 10th International Conference on Intelligent Tutoring Systems (pp. 194-203). New York, NY: Springer.

33.

Haury

D. L.

(1993). Teaching science through inquiry. Retrieved from ERIC database. (ED359048)

34.

Hurlock

McDonald

(1934). Undesirable behavior traits in junior high school students. Child Development, 5(3), 278-290.

35.

Jeong

Biswas

(2008). Mining student behavior models in learning-by-teaching environments. In Proceedings of the 1st International Conference on Educational Data Mining (pp. 127-136). New York, NY: International Educational Data Mining Society.

36.

Johns

Woolf

B. P.

(2006). A dynamic mixture model to detect student motivation and proficiency. In Proceedings of the National Conference on Artificial Intelligence (pp. 163-168). Palo Alto, CA: Association for the Advancement of Artificial Intelligence.

37.

Junker

B. W.

Sijtsma

(2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25(3), 258-272.

38.

Ketter

L. C.

(2006). High-stakes testing, achievement-goal structures, academic-related perceptions, beliefs, strategies, and school belonging among selected eighth-grade students in a northwest Florida school district (Unpublished doctoral dissertation). University of West Florida, Pensacola.

39.

Kinnebrew

J. S.

Biswas

Sulcer

(2010). Modeling and measuring self-regulated learning in teachable agent environments. Journal of e-Learning and Knowledge Society, 7(2), 19-35.

40.

Luo

Paris

S. G.

Hogan

Luo

(2011). Do performance goals promote learning? A pattern analysis of Singapore students’ achievement goals. Contemporary Educational Psychology, 36(2), 165-176.

41.

Maydeu-Olivares

D’Zurilla

T. J.

(1996). A factor-analytic study of the Social Problem-Solving Inventory: An integration of theory and data. Cognitive Therapy and Research, 20(2), 115-133.

42.

McClure

(1929). Characteristics of problem children based on judgments of teachers. Journal of Juvenile Research, 13, 124-140.

43.

Middleton

M. J.

Midgley

(1997). Avoiding the demonstration of lack of ability: An underexplored aspect of goal theory. Journal of Educational Psychology, 89(4), 710-718.

44.

Midgley

Maehr

M. L.

Hruda

L. Z.

Anderman

Freeman

K. E.

. . . Urdan

, (2000). Manual for the Patterns of Adaptive Learning Scales (PALS). Ann Arbor, MI: University of Michigan.

45.

Mierswa

Wurst

Klinkenberg

Scholz

Euler

(2006). YALE: Rapid prototyping for complex data mining tasks. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 935-940). New York, NY: Association for Computing Machinery.

46.

Militello

Heffernan

(2009). Which one is “just right”? What every educator should know about formative assessment systems. Kansas City, MO: National Council of Professors of Educational Administration.

47.

Morgan

(1979). A criterion-referenced measurement model with corrections for guessing and carelessness. Camberwell: Australian Council for Educational Research.

48.

Muldner

Burleson

Van

Sande

VanLehn

(2011). An analysis of students’ gaming behaviors in an intelligent tutoring system: Predictors and impacts. User Modeling and User-Adapted Interaction, 21(1), 99-135.

49.

Newman

M. A.

(1977). An analysis of sixth-grade pupils’ errors on written mathematical tasks. Victorian Institute for Educational Research Bulletin, 39, 31-43.

50.

Novak

(1964). Scientific inquiry. Bioscience, 14, 25-28.

51.

Pardos

Z. A.

Baker

R. S. J. d.

Gowda

S. M.

Heffernan

N. T.

(2011). The sum is greater than the parts: Ensembling models of student knowledge in educational software. SIGKDD Explorations, 13(2), 37-44.

52.

Raftery

A. E.

(1995). Bayesian model selection in social research. Sociological Methodology, 25, 111-164.

53.

Rodríguez-Fornells

Maydeu-Olivares

(2000). Impulsive/careless problem solving style as predictor of subsequent academic achievement. Personality and Individual Differences, 28(639), 639-645.

54.

Romero

Ventura

(2007). Educational data mining: A survey from 1995 to 2005. Expert Systems With Applications, 33(1), 135-146.

55.

Romero

Ventura

(2010). Educational data mining: A review of the state of the art. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 40(6), 601-618.

56.

Ross

M. E.

Shannon

D. M.

Salisbury-Glennon

J. D.

Guarino

(2002). The Patterns of Adaptive Learning Survey: A comparison across grade levels. Educational and Psychological measurement, 62(3), 483-497.

57.

San Pedro

Baker

R. S. J. d.

Rodrigo

. (2011). Detecting carelessness through contextual estimation of slip probabilities among students using an intelligent tutor for mathematics. In Proceedings of 15th International Conference on Artificial Intelligence in Education (pp. 304-311). Amsterdam, Netherlands: IOS Press.

58.

Sao Pedro

M. A.

Baker

R. S. J. d.

Gobert

Montalvo

Nakama

. (in press). Leveraging machine-learned detectors of systematic inquiry behavior to estimate and predict transfer of inquiry skill. User Modeling and User-Adapted Interaction.

59.

Sao Pedro

M. A.

Baker

R. S. J. d.

Montalvo

Nakama

Gobert

J. D

. (2010). Using text replay tagging to produce detectors of systematic experimentation behavior patterns. In Proceedings of the 3rd International Conference on Educational Data Mining (pp. 181-190). New York, NY: International Educational Data Mining Society.

60.

Siemens

Long

(2011). Penetrating the fog: Analytics in learning and education. Educause Review, 46(5), 30-40.

61.

Walonoski

Heffernan

(2006). Detection and analysis of off-task gaming behavior in intelligent tutoring systems. In Proceedings of the 8th International Conference on Intelligent Tutoring Systems (pp. 382-391). New York, NY: Springer.

62.

Witten

I. H.

Frank

Hall

M. A.

(2011). Data mining: Practical machine learning tools and techniques (3rd ed.). Burlington, MA: Morgan Kaufmann.