Abstract
The Next Generation Science Standards (NGSS) for K-12 science education integrate science inquiry practices with scientific concepts and ideas. The challenge with implementing this framework has been determining how to provide students with authentic scientific experiences and real-time individualized scaffolding during inquiry, as well as how to reliably and validly assess students’ inquiry competencies. This article first reviews current computer-based educational technologies, namely, educational data mining and natural language processing, and describes how these technologies have been used to automatically assess science inquiry practices aligned with NGSS practices. The second section describes the implementation of real-time adaptive, individualized scaffolds and instruction, based on automated inquiry assessment techniques. Finally, we aim to direct the attention of policy makers toward the use of technology to promote significant progress in nationwide inquiry-based learning, teaching, and assessment.
Tweet
Computer-based techniques, such as educational data mining and natural language processing, allow for automatically scoring science inquiry competencies and the facilitation of personalized instruction and tutoring that adapt to individual students.
Key Points
Educators require assessments to accurately and effectively capture students’ science inquiry competencies according to national and international science standards.
Researchers have developed and implemented automated assessment techniques for online inquiry environments that can capture students’ investigative and writing competencies.
Automated techniques allow for the use of real-time scaffolding for students and alerting for teachers as students conduct virtual inquiry investigations.
Policy makers should consider how large-scale implementation of virtual science inquiry environments could affect the current standing of the United States in terms of students’ science proficiency.
Researchers will need to collaborate to develop wide-scale automated assessments and educational technologies that capture the full complement of the Next Generation Science Standards’ practices, crosscutting concepts, and disciplinary core ideas to best support all students.
Introduction
Science is everywhere in our daily lives. Simple memorization of facts cannot equip students with the deep knowledge and scientific thinking required of citizens in the 21st century. To be educated as scientifically literate persons, students need to be appropriately assessed. Three primary standards appear in the Programme for International Student Assessment (PISA; 2015) science framework: “knowledge of the scientific ideas and the questions that frame the practice and goals of science, knowledge and understanding of scientific inquiry, and the ability to interpret data and evidence scientifically” (Organisation for Economic Co-Operation and Development [OECD], 2016, p. 15). At the international level, the OECD administered the PISA to 15-year-olds throughout the world. The PISA 2015 assessment specifically concentrated on science and collaborative problem solving (compared with numeracy, literacy, and other cognitive skills) because these two areas were regarded as particularly important to economic development throughout the globe. PISA 2015 ranked the United States an unimpressive 25th place in science, out of 72 countries, and 19th place out of 35 OECD countries. PISA assessments correlate with economic progress, so they have a salient impact on policy decisions regarding science curriculum, science education, and capital investments globally.
In the United States, the National Research Council (NRC; 2012) Framework for K-12 Science Education and the Next Generation Science Standards (NGSS; NGSS Lead States, 2013) have reformed science assessment standards to align with the PISA science framework. The National Assessment of Educational Progress (NAEP) administered science tests to U.S. students in 2015 to evaluate student proficiencies relative to these standards. Only 22% to 38% of students were evaluated as proficient or better, whereas 24% to 40% of the fourth, eighth, and 12th graders were evaluated as “below basic” (National Center for Education Statistics [NCES], 2015). Results from both international and national science assessments warn us that U.S. students’ science competencies need improvement to catch up to other OECD countries and to ensure the United States a leading role in the world economy.
Science education reforms in the United States have increasingly advocated for including inquiry in science education. The NRC (2012) and NGSS Lead States (2013) summarized eight science inquiry practices for students in Grades K-12, including (a) asking questions, (b) developing and using models, (c) planning and carrying out investigations, (d) analyzing and interpreting data, (e) using mathematics and computational thinking, (f) constructing explanations, (g) engaging in argument from evidence, and (h) obtaining, evaluating, and communicating information. Expectations for each practice become increasingly sophisticated as students proceed through each grade level (NRC, 2012). Presently, 40 states have adopted or adapted the NGSS, and as a result, educators need to incorporate more inquiry experiences into science instruction and assessment. Meeting these goals, however, poses significant challenges, including, but not limited to, rising costs, the limited availability and maintenance demands of traditional hands-on labs, and the lack of economical, authentic, inquiry-based learning and assessment environments available to teachers and students (Boesdorfer & Livermore, 2018). Consequently, students receive minimal exposure to inquiry-based instruction and investigations (OECD, 2016).
To address these challenges, learning and teaching through computer-based inquiry simulations could enable students to view and interact with models of scientific phenomena or processes (de Jong & Van Joolingen, 1998). In other words, within the simulated environments, students can engage in the science inquiry practices outlined in the NGSS. The use of computer-based simulations can prevent expenses involved in scoring hands-on labs, reduce learning time, and enable the collection of detailed logs of student–computer interactions (Gobert, Moussavi, Li, Sao Pedro, & Dickler, 2018). Detailed logs include both event logs, such as key clicks, and text logs, such as written responses (Li, Gobert, & Dickler, 2018). The resulting log files containing the data are often massive and complex, making it difficult to identify and extract useful information to directly evaluate students’ inquiry performance. Challenges with interpreting these overwhelmingly large data have led to a continued use of conventional multiple-choice items or summative open response/essay questions to evaluate student inquiry competencies even within new simulation-based inquiry assessment systems (R. S. Baker, Clarke-Midura, & Ocumpaugh, 2016). These question types, however, do not capture the rich processes involved in science inquiry (Gobert, Sao Pedro, Raziuddin, & Baker, 2013).
Virtual inquiry within simulations and online environments generates detailed log files. Log data can inform performance assessment if analyzed efficiently and effectively (Gobert et al., 2018; Quellmalz, Timms, Silberglitt, & Buckley, 2012). Advanced educational technology, such as educational data mining (EDM) and natural language processing (NLP), can automatically assess inquiry competencies using students’ logged actions and writing. Specifically, these technologies can assess performance on multiple competencies aligned to NGSS practices (as in the inquiry intelligent tutoring system, Inq-ITS; Gobert et al., 2018). Automated assessments can identify students’ difficulties during inquiry and enable the subsequent application of real-time feedback: adaptive, individualized scaffolding for students, and real-time reports and suggestions that help instructors conduct individualized instruction (Li, Gobert, Dickler, & Moussavi, 2018).
This article aims to alert policy makers both to (a) technology in educational settings—to enhance inquiry-based science learning, teaching, and assessment—and to (b) the support researchers and teachers need to yield significant progress on nationwide inquiry-based learning and assessment. This article consists of three sections. The first section describes current educational technologies and how these technologies can automatically assess science inquiry practices according to two categories. One category involves doing science, that is, assessment of investigative competencies (aligned with NGSS practices 1-5). Another category addresses writing about science, that is, the assessment of students’ explaining or arguing using evidence (linked to NGSS practices 6-8). The second section describes real-time adaptive, individualized scaffolding and instruction, based on automated inquiry assessment. Finally, we discuss implications for policy makers in terms of professional development for teachers and large-scale assessments that fully capture students’ science inquiry.
Automated Assessment for Doing Science
Computer-based inquiry environments can enhance the teaching and learning of science inquiry. Computer environments include interactive simulations (e.g., Gobert et al., 2018), cloud labs (virtual experiments that can be accessed by multiple users at the same time from various locations; e.g., Hossain et al., 2017), virtual reality environments (e.g., Sung, Hwang, Wu, & Lin, 2018), and educational games or simulations (e.g., Tsai, 2017). These environments involve dynamic, interactive labs with rich simulations (de Jong, Linn, & Zacharia, 2013; Gobert et al., 2018) that mimic real-world processes or situations (de Jong et al., 2013). Simulations have been designed for multiple users (Clarke-Midura, Code, Zap, & Dede, 2012; Ketelhut, Dede, Clarke, Nelson, & Bowman, 2017), three-dimensional (3D) immersive environments (Sung et al., 2018), intelligent tutoring systems (ITSs; online learning and assessment environments) where one human learner interacts with an animated pedagogical agent (Gobert et al., 2018; Li, Gobert, Dickler, & Moussavi, 2018), or conversation-based ITSs where one human learner interacts with two conversational computer agents (Zapata-Rivera, Liu, Chen, Hao, & von Davier, 2017). Many of these educational technologies not only enable students to experience realistic science inquiry but also automatically assess students’ inquiry competencies for both doing science and writing about science.
Assessing students’ competencies at “doing” science inquiry, when online, requires examining their “behaviors” within virtual environments. Computer-based inquiry environments record students’ behaviors, including actions, mouse clicks, and other metadata (e.g., response time) during their investigations. These raw data do not easily capture students’ performance on inquiry practices. EDM techniques, however, can detect patterns from large volumes of ill-defined and ill-structured data (R. S. J. Baker & Yacef, 2009; Scheuer & McLaren, 2011). Using EDM to automate inquiry assessment involves two challenges: (a) achieving human agreement in judging the quality of complex inquiry processes, and (b) extracting useful information from extensive logged behavioral data. The following sections present solutions to these challenges.
Human Scoring of Investigative Competencies
It is difficult to yield high agreement between two human coders using the same rubric to evaluate students’ investigative performance within log files, due to both the complexity of inquiry practices and the large volume of student actions tracked within a single student’s log file. For example, when conducting an experiment in a simulation, some students may run repeated trials or search for relationships across different experiments (Gobert et al., 2013). In both cases, students understand how to design controlled experiments but are engaging in other kinds of valid exploration behaviors. The successive controlled experiments scoring rule, though, would yield a low estimate of skill (Gobert et al., 2013). Specifically, some rubrics are too stringent and allow students to receive credit for designing a controlled experiment only if they run a pair of sequential trials where they change only one variable at a time. Other evaluations are too lenient, to the point where students receive credit for designing a controlled experiment as long as any two trials have a target variable that changes while the other variables are kept the same. Evaluations need to consider multiple features of student inquiry to avoid biased judgments. EDM allows for taking multiple features of log file data and the interactions between these features into account when evaluating student competencies.
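To make the contrast between the stringent and lenient rules concrete, the following minimal Python sketch implements both over a list of simulation trials. The trial representation (a dictionary of variable settings), the function names, and the example variables are assumptions made for illustration; they are not taken from Inq-ITS or from any published rubric.

```python
from itertools import combinations


def differing_variables(trial_a, trial_b):
    """Return the names of the variables whose values differ between two trials."""
    return {name for name in trial_a if trial_a[name] != trial_b[name]}


def stringent_rule(trials, target):
    """Give credit only if two consecutive trials differ in the target variable alone."""
    return any(
        differing_variables(a, b) == {target}
        for a, b in zip(trials, trials[1:])
    )


def lenient_rule(trials, target):
    """Give credit if any two trials, consecutive or not, differ in the target alone."""
    return any(
        differing_variables(a, b) == {target}
        for a, b in combinations(trials, 2)
    )


# Hypothetical log: the first and third trials form a controlled comparison
# for "mass", but they are not adjacent.
trials = [
    {"mass": 1, "height": 10, "surface": "ice"},
    {"mass": 1, "height": 20, "surface": "sand"},
    {"mass": 2, "height": 10, "surface": "ice"},
]

print(stringent_rule(trials, "mass"))  # False
print(lenient_rule(trials, "mass"))    # True
```

The same log earns no credit under the stringent rule and full credit under the lenient one, which is the kind of disagreement that motivates considering several features of a log at once.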
Automated assessment techniques can be developed in multiple ways. Gobert and colleagues (Gobert & Sao Pedro, 2017; Gobert, Sao Pedro, Baker, Toto, & Montalvo, 2012) designed automated assessment within Inq-ITS by first manually scoring students’ log files for multiple inquiry practices (R. S. J. Baker & de Carvalho, 2008). Specifically, human coders segmented log files into clips that contained multiple student actions. Coders then examined each clip to determine whether actions within the clip exemplified a particular inquiry practice according to research on how students conduct inquiry. Finally, the coders labeled the clip for whether a particular behavior was exemplified. This text replay tagging method has been used to identify whether or not a student demonstrates a target behavior such as designing controlled experiments (Gobert et al., 2013). The manually scored clips can then be used to train computer-based assessment models to automatically detect in future samples whether a behavior appears, such as planning and carrying out investigations, and analyzing and interpreting data.
Researchers also use a method where sequences of actions within log files are examined, which is called sequential pattern mining (Taub, Azevedo, Bradbury, Millar, & Lester, 2018). Specifically, this method involves manually parsing and coding sequences of actions within log files to identify common patterns that occur. The frequency of patterns for a particular individual can then be calculated (Grafsgaard, 2014) and compared across other individuals. Researchers have used this method to train and develop an automated scoring system that can then be applied to distinguish between more and less proficient participants during gameplay (Fournier-Viger, Gomariz, Campos, & Thomas, 2014; Taub et al., 2018).
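The idea of counting how often action patterns occur, and comparing those counts across individuals, can be sketched in a few lines of Python. This is a deliberately simplified stand-in: it counts only contiguous runs of actions (n-grams), whereas full sequential pattern mining algorithms also handle gaps between actions. The action labels, student identifiers, and function names are hypothetical.

```python
from collections import Counter


def ngram_counts(actions, n):
    """Count every contiguous run of n actions in one student's log."""
    return Counter(tuple(actions[i:i + n]) for i in range(len(actions) - n + 1))


def pattern_support(logs, n):
    """For each pattern, count how many students show it at least once."""
    support = Counter()
    for actions in logs.values():
        # set(...) keeps each pattern once per student, so support counts students.
        support.update(set(ngram_counts(actions, n)))
    return support


# Hypothetical action logs for two students.
logs = {
    "student_a": ["hypothesize", "run_trial", "run_trial", "analyze"],
    "student_b": ["run_trial", "run_trial", "analyze", "run_trial"],
}

support = pattern_support(logs, 2)
for pattern, count in support.most_common():
    print(pattern, count)
```

Per-student counts from `ngram_counts` could then serve as input features for a scoring model, while `pattern_support` indicates which patterns are common enough across students to be worth examining.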
Automated Scoring Models to Predict Investigative Competencies
Automated scoring models can use different underlying technologies or algorithms and be built using different computer packages. One challenge with building automated assessment models based on features within log files, however, is selecting the features that are most robust and result in the highest model performance. Features can include any observable characteristic or variable within the data, so a single log file could have thousands of features.
One method that has successfully overcome the complex task of feature selection first identifies potential behaviors of interest based on theoretical frameworks for science assessment (such as the PISA 2015 science framework, OECD, 2016; or the NGSS framework, NGSS Lead States, 2013). Selected actions then serve as features, based on both the inquiry behaviors of interest and information available within log files. Broad features based on student actions include movements, mouse clicks, and time spent on particular tasks within virtual environments. For example, within the science inquiry Virtual Performance Assessment environment, researchers have extracted features (such as the time students spent carrying out actions when transitioning between sections of the environment and whether students completed particular actions, e.g., talking to a computer character) to predict students’ general inquiry proficiencies (R. S. Baker et al., 2016; LaMar, Baker, & Greiff, 2017).
Other extracted features tend to be more content specific to capture performance on inquiry practices. In previous studies (Gobert et al., 2012, 2013), robust features emerged through manually scoring log files and training automated assessment models for four primary inquiry practices as well as 12 subpractices (Li, Gobert, Dickler, & Moussavi, 2018). For instance, when evaluating students’ proficiencies at designing experiments in an ITS simulation (Gobert et al., 2013), the initial automated scoring model involved 70 possible features. The final predictive features selected to build the model included the number of repeated trials, the total number of trials, the number of trials where only one variable was changed, the number of trials testing the hypothesis, and more to evaluate students’ proficiencies. These features are robust with diverse groups of students and match human coders with high precision, ranging from 84% to 99% accuracy (Gobert et al., 2018). These features account for not only student actions within an environment but also the content of student actions in relation to other actions.
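As an illustration of how features of this kind might be computed from a log, the Python sketch below derives four counts from a list of trials. The data layout and the operational definitions are assumptions made here, not the definitions used in the cited studies: a "repeated" trial is one whose settings exactly match an earlier trial, and a trial "tests the hypothesis" when it differs from the previous trial only in the hypothesized variable.

```python
def extract_features(trials, hypothesis_variable):
    """Compute simple experiment-design features from a list of trial settings."""
    seen = set()
    repeated = 0
    for trial in trials:
        key = tuple(sorted(trial.items()))  # hashable fingerprint of the settings
        if key in seen:
            repeated += 1
        seen.add(key)

    single_variable_changes = 0
    hypothesis_tests = 0
    for previous, current in zip(trials, trials[1:]):
        changed = {name for name in current if current[name] != previous[name]}
        if len(changed) == 1:
            single_variable_changes += 1
            if changed == {hypothesis_variable}:
                hypothesis_tests += 1

    return {
        "total_trials": len(trials),
        "repeated_trials": repeated,
        "single_variable_changes": single_variable_changes,
        "hypothesis_tests": hypothesis_tests,
    }


# Hypothetical log of four trials in a two-variable simulation.
trials = [
    {"mass": 1, "height": 10},
    {"mass": 2, "height": 10},
    {"mass": 2, "height": 10},
    {"mass": 2, "height": 20},
]

features = extract_features(trials, "mass")
print(features)
```

A feature dictionary like this one would be computed per student or per clip and passed, alongside human labels, to whatever classifier is being trained.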
Automated Assessment for Writing
Central NGSS practices, such as constructing explanations and arguing using evidence, involve communicating in a written, open response format. These responses are difficult and time-consuming to grade at scale, which often results in feedback being given too late to benefit student learning. Advancements in computational linguistics (Jurafsky & Martin, 2008), discourse science (Graesser, Li, & Forsyth, 2014), and mathematical representations of world knowledge (Landauer, McNamara, Dennis, & Kintsch, 2007) have helped develop automated assessments for written scientific explanations. Three types of automated techniques are used for scoring scientific explanations in inquiry contexts. The first approach is based on the accuracy of content within written responses. The second approach is based on the representation of language within responses. The third approach involves evaluating student utterances within conversational exchanges between a student and virtual agent(s). All techniques require rubrics that human raters use to manually grade student responses. Human scores serve as the gold standard criteria to evaluate the performance of computer scoring. The primary difference between the first two approaches would be that computer scoring in the first approach evaluates whether responses address target content or concepts, whereas computer scoring in the second approach uses typical linguistic features to predict the quality of language use within responses. The third approach attends to both the content and direction of conversations to draw a conclusion about students’ competencies.
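Comparing computer scores against the human gold standard usually comes down to an agreement statistic. The sketch below computes simple percent agreement and Cohen's kappa, which corrects agreement for chance. The choice of kappa is an assumption made here for illustration, since the studies cited in this article report a variety of metrics; the score lists are invented.

```python
from collections import Counter


def percent_agreement(human, machine):
    """Share of responses on which the human and computer scores match."""
    return sum(h == m for h, m in zip(human, machine)) / len(human)


def cohens_kappa(human, machine):
    """Chance-corrected agreement between two sets of categorical scores.

    Undefined (division by zero) when expected agreement is exactly 1,
    for example when both raters give every response the same score.
    """
    n = len(human)
    observed = percent_agreement(human, machine)
    human_counts = Counter(human)
    machine_counts = Counter(machine)
    expected = sum(
        human_counts[c] * machine_counts[c] for c in human_counts
    ) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented scores for eight responses (1 = concept present, 0 = absent).
human_scores = [1, 1, 0, 0, 1, 0, 1, 1]
machine_scores = [1, 1, 0, 1, 1, 0, 1, 0]

print(percent_agreement(human_scores, machine_scores))  # 0.75
print(cohens_kappa(human_scores, machine_scores))
```

In this invented example the raw agreement of 75% corresponds to a noticeably lower kappa, which is why chance-corrected statistics are often reported alongside accuracy.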
Three tools are used to automatically score open-ended responses based on students’ identification of key concepts: c-rater-ML (former version: c-rater) of Educational Testing Service (Liu, Rios, Heilman, Gerard, & Linn, 2016), the Summarization Integrated Development Environment (SIDE) developed by Rose and colleagues at Carnegie Mellon University (Nehm, Ha, & Mayfield, 2012), and EvoGrader developed by Nehm and his colleagues at Stony Brook University with the LightSIDE engine (Mayfield, Adamson, & Rosé, 2014). These three tools have one characteristic in common: automated scoring technologies that detect patterns and classify texts based on the presence or absence of particular scientific concepts. These tools have shown satisfactory agreement with human raters.
Besides these tools, methods that use “regular expressions” can capture the presence (or absence) of key concepts or complex expressions within written responses to state a claim, provide data as evidence, and reason using evidence. The regular expressions approach can account for alternative forms of words, including misspellings (such as \expl* handling explain, explains, explaining, explanation, explunate, and explinatin), and syntactically precise wording of intended messages (“the doctor knew that the patient sued the hospital” is different from “the hospital knew that the patient sued the doctor”). The regular expressions method takes a great deal of human effort to generate, but shows the best performance: scores that almost perfectly correlate with human scores (Li, Gobert, & Dickler, 2017a). Scoring written responses based on concepts and content allows for enhanced content-based feedback and scaffolding.
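A small Python sketch using the standard `re` module shows how both ideas in the preceding paragraph can be expressed. The concept names and patterns are hypothetical examples written for this illustration, not the expressions used in the cited work; the stem pattern is one way to render the informal \expl* notation in Python syntax.

```python
import re

# Hypothetical concept patterns: a stem followed by any word characters, so that
# inflected forms and many misspellings are still detected.
CONCEPT_PATTERNS = {
    "explain": re.compile(r"\bexpl\w*", re.IGNORECASE),
    "increase": re.compile(r"\bincreas\w*", re.IGNORECASE),
}

# Word order matters: this pattern requires the roles to appear in one sequence.
DOCTOR_KNEW = re.compile(
    r"\bdoctor\b.*\bknew\b.*\bpatient\b.*\bsued\b.*\bhospital\b",
    re.IGNORECASE,
)


def concepts_present(response):
    """Report, for each concept, whether the response mentions it."""
    return {
        name: bool(pattern.search(response))
        for name, pattern in CONCEPT_PATTERNS.items()
    }


variants = ["explain", "explains", "explaining", "explanation",
            "explunate", "explinatin"]
print(all(concepts_present(word)["explain"] for word in variants))  # True

print(bool(DOCTOR_KNEW.search(
    "the doctor knew that the patient sued the hospital")))  # True
print(bool(DOCTOR_KNEW.search(
    "the hospital knew that the patient sued the doctor")))  # False
```

Broad stems such as these trade precision for recall, which is part of why the approach demands considerable human effort: each pattern has to be checked against real student responses and tightened where it over-matches.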
Automated assessment techniques that attend to different components of written language (i.e., syntax, coherence, etc.) can be used to assess the quality of language use within written scientific explanations (Li, Gobert, Dickler, & Morad, 2018; Wiley et al., 2017). Automated text analysis tools, such as Coh-Metrix (Graesser et al., 2014), capture different components of written language. Coh-Metrix can extract hundreds of language features at multiple textual levels from the word level to the sentence, paragraph, text, and genre (Graesser & McNamara, 2011). Although Coh-Metrix can assess the cohesion, causality, and lexical diversity of written explanations (Li, Gobert, Dickler, & Morad, 2018; Wiley et al., 2017), its performance is generally lower than the regular expression approaches (Li et al., 2017a). Because linguistic features do not attend to the content of student writing (they focus solely on the use of language), these features are limited in terms of their ability to capture the variety of issues students confront when writing scientific explanations. Some researchers have combined student behavioral features with linguistic features to predict human scores of scientific explanations (Li, Gobert, & Dickler, 2018), but adding behavioral features did not largely increase the performance.
Conversations in virtual environments are also used to evaluate students’ competencies in writing in science inquiry contexts. For example, trialogues are used to assess science inquiry skills (Graesser et al., 2014). Trialogues involve a conversational exchange between a student and two virtual agents on a given topic, such as Volcanoes (Zapata-Rivera et al., 2017). Students’ competencies are partially evaluated based on the paths (i.e., series of utterances exchanged between the student and agents) of trialogue conversations. Paths are scored according to the accuracy and complexity of student responses within a conversation. Performance on paths, as well as other factors such as the help sought by students, allows for gaining a complete and accurate picture of students’ inquiry competencies. This technique, however, showed lower performance than regular expressions (Li et al., 2017a).
These automated methods make it possible to use multiple formats and techniques to fully assess students’ inquiry practice competencies. Some students tend to have “messy middle knowledge” in relation to various competencies. In other words, students may be highly competent in one area but struggle in a different area. Students demonstrate messy middle knowledge in terms of content understandings and reasoning abilities (Gotwals & Songer, 2009), and students also demonstrate inconsistencies in performance between their doing and writing in science inquiry contexts (Li, Gobert, & Dickler, 2017b). As a result, assessments should take into account and distinguish between the various forms of student competencies, such as inquiry proficiency, relative to general content knowledge or writing skills (Gobert et al., 2018; Li, Gobert, & Dickler, 2018; Quellmalz et al., 2012).
Real-Time Feedback and Alerting
Teachers often struggle to decide when and how to guide students, as difficulties can be unpredictable and vary extensively from individual to individual (Gobert et al., 2018). Without proper support, the full realization of the NGSS can be extremely difficult for teachers in terms of meeting the needs of all students. Automated assessment for science inquiry allows for providing automated, individualized feedback in real time based on students’ performance. Several systems use scaffolded feedback, which involves providing supports to a student engaged in a task that may otherwise be beyond that student’s current skill level (namely, beyond their Zone of Proximal Development [ZPD]; Vygotsky, 1978). Students often have difficulty completing science inquiry tasks without guidance (Hmelo-Silver, Duncan, & Chinn, 2007), which calls for integrating carefully designed scaffolds into science inquiry contexts within virtual assessments (Quintana et al., 2004). Real-time feedback facilitates improvements in inquiry practice competencies in terms of both scientific doing competencies (Gobert et al., 2018; Li, Gobert, & Dickler, 2018; Quellmalz et al., 2012) and explanatory writing-based competencies (Tansomboon, Gerard, Vitale, & Linn, 2017).
Scaffolding in online environments may involve providing hints (“How about Y?”), prompts (“Which force leads to this fall?”), and pumps (“What else?”) (Graesser et al., 2014) on how to go about completing a task. Sometimes, these scaffolds involve short utterances that are positive (“Excellent”), negative (“Nope”), or neutral (“Alright”). Scaffolds may also involve directed feedback based on student performance on a particular task (Li & Baer, 2018). Technological systems such as Inq-ITS (Gobert et al., 2018) and SimScientist (Quellmalz et al., 2012) provide scaffolded feedback based on students’ performances on particular inquiry practices and subpractices. Supports in these systems improved student inquiry proficiencies across practices and topics (Li, Gobert, Dickler, & Moussavi, 2018; Quellmalz et al., 2012). Personalized feedback that addresses student writing performance has also been successful, particularly when explicitly teaching students why they received particular feedback and how the feedback was developed based on their performance (Tansomboon et al., 2017).
Automated assessment of inquiry competencies also allows for real-time alerts to teachers regarding students’ competencies. Teacher dashboards are online platforms that report data on students’ performance as the students engage in science inquiry investigations (Gobert et al., 2018; Schifter, Natarajan, Ketelhut, & Kirchgessner, 2014). Data can appear within the dashboard in the form of statistics, bar graphs, or other visualizations. On some dashboard platforms, teachers may access information for individual students, as well as information on the entire class. For instance, the teacher dashboard Inq-Blotter provides teachers with information on students’ performance on various science inquiry practices as the students conduct investigations in Inq-ITS. Teachers can sort the alerts by time (i.e., the most recent alerts appear first), by type (i.e., organized according to practice, such as analyzing data or collecting data), and by student (i.e., each listed alphabetically). Teachers can click on an alert for a student to identify the inquiry practice, the student’s progress within a particular lab, and the student’s overall performance across each of the different inquiry practices.
Policy Implications
Some people remain skeptical of computers’ ability to score students’ actions and writing in science inquiry because of assumptions that only humans have the capacity to accurately grade student performance (Li, Shubeck, & Graesser, 2016). Some people also have concerns that log files cannot measure students’ competencies due to the complexity of analyzing large volumes of data, but advanced educational technologies make this analysis possible, efficient, and effective. In fact, when some automated systems’ performance is evaluated and compared with trained human judges, the automated systems demonstrate extremely high performance. In some contexts and applications, the best way to assess inquiry performance within computer-supported learning and assessment environments would be with EDM techniques or NLP techniques. These techniques are not only time- and cost-effective but also prevent biases that may occur as a result of human scoring. The public, educational professionals, and policy officials need to be educated on the many benefits of these automated assessment technologies for science inquiry in terms of time, cost, and general performance. With fewer financial resources going toward the grading of large-scale assessments for science inquiry, testing agencies and school districts can allocate these resources toward developing high-quality learning materials for teachers and students. School districts and teachers will benefit from the immediate and detailed feedback they receive regarding student performance. Students will benefit from the individualized and accurate feedback they receive as a result of engaging in high-tech environments. The next steps, however, require wide-scale development and implementation of these technologies through collaborations across various research groups as well as support from grant agencies, testing agencies, and policy officials.
Grant agencies have invested in projects to develop computer-supported inquiry environments, but it is rare to find an inquiry environment consistently implemented across Grades K to 12. This may explain why schools do not have an inquiry environment for students to use systematically in science class. Grant agencies should encourage proposals that develop scalable simulated environments in classrooms, as well as professional development for teachers on using advanced simulated labs, helping students access the simulated labs, and acquiring necessary computer equipment. Specifically, collaborations across researchers that focus on various grade levels and science topics are needed to ensure that assessment environments for the various disciplinary core ideas and science topics outlined in NGSS are available for all students from K through 12.
Meanwhile, grant agencies should also encourage proposals for assessments in simulated environments that evaluate the full complement of inquiry practices because “doing” inquiry is different from writing about inquiry (Li et al., 2017b). Doing and writing also differ from the content knowledge captured through multiple-choice questions (Li, Gobert, & Dickler, 2018). Students’ overall inquiry competencies need evaluating based on both behaviors and writing rather than traditionally formatted tests.
Inquiry environments are currently used to assess student performance in classrooms or online courses, but few high-stakes national assessments are used to directly evaluate the science inquiry investigation processes. Considering the scalability and effectiveness of many advanced educational technologies, this could easily change. For example, the Inq-ITS system enables automated, authentic, scalable assessment of middle school students’ inquiry competencies aligned to the NGSS practices (Gobert et al., 2013). An open course on the edX platform facilitates inquiry-based learning through the cloud lab to promote authentic scientific practices (Hossain et al., 2017). However, high-stakes assessments for inquiry skills still adopt traditional test formats. The agencies and testing services for large-scale assessment need to consider the possibility of integrating these educational technologies into their computer-based tests to accurately and efficiently capture students’ science inquiry competencies.
Overall, automated assessments and new educational technological developments will directly benefit not only students and teachers but also school districts and policy officials looking to best support students in reaching the high standards of performance in science of which they are capable. The United States’ current PISA ranking in science is just one indicator of the need to accept and utilize these educational technologies to help our students reach their fullest potential.
Footnotes
Declaration of Conflicting Interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Dr. Janice Gobert is CEO and Co-Founder of Apprendis, who is commercializing Inq-ITS and Inq-Blotter.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
