Examining Curriculum-Based Measurement Screening Tools in Middle School Science: A Scaled Replication Study

Abstract

This replication study examined the alternate form reliability, criterion validity, and predictive utility of two curriculum-based measurement (CBM) tools in science, Vocabulary-Matching (VM) and Statement Verification for Science (SV-S), for the purpose of screening. In all, 205 seventh-grade students from four middle schools were given alternate forms of each science CBM tool. Scores from the Idaho Standards Achievement Test (ISAT) science assessment were obtained. Stronger evidence of reliability and validity with the ISAT was found for VM compared with SV-S. With regard to predictive utility, VM more accurately classified students’ at-risk status compared with SV-S for identifying proficiency on the ISAT. Practical implications and directions for future research are also discussed.

Keywords

curriculum-based assessment education assessment response to intervention (RTI)/multi-tiered system of supports (MTSS) science disciplines and subjects validity measurement reliability diagnostic classification models

Recently, the National Center for Education Statistics (NCES; 2016) reported that over 63% of eighth-grade students scored below basic levels in science knowledge. The authors of the Next Generation Science Standards (NGSS) suggest that science should be comprehensible to all students (Lee, Miller, & Januszyk, 2014). However, many students struggle to comprehend the content presented in science classes because of its complex vocabulary (Fang, 2006). Deficits in word knowledge represent primary barriers to engaging with science, communicating and interacting with science content and activities, and ultimately developing science content knowledge and interest that facilitates subsequent access to science, technology, engineering, and mathematics education and careers (Therrien, Benson, Hughes, & Morris, 2017).

Students need to access the vocabulary of grade-level science curriculum to communicate conceptual understanding. This requires educators to be equipped with efficient tools to adequately support students in science learning and help all students fulfill their potential, including at the secondary level (Bravo & Cervetti, 2008; Espin et al., 2013). As students get older and progress in school, vocabulary knowledge and reading comprehension become interdependent, especially in classes where texts increasingly include substantial content-specific terminology (Busch & Espin, 2003). To proactively identify and monitor students at risk for content-area difficulties, teachers need a brief, easy to administer, and easy to score assessment that can be used for making inferences regarding students’ comprehension of vocabulary related to content that has been, is being, or will be taught. Universal screening promotes early identification, intervention, and subsequent prevention of risk for failure (Glover & Albers, 2007). This proactive, rather than reactive, model improves students’ long-term performance by addressing student needs prior to experiencing failure (Jenkins, Hudson, & Johnson, 2007).

Curriculum-based measurement (CBM; Hosp, Hosp, & Howell, 2016) is a type of brief assessment that is commonly used in K-12 settings to assist educators with screening (i.e., identifying students at risk) and progress monitoring (i.e., measuring growth over time) in reading, mathematics, and writing to make instructional decisions. CBM tools in content areas such as social studies (e.g., Espin, Busch, Shin, & Kruschwitz, 2001) and, more recently, science (e.g., Ford & Hosp, 2017; Ford, Conoyer, Lembke, Smith, & Hosp, 2018; Espin et al., 2013) have also been examined, but there is still much to be explored.

Science CBM Screening Studies

To date, only a few researchers have examined the development and use of CBM as a screening tool in the area of science. The majority of these studies have focused on Vocabulary-Matching (VM) and Statement Verification for Science (SV-S).

VM

The VM measure requires a student to match a set of terms with definitions in a 5-min time period (Espin et al., 2001). This task was first investigated when vocabulary was found to be a stronger predictor compared with oral reading fluency (i.e., number of words read correctly in 1 min) on a locating-information task (Espin & Foegen, 1996). This initial investigation resulted in a VM measure that accounted for more variance and showed higher correlations with content-area tasks, above and beyond maze (i.e., every seventh word in a passage is deleted and replaced with three answer choices) or oral reading measures. Espin and Foegen (1996) recommended VM measures as a stand-alone, efficient, and effective predictor of students’ likely success in the general education curriculum related to particular content areas.

Subsequent researchers have investigated the usefulness of VM as a means of monitoring progress in social studies (Beyers, Lembke, & Curs, 2013; Espin, Shin, & Busch, 2005; Lembke et al., 2017) as well as science (Espin et al., 2013). Alternate form reliability for VM in science has ranged from r = .64 to .84 (Ford et al., 2018; Espin et al., 2013). Criterion-related validity coefficients with state science tests have ranged from r = .64 to .66 (Ford et al., 2018; Espin et al., 2013). However, investigation of VM measures as a screening tool to predict future performance has been limited. As states have begun to include assessments of science in their battery of high-stakes assessments, the possibility of examining concurrent and predictive utility has increased.

SV-S

The SV-S measure is a more recent science CBM screening tool, developed in conjunction with the Iowa Core Science Standards and aligned to the NGSS (Hosp & Ford, 2014; Ford & Hosp, 2017; Lee et al., 2014). The SV-S measure is a modification of the Sentence Verification Technique (Royer, Hastings, & Hook, 1979). When completing a Sentence Verification task, students first read three to four 12-sentence passages and an associated set of test sentences. Then, students indicate whether the test sentences are related to the passages they read. When completing SV-S measures, students respond to a series of 60 “true” and “false” items for 3 min. Rather than items being related to a recently read passage, the items are based on vocabulary and key terms related to state science standards and the NGSS that students are expected to learn. Learning in content areas requires students to read and make sense of content-area material, obtain information from instruction, and retain obtained information (Espin & Foegen, 1996). Thus, students’ ability to accurately respond to SV-S items depends on knowledge they have obtained via science instruction.

While SV-S measures are relatively new, previous work has demonstrated that for Grade 7, alternate form reliability has ranged from r = .41 to .58 (Ford & Hosp, 2017) and for Grade 8 from r = .39 to .45 (Ford & Hosp, 2017; Ford et al., 2018). Criterion-related validity with state science tests has ranged from r = .11 to .24 in Grade 7 (Ford & Hosp, 2017) and r = .18 to .40 in Grade 8 (Ford & Hosp, 2017; Ford et al., 2018). Although promising evidence of the technical adequacy for SV-S has been reported, there is a need for additional investigation with larger and more diverse samples.

Pilot Study

With multiple formats of science CBM being developed, it is important to consider which type of task may be the most efficient and technically adequate way to collect information regarding students’ academic performance. Ford et al. (2018) investigated how VM and SV-S screening tools differed in alternate form reliability, criterion validity, and predictive utility with the same sample of 25 eighth-grade students. Results indicated that alternate form reliability was strong for VM (r = .73) but weak for SV-S (r = .40). When examining criterion validity, the VM measure was strongly correlated with the state standardized science assessment (r = .71), and again SV-S demonstrated a weaker relation (r = .40). The VM measure was a better predictor of student performance on the state standardized science assessment compared with the SV-S (z = 2.09, p = .04). However, this pilot study was limited by a small sample and an administration error with the SV-S measure.

Need for Replication of Science Screening Studies

Recently, there have been numerous reviews of replication studies in the field of special education (Cook, Collins, Cook, & Cook, 2016; Coyne, Cook, & Therrien, 2016; Lemons et al., 2016; Makel et al., 2016). While the majority of these studies focus on intervention, Hosp, Ford, Huddle, and Hensley (2018) discussed the importance of replication in measurement research in special education and school psychology. They note that if the tools used do not possess evidence of adequate reliability, validity, or utility, they cannot be used with confidence in schools and could potentially undermine the evidence base of intervention outcomes. Thus, replication is necessary in measurement development to establish adequate psychometric properties that contribute to the utility of the measure for a specific purpose. Due to the limited research and development of CBM tools in content areas, replication is necessary to demonstrate adequate reliability and validity of the VM and SV-S measures (Hosp et al., 2018; Mooney & Lastrapes, 2018).

According to Schmidt (2009), replications are classified as either direct or conceptual. Direct replication studies use the exact same methods, design, and sample for the purposes of examining accuracy of the original investigation. Conceptual replications allow for different samples, methods, designs, or analysis to further explore a construct or, in this case, the validity of a measure with a different population of students or criterion. Furthermore, Schmidt notes that replication studies can serve different purposes. For example, these studies can control for sampling error, lack of internal validity, and fraud. However, replications can also be used for generalizing results to larger populations or verifying hypotheses made in previous investigations. With these functions in mind, we sought to replicate Ford et al. (2018) with a larger sample size, adjusted development and administration procedures for SV-S, and a different state standardized science assessment for enhanced generalizability. This conceptual replication further extends the literature by also examining how well these measures predict science learning.

Predictive Utility and Classification Accuracy of Science CBM

A key feature in potential screening CBM tools for science is the ability to identify students at risk of learning difficulties in science due to deficits in vocabulary knowledge. Predictive utility has often been investigated for CBM in the areas of reading and mathematics (e.g., Gersten et al., 2012). In a review of mathematics CBM, Gersten and colleagues (2012) defined predictive utility as the screening tool’s ability to predict later performance. That is, if a student obtained a particular score on the measure, it may indicate the student would benefit from additional supports. The authors suggested that a screening tool’s ability to predict future performance is crucial to the development and use of such a tool in any academic area.

Investigations of CBM as a screening tool in content areas have examined predictive utility using correlational analysis (Mooney & Lastrapes, 2016, 2018; Ford et al., 2018). Screening tool charts produced by the Center on Response to Intervention (n.d.) indicate that a threshold of .70 is necessary to identify a measure as having “convincing evidence” of validity. Moreover, the Center on Response to Intervention indicates that the ability to accurately determine or classify which students are at risk is also a key feature of screening tools. In addition to correlational analysis, logistic regression and receiver operating characteristic (ROC) curve analyses, specifically the area under the curve (AUC) statistic, have more recently been used for predicting students’ performance using CBM in other areas such as reading and mathematics (Conoyer, Foegen, & Lembke, 2016; Johnson, Jenkins, Petscher, & Catts, 2009). The AUC is an overall index of diagnostic accuracy, ranging from .5 to 1, that takes both sensitivity and specificity into account. In this case, an AUC statistic of .5 suggests the accuracy of the screening measure is likely due to chance, whereas a score closer to 1 indicates that the screening measure is better able to correctly classify a pair of students as at-risk or not (Youngstrom, 2013). At this time, there has been limited investigation of the accuracy of content-area CBM in predicting future performance using these diagnostic analyses.
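The pairwise interpretation of the AUC described above can be illustrated with a minimal Python sketch. The function name `auc_pairwise` and the toy scores are hypothetical and are not data from any study cited here; the sketch also assumes that lower screening scores indicate greater risk and that tied pairs count as half.

```python
def auc_pairwise(at_risk_scores, not_at_risk_scores):
    """AUC as the share of (at-risk, not-at-risk) pairs ranked correctly.

    Assumption: lower screening scores signal greater risk; ties count as half.
    """
    concordant = 0.0
    for risk in at_risk_scores:
        for safe in not_at_risk_scores:
            if risk < safe:
                concordant += 1.0
            elif risk == safe:
                concordant += 0.5
    return concordant / (len(at_risk_scores) * len(not_at_risk_scores))


# Hypothetical toy scores, for illustration only.
at_risk = [6, 9, 11, 14]
not_at_risk = [10, 13, 15, 17, 19]
print(round(auc_pairwise(at_risk, not_at_risk), 2))  # 17 of 20 pairs -> 0.85
```

Under this reading, a value of .5 means the measure orders a randomly chosen at-risk and not-at-risk student correctly only half the time, which is what chance alone would produce.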

Present Study

The current study is a conceptual replication of the differences between VM and SV-S as screening tools for middle school students in science content knowledge with a large sample. Based on the results of Ford et al. (2018), we hypothesized that the VM measure would again produce strong coefficients and that, given the revisions to the SV-S measure and its similar content, SV-S would show stronger coefficients similar to those of VM. We also sought to expand this conceptual replication by exploring the predictive utility of each measure. The following research questions were addressed:

Research Question 1: What are the alternate-form reliability coefficients for VM and SV-S?

Research Question 2: What are the criterion-related validity coefficients with a state standardized science assessment for VM and SV-S?

Research Question 3: What is the predictive utility of science CBM screening tools (VM and SV-S) for identifying middle school students at risk?

Method

Participants

Four middle schools (Grades 7-9) in the Northwest United States agreed to participate. We chose seventh-grade students as participants given that this is the only grade in these schools where the ISAT Science Assessment (Idaho State Department of Education [SDE], 2017) is administered. After parent consent and student assent, a total of 205 students with full datasets were included in our sample. Our sample was mostly female and White (55.1% and 85.4%, respectively). Students with an Individualized Education Program or Section 504 plan accounted for 15.6% of our sample; 2.9% of students were identified as having Limited English Proficiency. For additional demographic information, see Table 1.

Table 1.

Student Demographics.

	Students (n = 205)
	n (%)
Gender
Female	113 (55.1)
Male	92 (44.9)
Ethnicity
American Indian or Alaska Native	1 (0.5)
Asian	5 (2.4)
Black or African American	4 (2.0)
Two or more races	3 (1.5)
Hispanic or Latino	16 (7.8)
Native Hawaiian or Other Pacific Islander	1 (0.5)
White	175 (85.4)
Limited English Proficient	6 (2.9)
IEP or Section 504 plan	32 (15.6)

Note. IEP = Individualized Education Program.

Measures

SV-S

Similar to the pilot study (Ford et al., 2018), the SV-S items identified with the highest discrimination values from previous analysis were used to create two alternate SV-S forms (A and B). Each SV-S form included 60 items total as previous researchers suggest this is the optimal number for a 3-min administration (Hosp & Ford, 2014). To keep discrimination values comparable across the two SV-S forms, items were assigned in an alternating fashion (e.g., the item with the highest discrimination value was included on Form A and the next highest on Form B, etc.). Once created, each form was reviewed and edited to ensure similar items were not on the same form. As a result, SV-S Form A had 60 items with an average discrimination value of 1.088, and SV-S Form B had 60 items with an average discrimination value of 1.104. Furthermore, the resulting difficulty values across SV-S forms were comparable as well (−0.37 for both SV-S Forms A and B). Using standardized directions, students were given 3 min to read each statement silently, determine if the statement was “true” or “false,” and then fill in the corresponding circle in the “yes” or “no” column. A raw score of the total number of correctly identified statements for each probe and the mean of the two forms were calculated.
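The alternating assignment of items to forms can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the item pool are hypothetical, the discrimination values are invented, and the subsequent manual review that moved similar items onto different forms is not modeled.

```python
def split_alternating(items):
    """Assign items to two forms by alternating down the ranked list.

    `items` is a list of (item_id, discrimination) pairs.
    """
    ranked = sorted(items, key=lambda item: item[1], reverse=True)
    form_a = ranked[0::2]  # 1st, 3rd, 5th ... highest discrimination
    form_b = ranked[1::2]  # 2nd, 4th, 6th ...
    return form_a, form_b


def mean_discrimination(form):
    """Average discrimination value across the items on one form."""
    return sum(value for _, value in form) / len(form)


# Hypothetical item pool; the values are invented for illustration.
pool = [("q1", 1.42), ("q2", 0.95), ("q3", 1.10), ("q4", 1.31),
        ("q5", 0.88), ("q6", 1.05)]
form_a, form_b = split_alternating(pool)
print([item_id for item_id, _ in form_a], round(mean_discrimination(form_a), 3))
print([item_id for item_id, _ in form_b], round(mean_discrimination(form_b), 3))
```

Strict alternation always gives the first form a mean at least as high as the second, so the slightly higher Form B average reported above presumably reflects the later review and editing step.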

VM

For VM, the first 20 statements from each SV-S form were translated into two corresponding VM forms with similar language from the SV-S statement used to create a VM definition for each vocabulary word. This allowed the content of SV-S and VM forms to be similar with only the format of the probe being presented differently. In addition, the construction of VM was also changed from traditional procedures (Espin et al., 2001). Instead of items being developed based on the current classroom curriculum, items were based on standards students are expected to learn. Despite differences in item development, formatting and administration procedures outlined in previous studies (e.g., Espin et al., 2001) for VM were implemented. Thus, each VM form consisted of 20 terms listed alphabetically on the left side of the page, and 22 definitions (including two distracters) listed in random order on the right side of the page (Espin et al., 2001). Standardized directions were provided to students to read the probe silently for 5 min and match the words with the definitions by writing the letter of the correct definition in the blank next to each word, with two definitions not used. A raw score of the total number of correct matches for each probe and the mean of the two forms were calculated.

Idaho Standards Achievement Test (ISAT) Science Assessment

The ISAT Science Assessment served as the criterion measure of overall science skills. It is a multiple-choice, fixed form (i.e., all students are administered the same items in the same order) test administered to students in Grades 5 and 7 in the spring (SDE, 2017). The untimed test is administered for approximately 1 hr and 30 min with a provided guideline that professional judgment be used to determine if students are actively engaged and should be provided additional time. The ISAT Science Assessment measures students’ knowledge in five categories: Nature of Science, Physical Science, Biology, Earth and Space Systems, and Personal and Social Perspectives—Technology. Students obtain a Scale Score, with a score of 213 indicating proficiency for seventh-grade students (SDE, 2017).

Procedures

Previously developed standardized procedures for administration and scoring were used for VM and SV-S measures. These CBM tools were administered by the second author within 2 weeks of schools administering the ISAT Science Assessment. All CBM forms were group administered to students during their science class at each of the four participating middle schools, with a range of five to 15 classes at each school. Students received a packet with two SV-S forms and two VM forms. Forms were counter-balanced across class periods by teacher and across schools. Two graduate researchers were trained to competence by the fifth author to score each of the CBM probes. Raw scores were then entered into a spreadsheet. All scores were double entered, and any disagreements were discussed and resolved. Ten of the students’ assessments were pulled at random for scoring reliability, equaling 29% of total CBM tasks being scored twice. Scoring reliability for SV-S was 99%, and VM was 100%.

Data Analysis

Descriptive statistics were calculated prior to examining the research questions, including average VM, SV-S, and ISAT performance. To examine alternate form reliability, bivariate correlations were calculated across Forms A and B for VM and SV-S, and the results were compared with standards set by Marston (1989) for CBM research (r ⩾ .70 = strong; r = .50 to .69 = moderate; and r < .50 = weak). The second research question, concerning evidence of criterion-related validity, was examined via Pearson product-moment correlations between the ISAT and the mean scores from VM and SV-S, with the Marston criteria applied.
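As a rough illustration of this analysis, the following Python sketch computes a Pearson correlation between two alternate forms and labels it with the thresholds quoted above. The function names and the toy scores are hypothetical, and treating coefficients from .50 up to .70 as moderate is an assumption made here to close the gaps between the stated bands.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def marston_label(r):
    """Label a coefficient using the thresholds quoted in the text.

    Assumption: values from .50 up to (but not including) .70 are moderate.
    """
    if r >= 0.70:
        return "strong"
    if r >= 0.50:
        return "moderate"
    return "weak"


# Hypothetical scores for six students on two alternate forms.
scores_form_a = [12, 15, 9, 20, 17, 11]
scores_form_b = [14, 16, 10, 19, 18, 9]
r = pearson_r(scores_form_a, scores_form_b)
print(round(r, 2), marston_label(r))
```

The same two functions apply to the criterion-related validity question by replacing the second list with ISAT scale scores.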

Predictive utility was examined using logistic regression and ROC curve analysis. Students scoring below 213 on the ISAT (i.e., below proficient) were coded as 1 (at-risk, n = 85) and those scoring 213 or above (i.e., proficient to advanced) were coded as 0 (no risk, n = 120). AUC values were interpreted using the following guidelines set by the Center on Response to Intervention (n.d.) specifically for screening tools: below .80 demonstrates “unconvincing evidence,” between .80 and .90 suggests “partially convincing evidence,” and greater than .90 demonstrates “convincing evidence.” We used ROC curve analysis to estimate diagnostic characteristics for each measure based on 90% sensitivity, per previous studies (Conoyer et al., 2016; Johnson et al., 2009).
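The coding of at-risk status and the search for a cut score near 90% sensitivity can be sketched as follows. This is a simplified stand-in for a full ROC analysis, not the procedure used in the study: the function names and toy scores are hypothetical, students scoring at or below the cut are assumed to be flagged, and the search returns the lowest cut whose sensitivity reaches the target rather than the one closest to it.

```python
PROFICIENCY_CUT = 213  # ISAT scale score for seventh-grade proficiency


def code_at_risk(isat_scores):
    """Return 1 (at-risk) for scores below 213 and 0 (no risk) otherwise."""
    return [1 if score < PROFICIENCY_CUT else 0 for score in isat_scores]


def sensitivity_specificity(cbm_scores, at_risk, cut):
    """Flag students scoring at or below `cut` and compare with true status."""
    pairs = list(zip(cbm_scores, at_risk))
    tp = sum(1 for s, y in pairs if y == 1 and s <= cut)
    fn = sum(1 for s, y in pairs if y == 1 and s > cut)
    tn = sum(1 for s, y in pairs if y == 0 and s > cut)
    fp = sum(1 for s, y in pairs if y == 0 and s <= cut)
    return tp / (tp + fn), tn / (tn + fp)


def lowest_cut_for_sensitivity(cbm_scores, at_risk, target=0.90):
    """Smallest observed cut score whose sensitivity reaches the target."""
    for cut in sorted(set(cbm_scores)):
        sens, spec = sensitivity_specificity(cbm_scores, at_risk, cut)
        if sens >= target:
            return cut, sens, spec
    return None


# Hypothetical toy data for ten students, for illustration only.
isat = [200, 205, 208, 210, 212, 213, 215, 220, 225, 230]
cbm = [5, 8, 7, 12, 15, 11, 14, 16, 18, 19]
status = code_at_risk(isat)
print(lowest_cut_for_sensitivity(cbm, status))
```

Raising the cut score to capture more at-risk students necessarily flags more proficient students as well, which is the sensitivity and specificity trade-off that the ROC curve traces.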

Results

A review of the descriptive statistics indicated that the ranges for the SV-S measures did not suggest floor or ceiling effects. However, small floor effects (approximately 2% and 3%) and ceiling effects (approximately 3% and 23%) were found for VM Forms A and B, respectively. The skewness and kurtosis for the correct responses on both versions of the SV-S and VM forms ranged from −0.26 to 0.51 and from −1.19 to 0.59, respectively. To meet the normality assumption, the absolute values of skewness and kurtosis should fall within 2 (Tabachnick & Fidell, 2012). See Table 2 for the complete list of descriptive statistics.

Table 2.

Descriptive Statistics for All Measures.

Measure	N	Minimum	Maximum	M	SD	Skewness	Kurtosis	Floor n (%)	Ceiling n (%)
SV-S Form A	205	4.0	57	24.53	9.03	0.51	.47	0	0
SV-S Form B	205	5	58	25.90	9.89	0.60	.59	0	0
SV-S Mean Score	205	4.5	53	25.22	8.49	0.43	.54
VM Form A	205	0	20	12.52	4.86	−0.68	−.21	4 (1.95)	6 (2.93)
VM Form B	205	0	20	15.09	5.30	−1.20	.58	6 (2.93)	47 (22.93)
VM Mean Score	205	0	20	13.81	4.75	−1.08	.46
ISAT	205	193	247	214.98	9.99	0.251	−.03

Note. SV-S = Statement Verification for Science, VM = Vocabulary-Matching, ISAT = Idaho Standards Achievement Test.
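The floor and ceiling percentages and the normality screen can be reproduced from the tabled values with a short Python sketch. The helper names are hypothetical; the counts, skewness values, and kurtosis values are copied from Table 2, and the limit of 2 is the guideline cited in the text.

```python
N_STUDENTS = 205  # sample size reported in Table 2


def percent_of_sample(count, total=N_STUDENTS):
    """Express a count as a percentage of the sample, to two decimals."""
    return round(100 * count / total, 2)


def within_normality_guideline(skewness, kurtosis, limit=2.0):
    """True when both absolute values fall within the guideline limit."""
    return abs(skewness) <= limit and abs(kurtosis) <= limit


# (floor n, ceiling n) for each VM form, copied from Table 2.
vm_counts = {"VM Form A": (4, 6), "VM Form B": (6, 47)}
for form, (floor_n, ceiling_n) in vm_counts.items():
    print(form, percent_of_sample(floor_n), percent_of_sample(ceiling_n))

# (skewness, kurtosis) for each form, copied from Table 2.
shape = {
    "SV-S Form A": (0.51, 0.47),
    "SV-S Form B": (0.60, 0.59),
    "VM Form A": (-0.68, -0.21),
    "VM Form B": (-1.20, 0.58),
}
print(all(within_normality_guideline(s, k) for s, k in shape.values()))
```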

Alternate-Form Reliability

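To address our first research question, alternate-form reliability was calculated for each measure. The coefficient for VM was strong (r = .75, p ⩽ .01), whereas the coefficient for SV-S was moderate (r = .61, p ⩽ .01). There was a moderate relation between the means of VM and SV-S (r = .65, p ⩽ .01).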

Criterion-Related Validity

To address our second research question, criterion validity was calculated by correlating each measure with the ISAT Science Assessment standard score. Results indicated a strong, positive relation between the mean of VM and the ISAT (r = .74, p ⩽ .01) and a moderate, positive relation between the mean of SV-S and the ISAT (r = .58, p ⩽ .01).

Predictive Utility

To examine the accuracy of the VM and SV-S measures in predicting proficiency on the ISAT science assessment, we report the results of the logistic regression and ROC curve analysis in Table 3. For each measure, the first row reports results from a typical logistic regression analysis. The second row reports the cut score that yields approximately 90% sensitivity, along with the associated specificity, AUC statistic, and resulting classification accuracy.

Table 3.

Classification Indices for SV-S and VM Measures With the State Standardized Science Score.

Measure	N	Sensitivity (%)	Specificity (%)	Cut score	ROC AUC	95% CI	TP	FP	TN	FN	Classification accuracy
SV-S	205	58	82	22			49	22	98	36	72%
SV-S		90	30	31	.78	[.72, .85]	77	84	36	8	55%
VM	205	74	91	13			63	10	110	22	84%
VM		90	60	16	.90	[.85, .94]	77	48	72	8	73%

Note. The first row for each measure reports results using the logistic regression; the second reports ROC analysis results when sensitivity was set as close to 90% as possible. SV-S = Statement Verification for Science; VM = Vocabulary-Matching; ROC = receiver operating characteristic; AUC = area under the curve; CI = confidence interval; TP = true positives; FP = false positives; TN = true negatives; FN = false negatives.
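The indices in Table 3 follow directly from the four cell counts, as the following Python sketch shows. The function name is hypothetical; the counts are copied from Table 3. The computed accuracy values match the table after rounding, while sensitivity and specificity agree with the tabled values to within about one percentage point.

```python
def classification_indices(tp, fp, tn, fn):
    """Sensitivity, specificity, and overall accuracy from a 2 x 2 table."""
    sensitivity = tp / (tp + fn)  # at-risk students correctly flagged
    specificity = tn / (tn + fp)  # not-at-risk students correctly cleared
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy


# (TP, FP, TN, FN) for each row, copied from Table 3.
rows = {
    "SV-S, logistic regression": (49, 22, 98, 36),
    "SV-S, 90% sensitivity": (77, 84, 36, 8),
    "VM, logistic regression": (63, 10, 110, 22),
    "VM, 90% sensitivity": (77, 48, 72, 8),
}
for label, counts in rows.items():
    sens, spec, acc = classification_indices(*counts)
    print(f"{label}: sensitivity={sens:.1%}, "
          f"specificity={spec:.1%}, accuracy={acc:.1%}")
```

In every row, true positives and false negatives sum to the 85 at-risk students, and false positives and true negatives sum to the 120 students coded as no risk.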

The classification rate for SV-S from the ROC curve analysis with sensitivity set to 90% (55%) was lower than that obtained with logistic regression (72%). SV-S demonstrated an acceptable AUC statistic (.78), with a 95% confidence interval ranging from .72 to .85. The VM measure showed a similar, but less extreme, reduction in classification accuracy when sensitivity was set to 90%: the logistic regression classification accuracy rate for VM was 84%, whereas the ROC curve analysis rate was 73%. The VM measure demonstrated an outstanding AUC statistic (.90), with a 95% confidence interval ranging from .85 to .94.

Discussion

This study replicated a pilot investigation that examined the differences in technical adequacy between two similar screening CBM tools. The tools were compared on alternate form reliability, criterion validity, and classification accuracy. One of the main goals of the study was to compare the differential pattern between the two similar measures in the areas of alternate form reliability and concurrent criterion validity. For our first research question, we examined the evidence of alternate form reliability for each measure. A strong coefficient (.75) was found for VM measures, similar to previous studies in science (Espin et al., 2013) as well as social studies (Beyers et al., 2013; Espin et al., 2001; Lembke et al., 2017). A moderate coefficient (.61) was found for the SV-S forms, contrasting previous work demonstrating coefficients between .39 and .58 (Ford & Hosp, 2017). This pattern of the VM measure outperforming the SV-S measure is similar to results found in the Ford et al. (2018) study; however, the SV-S measure produced stronger coefficients with this larger sample. We attribute this performance to the increase in items included on the form and the appropriate administration time (in the initial study, SV-S administration was shortened). An interesting finding is that when correlating the mean score of the VM and SV-S forms only, a moderate relationship was found (.65). This may imply that while the items are measuring similar constructs, identifying science content knowledge requires different skills beyond knowing only content vocabulary.

Our second research question examined the evidence of concurrent criterion validity between the CBM tools and the ISAT. Once again, the VM measure demonstrated a strong relationship with the ISAT (.74). The results are similar to previous studies that compared VM measures with state standardized science assessment scores (Espin et al., 2013). The SV-S measure demonstrated a moderate relation to the ISAT (.58). This pattern of VM measures producing higher coefficients compared with SV-S measures is similar to results found in the Ford et al. (2018) study; however, SV-S produced stronger coefficients with the ISAT than it did with a different state science assessment. This is in contrast to previous studies that have shown criterion validity coefficients ranging from .11 to .40 in middle school grades when comparing SV-S measures with the state science assessment (Ford & Hosp, 2017; Ford et al., 2018). We attribute these findings to comparing SV-S with a different state science assessment, appropriate administration procedures, and having 60 items on the SV-S form.

Our third research question addressed the predictive utility of the VM and SV-S tools for the state standardized science assessment. As the majority of previous work in this area has relied on correlational analysis, we were concerned with how accurately these measures predict whether a student will pass or fail a high-stakes test in science. The results suggested that VM measures were able to classify students more accurately compared with SV-S measures. With this particular sample of students and the ISAT, the SV-S measure performed at an unconvincing level, whereas the VM measure performed at a convincing level, according to the criteria established by the Center on Response to Intervention (n.d.). As expected based on previous studies, the classification accuracy percentage declined when the sensitivity was set to 90% and the cut scores increased to potentially identify more students as being at-risk. The number of false positives increases dramatically when sensitivity is set at 90%, and providing intervention to students who may not need it becomes an issue (Conoyer et al., 2016; Johnson et al., 2009).

Limitations

Limitations related to this study included a narrow sample, survey instrumentation lacking validation, and lack of generalization to other state tests. Our sample consisted of students from one school district in one state and thus represents only the area in which the study was conducted. As a result, the sample lacks diversity in race, learning, and linguistic ability, which limits the generalizability of our results to other populations. Finally, while the ISAT science assessment has been endorsed for measuring skills and concepts in science, it is unlikely that it will be similar to other state science assessments. Despite many states aligning to more national standards such as NGSS, generalizing the results of our findings to other state standardized science measures would not be appropriate at this time.

Practical Implications and Future Directions

Overall, the VM measure was a stronger predictor of performance on the ISAT science assessment compared with the SV-S measure. This is an interesting finding given that the lack of adequate validity and reliability of SV-S measures in the pilot study was mostly attributed to administration and instrumentation errors and the small sample size. Even when these confounds were remediated in the replication, the results indicated that VM measures produced higher reliability and validity coefficients. Previous studies have shown that VM measures have outperformed other reading tasks such as oral reading fluency and maze (Espin & Foegen, 1996) as well as sentence verification (Mooney, Lastrapes, Marcotte, & Matthews, 2016) for students at the middle school level in the areas of reading and science. One possible explanation for this outcome is that some type of reading comprehension skill may be a moderating or mediating variable for VM, SV-S, or both measures. Perhaps asking students to identify the meaning of a word (i.e., VM) is more aligned with learning of content-area vocabulary than having to read and comprehend an entire sentence to indicate if the items are “true” or “false” (i.e., SV-S). It is possible that SV-S items are acting as more of a reading comprehension measure than a content knowledge indicator. This would align with the fact that SV-S measures are modeled on the Sentence Verification Technique, which is a reading comprehension task (Ford et al., 2018).

However, prior to suggesting that VM measures be implemented for screening purposes in schools, more research is necessary to address measurement development. First, we would suggest a different approach to determining the sample of items for VM measures from standards such as NGSS and Common Core. This may assist in creating a more robust indicator of performance. If schools are ultimately using screening data to identify students at risk of not meeting proficiency on state assessments that are more aligned to national standards (e.g., NGSS), then perhaps this method of development would be more beneficial than pulling key vocabulary from curriculum, textbooks, or teacher notes.

We also suggest critical examination of these items through a formal analysis of grade-level terms and construct validity using item response theory coupled with enhanced examination of content validity by conducting expert reviews of items in the field of science education. Obtaining content validity coefficients from expert reviewer agreement would also provide further evidence that the screening measures are capturing science constructs (Cohen & Swerdlik, 2018). Additionally, using these measures as part of an intervention study that employs them as pre–post test measures could demonstrate additional validity evidence. For example, if students show growth from pretest to posttest after being taught the construct (i.e., science vocabulary), this may also provide evidence of good construct validity (Dimitrov & Rumrill, 2003).

Finally, it may also be important to determine how to present more items to students in a VM format while addressing limitations of the measure such as ceiling effects, overwhelming lists of words, and the use of process of elimination and/or guessing. This could be accomplished by using computer-assisted or even computer-adaptive technology. While previous technology-based measures have been examined for measuring student progress in acquiring science content, they have either not incorporated a VM task or have not investigated the technical adequacy of such measures for screening purposes (Marino & Beecher, 2010; Vannest, Parker, & Dyer, 2011).
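
To make the computer-adaptive idea concrete, the following minimal sketch selects the next VM item by maximizing item information under a one-parameter (Rasch) item response model. The Rasch model, the item difficulties, the vocabulary terms, and the function names are all assumptions introduced for illustration; no adaptive VM measure of this kind was developed or tested in the present study.

```python
import math


def rasch_probability(theta: float, difficulty: float) -> float:
    """Probability of a correct match under the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def item_information(theta: float, difficulty: float) -> float:
    """Fisher information of a Rasch item at ability theta: p * (1 - p)."""
    p = rasch_probability(theta, difficulty)
    return p * (1.0 - p)


def select_next_item(theta, item_bank, administered):
    """Return the unadministered term that is most informative at theta,
    or None when every item in the bank has already been given."""
    candidates = {
        term: b for term, b in item_bank.items() if term not in administered
    }
    if not candidates:
        return None
    return max(candidates, key=lambda term: item_information(theta, candidates[term]))


# Hypothetical item bank: term -> difficulty on the logit scale.
item_bank = {
    "cell": -1.5,
    "ecosystem": -0.5,
    "mitosis": 0.4,
    "stoichiometry": 1.8,
}

# A student with a provisional ability estimate of 0.5 who has already
# seen "cell" would next receive the item whose difficulty is closest to 0.5.
next_item = select_next_item(0.5, item_bank, administered={"cell"})
print(next_item)
```

Because each student would see only the items closest to his or her estimated ability, an approach like this could, in principle, shorten word lists and reduce ceiling effects, although that would need to be evaluated empirically.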

Conclusion

We sought to extend the research in science CBM by replicating a previous pilot study that examined differences between the VM and SV-S measures with middle school students. While stronger technical adequacy and better prediction of state test proficiency were found for VM measures, future research is necessary to critically examine content-area CBM development and construct validity and to incorporate computer-assisted administration formats.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Beyers, S. J., Lembke, E. S., & Curs (2013). Social studies progress monitoring and intervention for middle school students. Assessment for Effective Intervention, 38, 224-235. doi:10.1177/1534508413489162

Bravo, M. A., & Cervetti, G. N. (2008). Teaching vocabulary through text and experience in content areas. In A. E. Farstrup & S. J. Samuels (Eds.), What research has to say about vocabulary instruction (pp. 130-149). Newark, DE: International Reading Association.

Busch, T. W., & Espin, C. A. (2003). Using curriculum-based measurement to prevent failure and assess learning in the content areas. Assessment for Effective Intervention, 28(3&4), 49-58.

Center on Response to Intervention. (n.d.). Screening tools chart rating system. Retrieved from https://rti4success.org/resources/tools-charts/screening-tools-chart/screening-tools-chart-rating-system

Cohen, R. J., & Swerdlik, M. E. (2018). Psychological testing and assessment: An introduction to tests & measurement (9th ed.). New York, NY: McGraw-Hill.

Conoyer, S. J., Foegen, & Lembke, E. S. (2016). Early numeracy indicator: Examining predictive utility across years and states. Remedial and Special Education, 37, 159-171. doi:10.1177/0741932515619758

Cook, Collins, & Cook (2016). A replication by any other name: A systematic review of replicative intervention studies. Remedial and Special Education, 37, 223-234.

Coyne, M. D., Cook, B. G., & Therrien, W. J. (2016). Recommendations for replication research in special education: A framework of systematic, conceptual replications. Remedial and Special Education, 37, 244-253. doi:10.1177/0741932516648463

Dimitrov, D. M., & Rumrill, P. D., Jr. (2003). Pretest-posttest designs and measurement of change. Work, 20, 159-165.

Espin, C. A., Busch, Lembke, E. S., Hampton, D. D., Seo, & Zukowski, B. A. (2013). Curriculum-based measurement in science learning: Vocabulary-matching as an indicator of performance and progress. Assessment for Effective Intervention, 38, 203-213.

Espin, C. A., Busch, Shin, & Kruschwitz (2001). Curriculum-based measures in the content areas: Validity of vocabulary-matching measures as indicators of performance in social studies. Learning Disabilities Research and Practice, 16, 142-151. doi:10.1111/0938-8982.00015

Espin, C. A., & Foegen (1996). Validity of general outcome measures for predicting secondary students’ performance on content-area tasks. Exceptional Children, 62, 497-514.

Espin, C. A., Shin, & Busch, T. W. (2005). Curriculum-based measurement in the content areas: Vocabulary matching as an indicator of progress in social studies learning. Journal of Learning Disabilities, 38, 353-363.

Fang (2006). The language demands of science reading in middle school. International Journal of Science Education, 28, 491-520.

Ford, J. W., Conoyer, S. J., Lembke, E. S., Smith, R. A., & Hosp, J. L. (2017). A comparison of two content area curriculum-based measurement tools. Assessment for Effective Intervention, 43, 121-127. doi:10.1177/1534508417736753

Ford, J. W., & Hosp, J. L. (2017). Statement verification for science: Theory and examining technical adequacy of alternate forms. Exceptionality.

Gersten, Clarke, Jordan, N. C., Newman-Gonchar, Haymond, & Wilkins (2012). Universal screening in mathematics for the primary grades: Beginnings of a research base. Exceptional Children, 78, 423-445.

Glover, T. A., & Albers, C. A. (2007). Considerations for evaluating universal screening assessments. Journal of School Psychology, 45, 117-135.

Hosp, J. L., & Ford, J. W. (2014). Investigating the relation of a modified sentence verification task of science knowledge with the science test of the Iowa Assessments (Internal Iowa Measurement Research Foundation report, unpublished).

Hosp, J. L., Ford, J. W., Huddle, S. M., & Hensley, K. K. (2018). The importance of replication in measurement research: Using curriculum-based measures with postsecondary students with developmental disabilities. Assessment for Effective Intervention, 43, 96-109. doi:10.1177/1534508417727489

Hosp, M. K., Hosp, J. L., & Howell, K. W. (2016). The ABCs of CBM: A practical guide to curriculum-based measurement (2nd ed.). New York, NY: Guilford.

Idaho State Department of Education. (2017). ISAT assessments online test administration manual English language arts/literacy, mathematics and Science. Retrieved from http://idaho.portal.airast.org/wp-content/uploads/FINAL-2016-17-Idaho-Test-Administration-Manual_v3.pdf

Jenkins, J. R., Hudson, R. F., & Johnson, E. S. (2007). Screening for at-risk readers in a response to intervention framework. School Psychology Review, 36, 582-600.

Johnson, E. S., Jenkins, J. R., Petscher, & Catts, H. W. (2009). How can we improve the accuracy of screening instruments? Learning Disabilities Research & Practice, 24, 174-185.

Lee, Miller, E. C., & Januszyk (2014). Next generation science standards: All standards, all students. Journal of Science Teacher Education, 25, 223-233.

Lembke, E. S., Allen, Cohen, Hubbuch, Landon, Bess, & Bruns (2017). Progress monitoring in social studies using vocabulary matching curriculum-based measurement. Learning Disabilities Research & Practice, 32, 112-120. doi:10.1111/ldrp.12130

Lemons, C. J., King, S. A., Davidson, K. A., Berryessa, T. L., Gajjar, S. A., & Sacks, L. H. (2016). An inadvertent concurrent replication: Same roadmap, different journey. Remedial and Special Education, 37, 213-222.

Makel, M. C., Plucker, J. A., Freeman, Lombardi, Simonsen, & Coyne (2016). Replication of special education research: Necessary but far too rare. Remedial and Special Education, 37, 205-212.

Marino, M. T., & Beecher, C. C. (2010). Conceptualizing RTI in 21st-century secondary science classrooms: Video games’ potential to provide tiered support and progress monitoring for students with learning disabilities. Learning Disability Quarterly, 33, 299-311.

Marston, D. B. (1989). A curriculum-based measurement approach to assessing academic performance: What it is and why do it. In M. R. Shinn (Ed.), Curriculum-based measurement: Assessing special children (pp. 18-78). New York, NY: Guilford Press.

Mooney, & Lastrapes, R. E. (2016). The benchmarking capacity of a general outcome measure of academic language in science and social studies. Assessment for Effective Intervention, 41, 209-219. doi:10.1177/1534508415624648

Mooney, & Lastrapes, R. E. (2018). Replicating criterion validity in science content for the combination of critical content monitoring and sentence verification technique. Assessment for Effective Intervention. Advance online publication. doi:10.1177/1534508418758362

Mooney, Lastrapes, R. E., Marcotte, A. M., & Matthews (2016). Validity of two general outcome measures of science and social studies achievement. Specialusis Ugdymas/Special Education, 34, 145-188.

National Center for Educational Statistics. (2016). The nation’s report card: 2015 science at grades 4, 8 and 12. Washington, DC: Institute of Education Sciences, U.S. Department of Education.

Royer, J. M., Hastings, N. C., & Hook (1979). A sentence verification technique for measuring reading comprehension. Journal of Reading Behavior, 11, 355-363.

Schmidt (2009). Shall we really do it again? The powerful concept of replication is neglected in the social sciences. Review of General Psychology, 13, 90-100. doi:10.1037/a0015108

Tabachnick, & Fidell (2012). Using multivariate statistics (6th ed.). Boston, MA: Pearson.

Therrien, W. J., Benson, S. K., Hughes, C. A., & Morris, J. R. (2017). Explicit instruction and Next Generation Science Standards aligned classrooms: A fit or a split? Learning Disabilities Research & Practice, 32, 149-154.

Vannest, K. J., Parker, & Dyer (2011). Progress monitoring in grade 5 science for low achievers. The Journal of Special Education, 44, 221-233.

Youngstrom, E. A. (2013). A primer on receiver operating characteristic analysis and diagnostic efficiency statistics for pediatric psychology: We are ready to ROC. Journal of Pediatric Psychology, 39, 204-221.