Abstract
In the present study, two types of curriculum-based measurement (CBM) tools in science, Vocabulary Matching (VM) and Statement Verification for Science (SV-S), a modified Sentence Verification Technique, were compared. Specifically, this study aimed to determine whether the format of information presented (i.e., SV-S vs. VM) produces differences in alternate form reliability and validity of scores or any differences in accuracy of prediction of scores on the state standardized science assessment. Overall, 25 eighth-grade science students were administered two SV-S and two VM forms with identical items along with spring eighth-grade maze passages from Aimsweb. Students had recently taken the eighth-grade state science test. Results regarding technical adequacy for each CBM tool were consistent with past findings. However, this study extends the literature base on CBM tools in science by providing evidence for using standards to develop VM forms. In addition, despite probable ceiling effects, additional evidence was found for the potential of SV-S as a CBM tool in science.
Vocabulary is identified as a critical element in basic reading skills, and its interconnectedness with comprehension can have a substantial impact on the learning of secondary students in content areas (Busch & Espin, 2003). The importance of having a strong vocabulary is especially true for students in the area of science, as vocabulary knowledge impacts the development of science conceptual knowledge and engagement in science and the inquiry process (Espin et al., 2013). Furthermore, content areas require students to learn a wide range of new vocabulary within the academic language of the content (Mooney & Lastrapes, 2016). Measures that not only address reading comprehension but also content area knowledge are necessary to allow educators to make data-based decisions in areas such as science. Thus, the screening and monitoring of student progress in vocabulary or academic language in science may be beneficial to educators.
Curriculum-based measurement (CBM; Deno, 1985) has been widely accepted as a valid and reliable technology for assisting educators with making data-based screening (i.e., identifying students at risk of academic difficulties) and progress decisions (i.e., measuring growth over time) in reading, mathematics, and writing (M. K. Hosp, Hosp, & Howell, 2016). CBM tools in content areas such as social studies (e.g., Beyers, Lembke, & Curs, 2013; Espin, Busch, Shin, & Kruschwitz, 2001; Mooney, McCarter, Russo, & Blackwood, 2013), and more recently science (e.g., Borsuk, 2010; Espin et al., 2013; Ford & Hosp, in press; J. L. Hosp & Ford, 2014; Mooney & Lastrapes, 2016; Mooney, Lastrapes, Marcotte, & Matthews, 2016) have also increasingly been examined.
The aforementioned studies mostly emphasized Stage 1 CBM research (Fuchs, 2004) by focusing on technical adequacy development of the static score for making screening decisions, while others also addressed Stage 2 CBM research focusing on the technical adequacy of the slope of scores for making progress decisions (typically in response to intervention). Those studies examining Stage 1 CBM research in science mostly focused on Vocabulary Matching (VM). The Sentence Verification Technique (SVT) has also been examined as a tool for measuring students’ science content knowledge (Denton et al., 2011; Marcotte & Hintze, 2009). We describe these tools below.
VM in Science
VM is a 5-min task that requires students to pair a set of definitions with their appropriate terms from content area materials (see Espin et al., 2001). VM was developed based on the association of comprehension and vocabulary knowledge (Busch & Espin, 2003) and is specifically used for detecting students who demonstrate difficulties in acquiring content area vocabulary (Mooney & Lastrapes, 2016). Previous studies have provided evidence that VM produces reliable scores that predict performance in content areas, specifically in middle school science (Borsuk, 2010; Espin et al., 2013; Mooney & Lastrapes, 2016). These studies report alternate form reliability coefficients ranging from r = .64 to .84 (Espin et al., 2013) and validity correlations with standardized outcome measures ranging from r = .47 to .76 (Espin et al., 2013; Mooney & Lastrapes, 2016). In a limited number of studies where growth was examined, rates have ranged from .26 to .63 vocabulary matches per week (Borsuk, 2010; Espin et al., 2013).
Although VM is the most thoroughly researched CBM tool in content areas thus far, there are some limitations for its use as it is currently developed. First, the sampling of items from textbook glossaries can be burdensome, and previous studies have suggested that alternative methods of development may also need to be explored (Mooney et al., 2013). In addition, one’s ability to match vocabulary words with their proper definition does not necessarily reflect an understanding of content knowledge (i.e., a student may complete VM using the process of elimination). Thus, using VM as a proxy for measuring students’ science content knowledge may be of concern. In addition, the time students are allowed to complete VM, given the number of items included in the task, may result in ceiling effects (i.e., many students answer all items in the time they are given) so that a fluency score cannot be calculated.
SVT
The SVT is a measurement tool that requires students to read three to four short 12-sentence passages as well as a related set of test sentences. Two of the test sentences accurately reflect the passages, with one being directly taken and the other being slightly modified while maintaining meaning. The other two test sentences do not accurately reflect the passages, with one serving as a complete distracter and the other being only slightly modified as to change meaning. Students read the passages and determine whether the test sentences maintain the same meaning by indicating “yes” or “no.” SVT has been investigated as a tool for assisting teachers with making formative decisions in reading comprehension, with promising results (Marcotte & Hintze, 2009). Given its design, SVT may address the potential issue of students matching vocabulary words to definitions without adequate understanding of content when completing VM (due to the need to comprehend read passages of content and respond to items in SVT). Marcotte and Hintze (2009) found SVT to have a moderate, positive relation with Reading CBM (R-CBM; i.e., oral reading fluency; r = .57) and The Group Reading Assessment and Diagnostic Evaluation (GRADE; Williams, 2001; r = .59). Marcotte and Hintze (2009) also found SVT, oral reading fluency, and maze to account for approximately 40% of student performance on the GRADE (Williams, 2001). In addition, Denton et al. (2011) found the accuracy of SVT for making screening decisions about students’ overall reading skills to be congruent with oral reading fluency.
However, SVT has some limitations as a tool for decision making in content areas. For example, having students respond to a passage they have just read may produce results that are too proximal to be a good indicator of overall student content knowledge and reading comprehension. Also, the content that can reasonably be included in an SVT passage can cover only a small component of the curriculum students are taught. Thus, a student’s performance on one particular passage would likely not reflect their overall content knowledge. To address these limitations, J. L. Hosp and Ford (2014) developed Statement Verification for Science (SV-S).
SV-S
The SV-S is a modified SVT task. To complete SV-S, students do not read a passage and respond to questions about what they just read but, instead, they respond to a series of 60 “true” and “false” items developed in conjunction with state science standards and the Next Generation Science Standards (NGSS; Lee, Miller, & Januszyk, 2014) for 3 min. By using these standards-based items to replace SVT passages, students’ ability to accurately respond depends on knowledge they have obtained via science instruction. Thus, to perform well on SV-S, students must be able to accurately read information about science, determine what is being asked by identifying appropriate academic vocabulary, and decide whether what they have read is accurate. In addition, by providing students with a sufficient number of items and a time limit for administration, rate of performance can be calculated. In initial investigations of SV-S, J. L. Hosp and Ford (2014) first tested 945 items consisting of science statements based on state standards and the NGSS (Lee et al., 2014) with 1,846 students in Grades 7 and 8 at three junior high schools in the Midwest. Using two-parameter logistic item response modeling, 843 items were retained and concurrently calibrated. A separate follow-up study was conducted using students in Grade 8 from one of the participating junior high schools (N = 331) to determine the optimal number of items to include on alternate forms as well as optimal administration time (J. L. Hosp & Ford, 2014). These two studies were conducted with internal grant funding, and their findings are presented in the same report.
In a separate study, Ford and Hosp (in press) examined the technical adequacy of alternate forms of SV-S for students in Grades 7 (N = 799) and 8 (N = 746). Alternate form reliability coefficients were found to range from r = .41 to .58 in Grade 7 and r = .39 to .43 in Grade 8. Ford and Hosp (in press) also examined the evidence of criterion-related validity by comparing students’ performance on SV-S and a statewide, high-stakes test of accountability for science and reading (convergent) and mathematics (divergent). For science, validity coefficients ranged from r = .11 to .24 for Grade 7 and r = .18 to .33 for Grade 8, while for reading, validity coefficients ranged from r = .08 to .26 for Grade 7 and r = .19 to .34 for Grade 8. For mathematics, validity coefficients ranged from r = .12 to .23 for Grade 7 and r = .22 to .35 for Grade 8. Recognizing a lower than desired relation between SV-S and the criterion was observed, Ford and Hosp (in press) speculated that these findings may be due to SV-S items being aligned with state science standards and the NGSS (Lee et al., 2014), while the statewide test is intended to measure a broader domain of science knowledge (thus not necessarily being expected to align with such standards). Indeed, research has consistently found the alignment between statewide, high-stakes accountability tests and state standards to be problematic (Squires, 2012). For example, the Iowa Department of Education investigated the alignment of the Iowa Assessments with the state reading and mathematics standards and found that only 24% of the items in Grade 7 were aligned with mathematics standards and 48% of items were aligned with the Grade 8 reading standards (Data Recognition Corporation, 2013).
Given the novelty of SV-S, and emerging evidence for VM, further investigation of the most efficient format is necessary for educators to practically create tools to support data-based decision making. These tools were also compared with an established CBM tool (maze) to provide additional evidence of criterion-related validity. Maze is commonly viewed as a quick measure of reading comprehension, and while it is technically a general outcome measure in reading, the use of this measure aligns well with goals that teachers would have of enhancing students’ ability to comprehend text (M. K. Hosp et al., 2016). This study investigated two science CBM tools (VM and SV-S) in eighth-grade science. Our specific research questions were as follows:
Method
Participants
The study was conducted in an eighth-grade science classroom in a rural school district in the Midwest. The district had a student enrollment of 799. A sample of convenience included 25 White students, 11 of whom were male, and one student with an IEP. Almost half of the participating students (45%) qualified for free/reduced lunch, and all students were native English speakers.
Measures
SV-S
SV-S items from item testing (J. L. Hosp & Ford, 2014) were reviewed to determine which items had the highest discrimination values. Although past research (J. L. Hosp & Ford, 2014) suggested including 60 items on SV-S forms, we included 20 on each SV-S form to account for the need to create parallel VM forms that only include 20 terms and 22 definitions.
Next, items were included on SV-S Form A and SV-S Form B in an alternating fashion (e.g., the item with the highest discrimination value was included on Form A and the next highest on Form B, etc.). This was done to keep discrimination values comparable across forms. After SV-S forms were created, they were reviewed and edited as necessary to ensure that similar items were not included on the same form (e.g., “Liquid is a state of water”; “Liquid is not a state of water.”). As a result, SV-S Form A had 20 items with an average discrimination value of 0.085, and SV-S Form B had 20 items with an average discrimination value of 0.088. Furthermore, the resulting difficulty values across SV-S forms were comparable as well (−0.109 and −0.114 for SV-S Forms A and B, respectively).
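The alternating assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item identifiers and discrimination values are hypothetical, and the function name `split_into_parallel_forms` is ours, not part of the original item-testing work.

```python
from statistics import mean


def split_into_parallel_forms(items):
    """Assign items to two forms in alternating order of discrimination.

    `items` is a list of (item_id, discrimination) tuples. The item with the
    highest discrimination goes to Form A, the next highest to Form B, and so on.
    """
    ranked = sorted(items, key=lambda item: item[1], reverse=True)
    form_a = ranked[0::2]  # ranks 1, 3, 5, ...
    form_b = ranked[1::2]  # ranks 2, 4, 6, ...
    return form_a, form_b


# Hypothetical item pool (identifiers and values are illustrative only).
pool = [
    ("item_01", 0.12), ("item_02", 0.07), ("item_03", 0.10),
    ("item_04", 0.09), ("item_05", 0.08), ("item_06", 0.11),
]

form_a, form_b = split_into_parallel_forms(pool)
mean_a = mean(d for _, d in form_a)
mean_b = mean(d for _, d in form_b)
print(f"Form A mean discrimination: {mean_a:.3f}")
print(f"Form B mean discrimination: {mean_b:.3f}")
```

Strict alternation always leaves Form A with a mean at or above that of Form B, so any subsequent manual review of item overlap, as was done here, can shift the form means slightly in either direction.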
Standardized directions including two sample items on the first administration of an SV-S form were provided to ensure understanding of the task. Students were instructed that they would have 3 min to read each statement silently, decide whether the statement was correct or incorrect, and then fill in the corresponding circle in the yes or no column. Two SV-S forms were administered; a raw score of the total number of correctly identified statements for each probe and the mean of the two forms were calculated.
VM CBM
Once SV-S forms were created, statements from each form were translated into a VM format to create two forms (VM Form A and VM Form B). To accomplish this, one vocabulary word was selected from each SV-S statement and similar language from the statement was used to create a VM definition. Using these procedures allowed us to claim that the content of SV-S and VM forms was similar and only the format differed. Although our approach is different from that typically used to develop VM forms (in which item development is based on the vocabulary of a textbook used), we determined it necessary to examine differences between VM and SV-S absent the confounding variable of different content.
Each probe was then formatted according to guidelines established in previous studies (e.g., Espin et al., 2001) and consisted of 20 terms listed alphabetically on the left side of the page and 22 definitions (including two distracters) listed in random order on the right side of the page. Flesch–Kincaid analysis revealed SV-S and VM forms to be similar in reading difficulty, with grade readability indices ranging from Grade 6 to 7.
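For readers unfamiliar with the readability index mentioned above, the following minimal sketch applies the standard Flesch–Kincaid grade-level formula. The syllable counter is a rough vowel-group heuristic that we assume purely for illustration; it is not the tool used in this study, and dedicated readability software will give somewhat different values.

```python
import re


def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Standard Flesch-Kincaid grade-level formula computed from raw counts."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


def estimate_syllables(word):
    """Rough heuristic (an assumption): count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def grade_for_text(text):
    """Estimate the grade level of a text using the heuristic syllable counter."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(estimate_syllables(w) for w in words)
    return flesch_kincaid_grade(len(words), len(sentences), syllables)


# Counts chosen for illustration: 100 words, 10 sentences, 150 syllables.
example_grade = flesch_kincaid_grade(100, 10, 150)
print(f"Grade level from counts: {example_grade:.2f}")
```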
Standardized directions were provided and students were instructed that they would have 5 min to match the words on the left-hand side of the page with their definitions by writing the letter of the correct definition in the blank next to each word. They were also instructed that they would have two definitions that would not be used. Two vocabulary-matching measures were administered; a raw score of the total number of correct matches for each probe and the mean of the two forms were calculated.
Aimsweb R-CBM maze
The Aimsweb maze task is a reading task that uses passages between 150 and 400 words. The first sentence is left intact and then every seventh word is replaced with three choices in parentheses. The multiple-choice items consist of the correct answer and two distracter items (Shinn & Shinn, 2002). Alternate form reliability coefficients for Grade 8 are reported to be approximately .75, with concurrent validity with state standardized assessments falling at .55 (Pearson, 2012). Students had 3 min to read the passage and choose the correct answers. Three Aimsweb maze probes were administered, and the median score for each student was used as a criterion measure.
Missouri Assessment Program—Science (MAP-S) subtest
CBM tools were administered concurrently (within 2 weeks) with the Missouri Assessment Program—Science (MAP-S; Missouri Department of Elementary and Secondary Education [DESE], 2015). Science MAP is an untimed standards-based test administered yearly in Grades 5 and 8 by the school district and scored by DESE. Items consist of multiple-choice, short answer, and essay responses. Scores are provided as a scale score on a continuum with scores ranging in value from 470 to 895 for science. A score of 540 to 670 is considered Below Basic, 671 to 702 is Basic, 703 to 734 is Proficient, and 735 to 895 is Advanced. For internal reliability, Cronbach’s alpha ranged from .86 to .91, indicating acceptable reliability (DESE, 2015). The MAP served as a criterion measure for our study.
Procedures
All CBM tools were group administered to students during their science class across four different eighth-grade classes. Students received a packet with three maze passages, two SV-S forms, and two VM forms. Eight different assessment packets were randomly distributed to participants, with either maze passages at the beginning or end of the packet and VM and SV-S forms counterbalanced either after or before the maze task.
For maze, standardized administration procedures were adapted from Shinn and Shinn (2002). For SV-S, administration directions were adopted from the SV-S item studies (J. L. Hosp & Ford, 2014). Administration procedures for VM were adopted from Espin et al. (2001). Prior to engaging in each CBM task, students were given an example item. One researcher trained in CBM administration and familiar with all the CBM tasks administered the packets. A second researcher observed for fidelity of administration and found fidelity was between 98% and 100% across all administrations. Three researchers scored the CBM tasks and entered their scores into a spreadsheet. All scores were double entered and any disagreements were discussed and resolved. Ten of the students’ packets were pulled at random for scoring reliability, equaling 29% of total CBM tasks being scored twice. Scoring reliability across all maze CBM was 99%, SV-S CBM was 99%, and VM CBM was 100%.
Data Analysis
Prior to examining our research questions, we calculated descriptive statistics for all measures. Descriptive statistics were also calculated for median maze, average VM, average SV-S, and MAP performance. Then, to examine our first research question regarding alternate form reliability, bivariate correlations were calculated across Probes 1 and 2 for VM and SV-S with results compared with standards identified by Marston (1989) for CBM research. Such standards are as follows: r ≥ .70 = strong; r = .50 to .69 = moderate; and r < .50 = weak. Next, to examine our second research question related to criterion validity, bivariate correlations for average performance on VM, SV-S, maze, and MAP were calculated. We used a Bonferroni correction of p < .008 to account for multiple comparisons. Furthermore, the standards suggested by Marston (1989) for CBM research described above were utilized as a standard of comparison for interpreting the strength of validity coefficients. Last, to answer our final research question related to predictive accuracy, we conducted a series of Fisher’s r to z transformations prior to using Meng’s z (Meng, Rosenthal, & Rubin, 1992) to determine whether VM or SV-S was a statistically significantly better predictor of maze or MAP performance. Meng’s z (Meng et al., 1992) has been shown to be a simple, but accurate, method for comparing the relation between a dependent variable and a set of independent variables to determine whether a difference exists in the ability of the latter to predict performance on the former.
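The interpretive standards and the corrected significance threshold described above can be expressed as a short sketch. Two points are assumptions on our part rather than details stated in the text: a coefficient of exactly .50 is treated as moderate, and the threshold of .008 is taken to reflect six pairwise comparisons among the four measures (.05 / 6 ≈ .0083).

```python
def marston_strength(r):
    """Label a correlation using the thresholds attributed to Marston (1989).

    Assumption: a coefficient of exactly .50 is classified as moderate.
    """
    magnitude = abs(r)
    if magnitude >= 0.70:
        return "strong"
    if magnitude >= 0.50:
        return "moderate"
    return "weak"


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold under a Bonferroni correction."""
    return alpha / n_comparisons


# Illustrative coefficients only.
for r in (0.73, 0.55, 0.45):
    print(f"r = {r:.2f}: {marston_strength(r)}")

# Assumption: six pairwise comparisons among VM, SV-S, maze, and MAP.
print(f"Corrected threshold: {bonferroni_alpha(0.05, 6):.4f}")
```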
Results
Table 1 includes descriptive statistics for maze, VM, SV-S, and MAP. After an initial screening for missing data, four individuals were removed from the sample. Each measure was judged for deviations of skewness and kurtosis with values above 1.0 considered questionable and above 2.0 problematic (Tabachnick & Fidell, 2013). Our initial examination of descriptive statistics found issues with the distribution of performance on SV-S. Specifically, we observed questionable levels of skewness and kurtosis for SV-S Probe 1 (−1.41 and 1.42, respectively) and problematic levels of each for SV-S Probe 2 (−2.29 and 6.40, respectively). Due to this observation, we identified statistical outliers in our sample (i.e., students performing two standard deviations above or below the mean) on any measure (maze, VM, SV-S, or MAP). This resulted in four additional individuals being removed from our sample for a total N of 25 students included in the full analysis.
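The screening steps described above can be illustrated with a minimal sketch. The scores below are hypothetical, and the moment-based estimators of skewness and excess kurtosis are an assumption: statistical packages often apply small-sample bias corrections, so their values may differ somewhat from these.

```python
from statistics import mean, pstdev, stdev


def skewness(values):
    """Moment-based skewness (no small-sample correction)."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def excess_kurtosis(values):
    """Moment-based excess kurtosis (0 for a normal distribution)."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 4 for v in values) / len(values) - 3.0


def screen(statistic):
    """Apply the cutoffs used above: > 1.0 questionable, > 2.0 problematic."""
    size = abs(statistic)
    if size > 2.0:
        return "problematic"
    if size > 1.0:
        return "questionable"
    return "acceptable"


def outlier_indices(values, n_sd=2.0):
    """Indices of scores more than n_sd sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - m) > n_sd * s]


# Hypothetical scores on a 20-item probe, with one very low score.
scores = [18, 19, 17, 20, 19, 18, 20, 19, 6, 18]
print("skewness:", round(skewness(scores), 2), screen(skewness(scores)))
print("kurtosis:", round(excess_kurtosis(scores), 2), screen(excess_kurtosis(scores)))
print("outliers at positions:", outlier_indices(scores))
```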
Descriptive Statistics for CBM Tools and MAP (N = 25).
Note. CBM = curriculum-based measurement; MAP = Missouri Assessment Program—Science; VM = Vocabulary Matching; SV-S = Statement Verification for Science.
After removing statistical outliers, all maze passages (and thus the median), MAP performance, and VM performance continued to demonstrate acceptable levels of skewness and kurtosis (see Table 1). However, removing statistical outliers was observed to have mixed results regarding the distribution of performance on SV-S. Specifically, levels of skewness and kurtosis remained questionable for SV-S Probe 1 and, while levels of each were more normally distributed for SV-S Probe 2, the average performance on SV-S with outliers remained questionable for skewness and problematic for kurtosis.
In sum, by removing statistical outliers, we were better able to obtain a normally distributed sample to conduct further analyses to answer our research questions. Issues with the distribution of SV-S performance are likely attributed to the observation that 22 of 25 students answered at least 15 of 20 questions for Probe 1 and all 25 students did so for Probe 2. We discuss this issue further in our “Discussion” section. However, despite concerns regarding our sample’s distribution, we believe it prudent to answer the research questions of our pilot study to extend the literature base in using CBM in content areas.
In regard to our first research question examining alternate form reliability, there was a strong, positive relation between VM Probes 1 and 2 (r = .73), which was statistically significant at p < .0001. For SV-S, the relation between probes was positive and weak (r = .45). The relation was statistically significant at p = .02.
With regard to our second research question, evidence of criterion-related validity was demonstrated in a moderate, positive relation between VM and maze (r = .65) and a strong, positive relation between VM and MAP (r = .71). Both relationships were statistically significant at p < .001. A weak, positive relation was found for SV-S and maze (r = .21) and SV-S and MAP (r = .40). After adjusting for multiple comparisons, neither relation was found to be statistically significant.
Table 2 shows the results for our third research question related to examining differences in prediction (Note: When using Meng’s z, the confidence intervals [CIs] describe the z on the chi-square distribution). Compared with SV-S, VM was a better predictor of maze performance (z = 2.66, p = .008; 95% CI = [0.47, 2.75]) and of MAP performance (z = 2.09, p = .04; 95% CI = [0.37, 2.18]). No difference was found between VM and maze (z = −1.71, p = .08; 95% CI = [−0.45, −1.80]) or SV-S and maze (z = 0.35, p = .73; 95% CI = [.22, .44]), when predicting MAP performance.
Differences in Prediction for VM/SV-S to Criterions, Meng’s z (N = 25).
Note. VM = Vocabulary Matching; SV-S = Statement Verification for Science; CBM = curriculum-based measurement; CI = confidence interval; MAP = Missouri Assessment Program—Science.
p < .05.
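The VM versus SV-S comparisons can be approximately reproduced from the correlations reported in this article. The sketch below assumes the standard formula from Meng et al. (1992) for comparing two correlated correlations and uses the VM and SV-S intercorrelation of r = .60 reported in the Discussion. The function names are ours, and the confidence intervals are not reproduced.

```python
import math


def meng_z(r1, r2, r12, n):
    """Meng, Rosenthal, and Rubin (1992) z for two correlated correlations.

    r1, r2: correlations of each predictor with the same criterion.
    r12:    correlation between the two predictors.
    n:      sample size.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher r-to-z transformations
    mean_r_sq = (r1 ** 2 + r2 ** 2) / 2
    f = min(1.0, (1 - r12) / (2 * (1 - mean_r_sq)))  # f is capped at 1
    h = (1 - f * mean_r_sq) / (1 - mean_r_sq)
    return (z1 - z2) * math.sqrt((n - 3) / (2 * (1 - r12) * h))


def two_tailed_p(z):
    """Two-tailed p value from the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Correlations as reported in this article (N = 25; VM with SV-S, r = .60).
z_maze = meng_z(0.65, 0.21, 0.60, 25)  # VM vs. SV-S predicting maze
z_map = meng_z(0.71, 0.40, 0.60, 25)   # VM vs. SV-S predicting MAP

print(f"maze: z = {z_maze:.2f}, p = {two_tailed_p(z_maze):.3f}")
print(f"MAP:  z = {z_map:.2f}, p = {two_tailed_p(z_map):.3f}")
```

Under these assumptions, the sketch yields values close to the reported z = 2.66 and z = 2.09; small discrepancies would be expected because the published correlations are rounded to two decimals.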
Discussion
The purpose of this pilot study was to compare the use of VM and SV-S in a middle school science classroom. More specifically, we attempted to advance the research examining the technical adequacy of these CBM tools in science as well as investigate whether differences might exist for such tools in their relation to a traditional CBM tool for measuring reading (i.e., maze). In addition, we also investigated whether differences for VM, SV-S, or maze existed in their relation to a statewide, high-stakes test of accountability for science (i.e., the MAP).
With regard to alternate form reliability, the relation between our VM probes (r = .73) was consistent with previous research reporting a range from .64 to .84 (Espin et al., 2013). This is encouraging as we used a slightly different approach to developing probes by considering state science standards consistent with the NGSS (Lee et al., 2014). Previous approaches to developing VM probes utilized course content such as textbook glossaries, teacher notes, and unit quizzes and exams (Beyers et al., 2013; Espin et al., 2001). The relationship between our SV-S probes (r = .45) was also consistent with previous research that found alternate form reliability to range from r = .39 to .58 for students in Grades 7 and 8 (Ford & Hosp, in press).
With regard to evidence of criterion-related validity, we again found evidence of coefficients for VM being consistent with previous studies (e.g., Espin et al., 2013). Moreover, the relation we observed between VM and MAP was stronger than other studies examining CBM tools in content areas when examining evidence of concurrent criterion-related validity (Mooney et al., 2016). Given the nature of how VM probes were developed for our study (i.e., based on SV-S probes), it is surprising that a statistically significant relation between VM and maze was found and only a weak relationship between SV-S and the traditional R-CBM tool was observed. This may be due to ceiling effects related to not reducing administration time for SV-S to account for a reduction in items from the original forms. However, the relationship between SV-S in our study and the MAP is congruent with results from a previous study examining evidence of criterion-related validity with a different state’s high-stakes test of accountability (Ford & Hosp, in press). That is, while a mostly weak, positive relationship between SV-S and statewide, high-stakes tests of science has been found, this likely has to do with the need for criterion measures of science content knowledge to better align with specific standards (see Squires, 2012).
With regard to examining differences in prediction, given the strength of the relationships we observed for VM, SV-S, and maze to MAP, it is reasonable that only VM was found to be a better predictor of students’ MAP performance. In a similar way, the strength of the relationship we observed for VM and SV-S to maze also likely accounts for why the former was again found to be a better predictor. However, the moderate, positive relationship between VM and SV-S (r = .60) suggests that the aforementioned ceiling effects for SV-S may again be skewing the true relation between SV-S and other measures of students’ performance.
Study Limitations
As highlighted above, ceiling effects for SV-S likely created an issue with measuring students’ performance. Such effects could be the result of the tool not being as challenging for students in this pilot study compared with previous research. More likely, however, is that students were provided with too much time considering the reduced number of items they were asked to complete for us to compare SV-S with VM. Thus, future research comparing these two CBM tools for science should either reduce the amount of time given with fewer items or administer the probe for 3 min and include more items with the first 22 items matching the VM probe terms.
In addition, we also recognize the limitation of having a small convenience sample from a rural Midwest school. As such, the results of our pilot study may not generalize to other populations. In particular, we recognize that having only one student with an Individualized Education Program (IEP) may mean generalization of our results to students having difficulty meeting academic expectations may be especially limited.
Future Research and Conclusion
Given the importance of using assessment data to inform instruction, future research should continue to examine the best tools for making screening and progress decisions regarding students’ science knowledge. The results of our pilot study indicate that additional research in technical adequacy is needed in this area, specifically for SV-S. Thus, our results are interpreted with caution as immediate implications for practice are likely not present. Instead, our pilot study is best viewed as another step toward developing general knowledge about the use of CBM in science. Additional studies using a larger sample size (and addressing our issues with SV-S administration and development) are necessary prior to being able to infer how our results, if replicated, might assist educators with making screening and progress decisions regarding students’ science content knowledge. Such research should include studies using larger sample sizes from more diverse populations. In addition, specific methods for developing items and comparing VM and SV-S are needed that better align with the original structure of SV-S while still allowing for such a comparison to be made.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
