Abstract
Star Math (SM) is a popular computer adaptive test (CAT) that schools use to screen students for academic risk. Despite its popularity, few independent investigations of its diagnostic accuracy have been conducted. We evaluated the diagnostic accuracy of SM based upon vendor-provided cut-scores (25th and 40th percentiles nationally) in predicting proficiency on an end-of-year state test in a sample of high-achieving grade three (n = 210), four (n = 217), and five (n = 242) students. Specificity exceeded sensitivity across all grades and cut-scores. Acceptable levels of sensitivity and specificity were achieved in grades three and four but not grade five when using the 40th percentile.
Universal screening is a core component of Response to Intervention frameworks, providing a means of early identification that leads to the delivery of preventive supports (Kettler et al., 2014). Four outcomes exist when students complete a universal screening assessment: true positive (TP), false positive (FP), false negative (FN), and true negative (TN; Kettler et al., 2014). A student who is at risk of failure but is incorrectly identified by a screener as not being at risk would be considered an FN. Conversely, students identified as at-risk who are truly not at risk can be considered FPs. Ideal outcomes are TPs and TNs, where students who truly are and are not at risk are identified accurately as such.
Sensitivity (Sn) and specificity (Sp) are two metrics used to assess the diagnostic accuracy of a cut-score for identifying at-risk students on a universal screener. Sensitivity is the proportion of students who are truly at-risk that are identified as such. Specificity is the proportion of students who are truly not at-risk that are identified as such (Kettler et al., 2014). An ideal cut-score achieves an appropriate balance between Sn and Sp, and aligns with school resources and student need.
Math is an important area in which to accurately identify students in need of support, because of its ties to future academic success and high-stakes exam performance (Nicholas & Berliner, 2007). Previous studies support the value of math screening tools in predicting students’ end-of-year performance on state exams (Fuchs et al., 2011). A common assessment used to screen for academic difficulties in math is Star Math (SM), published by Renaissance Learning (2017). Studies that consider the diagnostic accuracy of SM are limited in that they are often provided by the publisher, even though the tool is used by over 34,000 schools/districts in the country (Renaissance Learning, n.d.). The purpose of this study is to provide an independent evaluation of the diagnostic accuracy of default cut-scores from SM in a sample of elementary-aged students. Default cut-scores were selected to mimic how educators would use the tool in their school. Research questions include (a) To what degree is SM performance predictive of student performance on state exams (i.e., Pennsylvania System of School Assessment; PSSA)? and (b) Do default cut-scores effectively identify students who need additional support to succeed on state exams?
Method
Participants and Setting
Demographic Characteristics Across Grades 3–5.
Measures
Data Analysis
Descriptive statistics were calculated for each grade using the SM Fall scaled scores and PSSA Spring scaled scores. Pearson’s correlations between Fall SM scores and Spring PSSA scores were also calculated across grades. Sensitivity and specificity were derived for each grade across vendor-provided cut-scores for risk (i.e., 25th and 40th percentiles) predicting performance levels of “Basic” and “Proficient.” Sensitivity was derived by dividing the number of TPs by the sum of TPs and FNs within the sample (TP/[TP+FN]). Specificity was determined by dividing the number of TNs by the sum of TNs and FPs (TN/[TN+FP]). Positive and negative predictive values, or the ratio of students who were truly at-risk to all those identified as at-risk (TP/[TP+FP]), and the ratio of students who were truly not at-risk to all those identified as not at-risk (TN/[TN+FN]), were also computed.
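The four ratios described above can be illustrated with a minimal Python sketch. The function name `diagnostic_accuracy` and the counts passed to it are hypothetical and chosen only for illustration; they are not values from this study, and the sketch is not the software used for the reported analyses.

```python
def diagnostic_accuracy(tp, fp, tn, fn):
    """Return Sn, Sp, PPV, and NPV from the four screening outcome counts."""

    def ratio(numerator, denominator):
        # A ratio with an empty denominator is undefined; report it as None.
        return numerator / denominator if denominator else None

    return {
        "Sn": ratio(tp, tp + fn),   # TP / (TP + FN)
        "Sp": ratio(tn, tn + fp),   # TN / (TN + FP)
        "PPV": ratio(tp, tp + fp),  # TP / (TP + FP)
        "NPV": ratio(tn, tn + fn),  # TN / (TN + FN)
    }


# Hypothetical counts for illustration only; not taken from this study.
example = diagnostic_accuracy(tp=30, fp=10, tn=150, fn=20)
for name, value in example.items():
    print(f"{name} = {value:.2f}")
```

With these hypothetical counts, Sn = 30/50 = .60 and Sp = 150/160 ≈ .94, while PPV = 30/40 = .75 and NPV = 150/170 ≈ .88.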
Results
Descriptive and Correlational
Descriptive Statistics.
Note. SM – Star Math; PSSA – Pennsylvania System of School Assessment.
Diagnostic Accuracy
Diagnostic Accuracy Results of Star Math predicting Levels of Performance on Pennsylvania System of School Assessments.
Note. Percentile – Fall Percentile Cut-Score used with Star Math to predict performance on Pennsylvania System of School Assessments; BR – Base Rate; TP – True Positive; TN – True Negative; FP – False Positive; FN – False Negative; Sn – Sensitivity; Sp – Specificity; PPV – Positive Predictive Value; NPV – Negative Predictive Value.
Diagnostic accuracy results when predicting Proficient levels of performance are also presented in Table 3. When using the 25th percentile as a cut-score, Sn did not exceed .30 across all grade levels. Specificity values were similar to those observed when predicting Basic performance. Considering the 40th percentile, Sn values tended to increase but did not exceed those observed when predicting Basic performance across each grade. Specificity values remained largely unchanged.
Discussion
The present study addressed the diagnostic accuracy of SM in predicting elementary students’ proficiency on the PSSA. Research questions considered the correlation between Fall SM performance and Spring PSSA outcomes, along with variations in Sn and Sp with cut-scores at the 25th and 40th percentiles. The strong, positive correlation between Fall SM and Spring PSSA outcomes provides support for the predictive validity of SM. Sensitivity and specificity are also vital to consider when assessing the effectiveness of SM at identifying students in need of additional support. SM was highly accurate at identifying students who were not in need of additional support in the present sample. SM was less accurate at identifying students in need of additional support, particularly when using the 25th percentile threshold in higher grades. Increased Sn was observed when adjusting the cut-score from the 25th to 40th percentile in all grades but remained at suboptimal levels. This highlights the value of increasing cut-scores to more accurately catch those students in need of assistance, but underscores that additional data may be needed prior to making “rule-in” decisions if using SM.
These results have implications for identification and cut-score selection. In most circumstances, Sn and Sp should be balanced (Kettler et al., 2014). This study highlights the need to take into account local context, particularly in very high or low achieving settings, when selecting cut-scores, since publisher-provided cut-scores may not be accurate at identifying students at risk of academic failure. In this study, the population was an overall high performing group of students. The publisher-provided cut-scores did not consistently identify students who would not achieve proficiency on the PSSA. In any case, adjusting cut-scores to be more or less stringent will have direct implications for the number of students identified as at-risk and, by extension, the number of students incorrectly identified. For instance, if one were to prioritize identifying all students who may need supplemental intervention, one could increase the cut-score students need to exceed on SM. This would lead to an increase in Sn by virtue of identifying more students as at risk. At the same time, there is an increased chance of FPs, decreasing Sp (or the accuracy of rule-out decisions). Conversely, if a school had limited resources and wanted to identify only those students at most risk for later difficulties, it could decrease the cut-score students need to score above to be considered not at risk. This in turn will increase Sp (increasing the accuracy of rule-out decisions) at the cost of Sn (by virtue of identifying fewer students).
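The trade-off described above can be made concrete with a small Python sketch. The student records below are hypothetical (a screener percentile paired with whether the student later fell short of the criterion level) and are not data from this study. The sketch assumes a student is flagged as at-risk when scoring at or below the cut-score, and that the sample contains at least one student in each true-status group.

```python
# Hypothetical records: (fall screener percentile, truly at-risk on the spring criterion).
students = [
    (10, True), (20, True), (30, True), (35, True), (45, True), (60, True),
    (22, False), (38, False), (50, False), (55, False), (65, False),
    (70, False), (80, False), (85, False), (90, False), (95, False),
]


def sn_sp_at_cut(records, cut):
    """Sn and Sp when students scoring at or below `cut` are flagged as at-risk."""
    tp = sum(1 for pct, at_risk in records if at_risk and pct <= cut)
    fn = sum(1 for pct, at_risk in records if at_risk and pct > cut)
    fp = sum(1 for pct, at_risk in records if not at_risk and pct <= cut)
    tn = sum(1 for pct, at_risk in records if not at_risk and pct > cut)
    return tp / (tp + fn), tn / (tn + fp)


sn_25, sp_25 = sn_sp_at_cut(students, 25)
sn_40, sp_40 = sn_sp_at_cut(students, 40)
print(f"25th percentile: Sn = {sn_25:.2f}, Sp = {sp_25:.2f}")
print(f"40th percentile: Sn = {sn_40:.2f}, Sp = {sp_40:.2f}")
```

In this made-up sample, raising the cut-score from the 25th to the 40th percentile flags two more truly at-risk students (Sn rises from 2/6 to 4/6) but also one more student who was not at risk (Sp falls from 9/10 to 8/10).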
One method to promote more accurate identification in a population markedly different from a norm group is the use of locally derived cut-scores. Research supports the value of using locally derived cut-scores (Leblanc et al., 2012), especially for the purpose of maximizing the balance between Sn and Sp (Silberglitt & Hintze, 2005). Future research may consider the effectiveness of SM for identifying academically at-risk students when locally derived cut-scores are applied.
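As one illustration of how a local cut-score might be derived, the Python sketch below scans every observed screener score and keeps the one that maximizes Youden's J (Sn + Sp − 1). This criterion is an assumption made for the example; it is one common way to balance Sn and Sp, not necessarily the procedure used in the cited studies. The records and function names are hypothetical, and the sketch assumes the sample contains both at-risk and not at-risk students.

```python
# Hypothetical local records: (fall screener percentile, truly at-risk on the spring criterion).
local_records = [
    (10, True), (20, True), (30, True), (35, True), (45, True), (60, True),
    (22, False), (38, False), (50, False), (55, False), (65, False),
    (70, False), (80, False), (85, False), (90, False), (95, False),
]


def sn_sp(records, cut):
    """Sn and Sp when students scoring at or below `cut` are flagged as at-risk."""
    tp = sum(1 for pct, at_risk in records if at_risk and pct <= cut)
    fn = sum(1 for pct, at_risk in records if at_risk and pct > cut)
    fp = sum(1 for pct, at_risk in records if not at_risk and pct <= cut)
    tn = sum(1 for pct, at_risk in records if not at_risk and pct > cut)
    return tp / (tp + fn), tn / (tn + fp)


def local_cut_by_youden(records):
    """Return the observed score that maximizes Youden's J, along with that J."""
    best_cut, best_j = None, -1.0
    for cut in sorted({pct for pct, _ in records}):
        sn, sp = sn_sp(records, cut)
        j = sn + sp - 1  # Youden's J statistic
        if j > best_j:   # ties keep the lower (more conservative) cut
            best_cut, best_j = cut, j
    return best_cut, best_j


best_cut, best_j = local_cut_by_youden(local_records)
print(f"Locally derived cut: {best_cut}th percentile (J = {best_j:.2f})")
```

A school that weighs missed students more heavily than over-identification could replace J with a criterion that requires a minimum Sn first and then maximizes Sp, which is a design choice rather than a statistical necessity.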
One specific finding was relatively perplexing: the poor performance of grade five cut-scores relative to grades three and four. Namely, FN rates in grade five were larger than those in other grades. This suggests an increased likelihood that SM would identify students as not at-risk who then did not go on to achieve Basic or Proficient levels of performance on the end-of-year test. This may reflect that the content and/or questions of SM and the PSSA are more closely aligned with one another at lower grade levels. Similarly, a review of the content described on the grade five PSSA suggests that the depth and breadth of skills assessed are greater than those assessed in lower grades. Acknowledging the limitations of the cross-sectional nature of the present dataset, the base rate of students in grade five for Proficient levels of performance (.35) relative to grade three (.27) and grade four (.24) lends some support to this hypothesis.
The present study had some potential limitations. First, the sample lacked diversity, in that the majority of students across grades were White. This was also an overall high performing sample of students. Future research should investigate the effectiveness of SM when applied with diverse populations and with a student body that has a greater level of academic need. In addition, demographic information was not available for student gender. Only two publisher-provided cut-scores were evaluated for Sn and Sp. It would be worthwhile to investigate the application of locally derived cut-scores for this and other samples markedly different from the SM norm group. However, the original purpose of this study was to evaluate the technical adequacy of cut-scores provided by SM to schools. It is also worth noting that high-stakes exams like the PSSA may not be equally valued as a criterion measure across different regions and communities (January & Ardoin, 2015).
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
