Abstract
U.S. elementary schools administer reading screeners to identify students in need of remedial instruction. However, the administration of additional assessments comes with a cost. It is unclear to what extent administering multiple types of reading screeners warrants the increase in resources that could otherwise be used for instruction. This study compared cost–accuracy ratios for three types of reading screeners in Grade 3: curriculum-based measurement (Acadience), computer adaptive assessment (Star), and informal reading inventory (Fountas and Pinnell Benchmark Assessment System), as well as the cost–accuracy of using all three in conjunction. We used classification and regression tree analysis to identify local cut-scores and to determine how measures could be combined to maximize classification accuracy. Results suggested that oral reading fluency score (Acadience) yielded the best cost–accuracy ratio, but the combination of Star and oral reading fluency identified important instructional groups. Cost tables provide additional insight to schools on critical decision points for choosing and implementing reading screeners.
Screening students’ reading development can lead to more individualized and effective instruction that improves elementary reading outcomes (Connor, 2019), which can lead to later literacy development, improved graduation rates (Blachman et al., 2014), and fewer negative life outcomes associated with illiteracy (Cree et al., 2012). Many states require that all students in Kindergarten through Grade 3 be identified for risk of not meeting grade-level reading proficiency (Council of Chief State School Officers & Center on Enhancing Early Learning Outcomes, 2019). The primary purpose of screening is early identification of students in need of remedial instruction at a time when the effects of reading intervention can have the largest long-term benefits (Blachman et al., 2014). As more schools have adopted reading screeners, questions about the best types, acceptable levels of accuracy, timing and number of measures to administer, and how to use scores to make instructional decisions have become pressing (Compton et al., 2010).
Examining how schools conduct screening is critical because the resulting decisions provide remediation for some students and not others, and the assessments may take the place of instruction that may have yielded other benefits (i.e., opportunity cost; Levin et al., 2018). The purpose of the study was threefold: (a) to develop a metric for applying the concept of cost effectiveness to the context of screening in schools, (b) to illustrate a method for determining the accuracy of using reading screeners to identify groupings of student performance that may be instructionally useful (Kiernan et al., 2001), and (c) to compare the costs and cost–accuracy of common reading screening approaches.
Reading Screening Assessment Systems in Elementary Schools
Three types of reading screeners predominate: curriculum-based measurement (CBM), computer adaptive assessments (CAT), and informal reading inventories. Each approach differs in several aspects valued by schools: technical adequacy, instructional utility, and administration time and costs (summarized in Table 1). Curriculum-based measurement in reading (CBM-R; Deno, 2003) has been used extensively as a universal screener (Wayman et al., 2007). Examples of commercially available CBM products include Acadience (which was used in this study), DIBELS, AIMSweb, and FASTBridge. CBM includes brief assessments, focused on critical skill areas in reading and its items are intended to be discrete so that progress can be more easily measured. CBM has high technical adequacy with multiple equivalent forms to facilitate the measurement of incremental growth (Deno, 2003). For instructional utility, educators use some metrics within CBM-R to adjust their instruction for building fluency, decoding, and phonemic awareness.
Characteristics of Three Approaches to Reading Screening.
Note. Acadience Reading Composite and Star Reading technical adequacy data retrieved from National Center on Intensive Intervention (2021), unless otherwise noted. Fountas & Pinnell Benchmark Assessment System data retrieved from Heinemann (2012), unless otherwise noted. ORF = Oral reading fluency; F&P BAS = Fountas & Pinnell Benchmark Assessment System; NR = Not Reported.
a School determined this cut-score for risk based on Heinemann’s grade level range. b Retrieved from Powell-Smith et al. (2012). c Acadience Reading reliability and validity technical adequacy data retrieved from Dewey et al. (2015), except as noted. d Retrieved from Renaissance Learning, Inc. (2022). e Retrieved from Good et al. (2019). f Reported as effect size (Cohen’s d). g Convergent validity ranged from .42 to .44 (nonfiction and fiction books, respectively) with Degrees of Reading Power texts to .93 to .94 (fiction and nonfiction books, respectively) with Reading Recovery assessment texts (Heinemann, 2012).
CAT includes a series of multiple-choice items administered via technology. CAT is based on item response theory and selects easier or more difficult items according to student performance on previous items. This approach provides more precise information and reduces administration time and errors compared with assessments based on classical test theory (Wainer et al., 2000). However, administration is typically longer than CBM. There are several technically sound commercial CATs for reading screening (e.g., FASTBridge, iReady, NWEA MAP Growth, RAPID; National Center on Intensive Intervention, 2021). These assessments vary in instructional utility: some provide subscale scores for the strands of state standards, and some provide scores related to different skills identified by the science of reading. For example, the NWEA MAP Growth gives scores for the Literature and Informational Reading strands of the Common Core State Standards. Other CATs provide subscale scores aligned with research-based domains of reading (e.g., phonemic awareness, decoding, vocabulary, reading comprehension) useful for guiding selection of intervention (e.g., iReady, RAPID). CATs are more reliable for capturing growth across school years compared with CBM-R because they do not rely on the grade level of the passages (see Mitchell et al., 2015).
Informal reading inventories (e.g., Benchmark Assessment System, Developmental Reading Assessment) typically involve a series of leveled texts in which students read the entire passage aloud and answer reading comprehension questions. Teachers determine the student’s independent reading level based on a combination of reading accuracy, fluency, and reading comprehension. Compared with CBM or CAT, inventories are more widely used by general education teachers for making instructional decisions, such as forming guided reading groups, because they have easily interpretable scores (e.g., grade level of the text; Ford & Opitz, 2008). Their utility for screening is limited by the potential for high rates of administrator error, lack of technical adequacy (Klingbeil et al., 2015), lack of bias analysis, time-intensive administration, and larger measurement error due to lack of text equivalence across forms (Tortorelli, 2019). However, educators have been using informal reading inventories for much longer than either CBM or computer adaptive assessments have existed, and use of informal inventories is considered a critical skill for informing classroom instruction (Provost et al., 2010).
Multivariate Screening
In multivariate screening, several screeners are administered to all students and then a combination of scores is used for decision-making. Research on multivariate screening is mixed, with some finding that it improves accuracy (Compton et al., 2010), some finding that it can decrease accuracy (VanDerHeyden et al., 2018), and others recommending that the very small improvement in accuracy is not worth the opportunity cost required to administer multiple assessments (Klingbeil et al., 2017; VanDerHeyden et al., 2018).
If additional screeners are detracting from teachers’ time for instruction or intervention, schools need to justify what benefit each measure provides (Clemens et al., 2016). Collectively, these trade-offs include (a) teachers’ time spent in training, administering, and scoring assessments instead of instruction; (b) opportunities for error in weighing multiple scores and score types; and (c) financial resources that could be spent on instructional materials rather than assessment materials. These trade-offs can be collectively quantified as “opportunity costs” (Levin et al., 2018), and represent how we conceptualized and calculated costs in the current study. Furthermore, experts recommend that schools evaluate the cut-scores used in their screeners because schools may use different outcomes (e.g., different state tests) than the vendor’s study and each school may have a different base rate of non-proficiency on the outcome, both of which affect classification accuracy (Schatschneider et al., 2008).
Finally, to begin to understand the potential instructional utility of multiple screeners, we use classification and regression tree (CART) analysis, as opposed to prior research, which has typically used multiple regression. Unlike regression, CART is a nonparametric procedure that identifies groups of students that are homogeneous in their performance on screeners rather than assuming students fall in a rank order on a normal distribution of overall reading performance. In this way, CART has utility for identifying groups of students with similar instructional needs (Kiernan et al., 2001). Next, we describe cost–accuracy ratios, which combine opportunity cost with classification accuracy to evaluate the value of each screening approach.
Combining Cost With Classification Accuracy in a Cost–Accuracy Ratio
Cost–accuracy analysis is conceptually similar to cost-effectiveness analysis, a formal methodology that calculates cost-effectiveness ratios by dividing per-student costs by the effect (e.g., effect size) of an evidence-based program (per student cost/effect) (Levin et al., 2018). In cost–accuracy ratios, we replaced effect with two measures of accuracy: overall classification accuracy and negative posttest probability. Overall classification accuracy is likely the simplest to understand because it is the number of students correctly identified by the screener. Negative posttest probabilities indicate improvement in probability of correct identification of risk in a given context (base rate). Put another way, negative posttest probability is the probability of a student performing above the cut-score on the screening, but then scoring below proficiency on the year-end test. This is usually the type of error that schools want to avoid. For example, if negative posttest probability were 10% in a context where 59% of students failed the year-end test, then the probability of correct identification of true positives was improved by 49%. On the contrary, if negative posttest probability were 55% in a context where 59% of students failed the year-end test, use of the measure should be questioned because it did not improve the probability of identification of risk in that particular context. In the case of overall classification accuracy, the cost–accuracy ratio reflects the cost to accurately identify one student. In the case of negative posttest probability, the cost–accuracy ratio reflects the cost to reduce the probability of a false negative error by 1%.
Research Questions
In this study we sought to help schools weigh the value of various aspects of their screening system by asking:
Method
Sample
Archival data were collected from 114 Grade 3 students (110 in the analysis) in six classrooms in one elementary school in the U.S. state of Michigan during the 2018–2019 school year. The school served 618 students and was designated as a Title I school with 65% of the Grade 3 students qualifying for free or reduced-price lunch. Students’ scores on the Grade 3 state achievement test were not significantly different from the mean state score for Grade 3, t (112) = −0.66, p = .510. Student demographics are further detailed in Table 2.
Sample Characteristics for Reading Screening Tests (N = 110 Grade 3 Students).
Screening Measures
Although the school administered reading screeners three times per year, this study used the fall screening scores collected in September. We focused on the fall scores, rather than the winter or spring scores, because they are the most applicable for screening: remedial instruction provided in response to the fall screener maximizes the likelihood of prevention and early intervention. See Table 1 for the psychometric properties of each of the measures.
Curriculum-Based Measurement
The Acadience Reading assessment provides a composite score that weighs four scores using the following formula from the technical manual: Acadience Reading Composite score = Oral Reading Fluency + (2 × Retell) + (4 × Maze Adjusted Score) + Oral Reading Accuracy (Gray et al., 2021). The four scores and composite were entered in the multivariate analysis to determine which scores carried the highest utility.
Oral Reading Fluency and Accuracy
Students read three passages aloud for 1 min each. The assessor marked word substitutions, omissions, and pauses of >3 s as errors. The oral reading fluency score was calculated by subtracting the number of errors from the total number of words read for each passage. The recorded oral reading fluency score was the median words read correctly per minute from the three passages. Oral reading accuracy for each passage was calculated by dividing the words correct by the number of correct words plus incorrect words (i.e., total words read), then multiplying by 100%. The median percentage of correctly read words was recorded as the oral reading accuracy score.
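The scoring rules in this paragraph can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not the publisher's scoring software; the function names and the passage counts are hypothetical.

```python
from statistics import median

def passage_scores(total_words_read, errors):
    """Return (words correct, accuracy %) for one 1-min passage."""
    words_correct = total_words_read - errors
    # Accuracy = words correct / total words read, expressed as a percentage.
    accuracy_pct = 100.0 * words_correct / total_words_read if total_words_read else 0.0
    return words_correct, accuracy_pct

def orf_and_accuracy(passages):
    """Median words correct per minute and median accuracy across passages."""
    scored = [passage_scores(total, errors) for total, errors in passages]
    orf = median(wc for wc, _ in scored)
    accuracy = median(acc for _, acc in scored)
    return orf, accuracy

# Hypothetical (total words read, errors) for three passages.
passages = [(92, 4), (85, 5), (101, 3)]
orf, accuracy = orf_and_accuracy(passages)
print(orf, round(accuracy, 1))
```

Note that the two medians are taken independently, so the recorded fluency and accuracy scores need not come from the same passage.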
Retell
Following oral reading of each passage, the student was asked to retell the story. The assessor tracked the total number of retell words by marking a line through consecutive numbers in the scoring booklet for each word the student said that related to the passage. The median total number of words recalled was recorded.
Maze
Maze was measured separately in a group format. Students were asked to silently read a passage for 3 min in which every seventh word was replaced with a box containing the missing word and two distractor words. The Maze score was calculated as the number of correct responses minus half the number of incorrect responses. An adjusted Maze score (found using a table published by Acadience) was used for the composite score.
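As a small worked example of the Maze arithmetic, the sketch below assumes the score is the number of correct responses minus half the number of incorrect responses. The function name and the response counts are hypothetical, and the lookup in the published adjustment table is not reproduced here.

```python
def maze_score(correct, incorrect):
    """Correct responses minus half the incorrect responses (assumed reading)."""
    return correct - incorrect / 2

# Hypothetical student: 18 correct and 4 incorrect selections in 3 min.
print(maze_score(18, 4))
```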
Computer Adaptive Assessment
The Star Reading assessment (Renaissance Learning, Inc., 2022) is a group-administered test consisting of 34 multiple-choice vocabulary-in-context and reading comprehension questions. Student performance resulted in a scaled score ranging from 0 to 1,400, which was used in the analysis.
Informal Reading Inventory
In the Benchmark Assessment System, teachers chose to administer either an informational or literary text. Accuracy was measured by the percentage of words read correctly on the full text and comprehension was rated on a 4-point scale. The assessor used accuracy and comprehension to identify the student’s instructional level (i.e., Levels A–Z). Level N is designated as the beginning of Grade 3 (Fountas & Pinnell, 2016). For analyses, the letter was translated into a numerical scale (e.g., A = 1, B = 2).
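The letter-to-number conversion can be illustrated as follows. This is a minimal sketch that assumes the mapping continues alphabetically through Z (A = 1, B = 2, ..., Z = 26); the function name is hypothetical.

```python
import string

def level_to_number(level):
    """Map a text level A-Z to 1-26 (assumed alphabetical continuation)."""
    level = level.strip().upper()
    if len(level) != 1 or level not in string.ascii_uppercase:
        raise ValueError(f"not a single letter A-Z: {level!r}")
    return string.ascii_uppercase.index(level) + 1

print(level_to_number("N"))  # beginning-of-Grade-3 level under this mapping
```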
Criterion Measure: Michigan Student Test of Educational Progress (M-STEP)
The Michigan Student Test of Educational Progress (M-STEP) in English Language Arts (ELA) is a summative computer-adaptive assessment given annually to students in Grades 3 through 7 (Michigan Department of Education [MDE], 2019). The 2019 M-STEP was derived from the Smarter Balanced Assessment Consortium (SBAC) assessments (variations used by 12 other states). Students received an overall scaled score which ranged from 1,203 to 1,357. For Grade 3 ELA, 1,300 was the cut-score for the proficient level (MDE, 2019).
Cost Data
Cost data were collected retrospectively during the fall of 2021 through semi-structured interviews with the school staff to comprehensively capture all of the resources needed to implement the assessments as intended (i.e., the ingredients method; Levin et al., 2018). The interview protocol was based on Hollands et al. (2021) and asked school staff about the quality and quantity of resources needed for implementation (see online supplementary materials for the interview protocol). In addition, school staff were asked to review the school records or budgets to answer some of the questions. The interviews were recorded and then a member checking process was used to ensure the trustworthiness of the data.
For each assessment program, costs included financial costs to purchase the assessment program, materials, and data management system; personnel costs for assessor training, administration, monitoring, and scoring the assessments; personnel costs for substitute teachers; and equipment for administering and scoring the assessments (e.g., Chromebooks). Financial values were assigned to units within each of the ingredients based on the school’s Enterprise Resource Planning (ERP) software program. To maximize generalizability, some values were estimated using the CostOut Toolkit (Hollands et al., 2015), based on the year they were incurred. Fixed costs for training and equipment were annualized over the lifetime of the resource to account for the depreciation of resources over time, as well as the interest accrued for the non-depreciated portion (Levin et al., 2018). Chromebooks were annualized over 5 years; the remaining fixed costs were annualized over 3 years, based on the semi-structured interviews, and both were assumed to have a standard 5% interest rate. All assessments were administered in the general education classroom; therefore, facilities costs were excluded from the cost analysis.
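To make the annualization step concrete, the sketch below uses the standard annualization factor, r(1 + r)^n / ((1 + r)^n − 1), with the 5% rate and the 5- and 3-year lifetimes stated above. That this exact factor was applied is an assumption, and the US$300 purchase price is hypothetical.

```python
def annualization_factor(rate, years):
    """Share of a purchase price charged to each year of the resource's life."""
    if rate == 0:
        return 1 / years  # straight-line depreciation when there is no interest
    growth = (1 + rate) ** years
    return rate * growth / (growth - 1)

def annualized_cost(purchase_price, rate, years):
    return purchase_price * annualization_factor(rate, years)

# Hypothetical US$300 items at the 5% rate reported in the text.
chromebook_per_year = annualized_cost(300.00, 0.05, 5)  # 5-year lifetime
training_per_year = annualized_cost(300.00, 0.05, 3)    # 3-year lifetime
print(round(chromebook_per_year, 2), round(training_per_year, 2))
```

A shorter lifetime yields a larger annual charge, which is why the assumed lifetime of equipment can noticeably shift per-student cost estimates.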
Procedures
During the 2018–2019 school year, school staff administered all three reading screeners to Grade 3 students.
Acadience Reading
Acadience Reading was administered via paper and pencil materials by a school-wide assessment team, which included a district-level Multi-Tiered System of Supports (MTSS) Coordinator and six contracted assessors. The assessment team administered the entire Acadience Reading Grade 3 reading battery, which included individually administered ORF, Accuracy, and Retell measures and a group-administered Maze measure. The ORF, Accuracy, and Retell measures took approximately 5 to 7 min to administer plus 3 to 5 min to build rapport with the student and provide instructions, resulting in a total of 10 min per child for administration. The group-administered Maze measure took 10 to 15 min total for administration and scoring. Each member of the school-wide assessment team scored and entered data on their own computers. Scoring took approximately 3 to 5 min per student. This process cumulatively took approximately 30 min per benchmark. Fidelity of implementation data were not available; however, the school-wide assessment team attended 1 full-day group training provided by the MTSS Coordinator, in conjunction with training videos available through Acadience, to learn standardized administration and scoring.
Star Reading
Star Reading was group-administered using individual Chromebooks. The assessment took approximately 20 min to complete and teachers allocated 30 min for the entire assessment. Owing to the computer-based nature of the assessment, teachers did not receive any formal professional development regarding administration. Rather, the MTSS Coordinator emailed teachers scripted directions prior to administration to remind them about how to introduce the assessment and administration procedures, but implementation fidelity data were not available.
Benchmark Assessment System
In Grade 3, the general education classroom teachers administered the Benchmark Assessment System screeners, while a substitute teacher covered their classroom. Administration occurred over 2 days per classroom per benchmark, which consisted of approximately 40 min per student, including transitions and instructions.
To prepare, teachers watched the 2-hr training video included in the purchased teacher kits, which covered administration and scoring; the MTSS Coordinator facilitated this activity. Furthermore, to increase the likelihood of standardized administration and scoring over time, the MTSS Coordinator developed a document summarizing the protocol, and this document was distributed prior to each benchmark.
Data Analysis
Classification Accuracy
Using SAS 9.4, we conducted three CART analyses, one on each screener separately, and one multivariate CART analysis that combined the scores from the three screeners. CART models partition each student’s performance at every possible cut-score on each assessment to split the sample into mutually exclusive subgroups in incremental steps, selecting the scores with the most utility first. The predictor variables entered into the CART model were: (a) BAS level converted to a numeral, (b) the scaled score from Star Reading, (c) Acadience words read correctly per minute, (d) Acadience percentage of words read correctly, (e) Acadience retell, and (f) Acadience Maze. The dependent variable was dichotomous proficiency on the M-STEP. Unlike regression analyses, which attribute overlapping variance between the first and subsequent variables to the first variable, CART simultaneously considers all scores to identify the scores that most effectively split participants into proficient and non-proficient categories. By using this nonparametric splitting approach, CART is not limited by collinearity of predictor variables as is logistic regression. By default, SAS uses an entropy-based split criterion, a cost-complexity pruning method, and a subtree evaluation criterion with a 10-fold cross-validation procedure, which seeks to reduce the average misclassification rate (SAS Institute Inc., 2015). SAS randomly assigns observations to folds in the cross-validation, which can result in slightly different results; therefore, we set the initial seed for random number generation at 123 to facilitate replication of the results. CART analysis generates (a) a decision tree that provides the optimal cut-scores in the predictor variable(s), (b) the number of students classified in each node, which are homogeneous groups of student performance, and (c) a 2 × 2 contingency table indicating the number of true and false positives and true and false negatives.
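The core idea of trying every possible cut-score and keeping the one that yields the purest subgroups can be illustrated with a single entropy-based split. The sketch below is plain Python, not the SAS procedure used in the study: it handles one predictor, makes one split, and omits cost-complexity pruning and cross-validation. All names and data are hypothetical.

```python
from math import log2

def entropy(labels):
    """Binary entropy of a list of 0/1 labels (1 = not proficient)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def best_cut_score(scores, not_proficient):
    """Return (cut, weighted entropy) for the best split: score < cut vs. score >= cut."""
    pairs = list(zip(scores, not_proficient))
    n = len(pairs)
    best = (None, entropy([y for _, y in pairs]))  # baseline: no split
    values = sorted(set(scores))
    # Candidate cuts are midpoints between adjacent observed scores.
    for low, high in zip(values, values[1:]):
        cut = (low + high) / 2
        left = [y for s, y in pairs if s < cut]
        right = [y for s, y in pairs if s >= cut]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if weighted < best[1]:
            best = (cut, weighted)
    return best

# Hypothetical fluency scores and year-end outcomes (1 = not proficient).
scores = [60, 70, 75, 85, 90, 100]
outcome = [1, 1, 1, 0, 0, 0]
cut, impurity = best_cut_score(scores, outcome)
print(cut, impurity)
```

A full tree repeats this search within each resulting subgroup and across all predictors, which is how a second measure can enter the model for only part of the sample.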
The contingency tables were used to calculate classification accuracy. In this study, we used the overall classification accuracy because of its ease of interpretation for schools in the cost–accuracy ratio and the negative posttest probability statistic because it accounts for base rates of risk on the screener in a specific population (VanDerHeyden, 2013). We also report sensitivity and specificity values for context because these are typically reported. Formulas for each are provided in the supplementary online materials. Sensitivity values >.90 and negative posttest probabilities near or <.10 are considered as optimal for screening purposes (Jenkins et al., 2007; VanDerHeyden, 2013).
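For readers who want to compute these statistics from a 2 × 2 table, a minimal sketch follows. It assumes conventional definitions, with "positive" meaning flagged as at risk by the screener and negative posttest probability taken as FN / (FN + TN); the study's own formulas are in its supplementary materials and are not reproduced here. The counts are illustrative and are not taken from Table 3.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy statistics from a 2 x 2 screening contingency table."""
    n = tp + fp + tn + fn
    return {
        "base_rate": (tp + fn) / n,           # proportion not proficient on the outcome
        "sensitivity": tp / (tp + fn),        # at-risk students correctly flagged
        "specificity": tn / (tn + fp),        # not-at-risk students correctly cleared
        "overall_accuracy": (tp + tn) / n,    # all correct decisions
        "negative_posttest_prob": fn / (fn + tn),  # cleared by screener but not proficient
    }

# Illustrative counts for 110 students.
m = classification_metrics(tp=50, fp=5, tn=40, fn=15)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```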
Cost Analysis
For the cost and cost–accuracy analyses, ingredients identified during the semi-structured interviews were first determined to be fixed (did not vary based on the number of students) or variable (varied based on the number of students). All of the ingredients or resources are listed in Tables 4 to 6. Fixed costs included professional development, teacher kits, and other classroom- or teacher-level resources. Variable costs included student materials and equipment, and other student-level resources. Per student costs were calculated by multiplying the units by the unit prices for each individual ingredient, summing the costs across all ingredients, and then dividing the total costs for each program by the number of students. Costs were calculated for the fall benchmark alone to align with the classification accuracy analyses, as well as for the entire academic year to align with the typical educational practice.
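The per-student calculation described here (units × unit price, summed over ingredients, divided by the number of students) might be sketched as below. The ingredient names, quantities, and prices are hypothetical and are not the values in Tables 4 to 6.

```python
def per_student_cost(ingredients, n_students):
    """Return (total cost, cost per student) from (name, units, unit_price) rows."""
    total = sum(units * unit_price for _, units, unit_price in ingredients)
    return total, total / n_students

# Hypothetical ingredients for one benchmark period.
ingredients = [
    ("Assessor time (hours)", 20, 30.00),
    ("Student booklets", 110, 1.50),
]
total, per_student = per_student_cost(ingredients, n_students=110)
print(f"total = {total:.2f}, per student = {per_student:.2f}")
```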
Cost–accuracy ratios were calculated for the overall classification accuracy by dividing the total costs by the number of accurately identified students (total cost/[true negative + true positive]). Cost–accuracy ratios for negative posttest probability were calculated by first subtracting the negative posttest probability from the base rate, which indicates improvement in the probability of correct identification of risk in a given context. Then, the total costs were divided by this difference score, to indicate the costs to obtain gains in probability of correct identification of risk above those that could be obtained by chance alone. Because cost-effectiveness (and cost accuracy) analyses usually convert metrics to a per student value to assist in comparability of results across studies, we computed the cost to improve (lower) negative posttest probability by 1% (see online supplementary materials for equations). Finally, supplementary analyses were conducted to understand the extent to which the cost–accuracy ratios were robust to variations in the assumptions.
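The two ratios can be sketched directly from the verbal description in this paragraph. The exact equations are given in the study's supplementary materials and may include scaling steps not shown here, so this sketch should not be expected to reproduce Table 7. All inputs below are hypothetical.

```python
def cost_per_correct_identification(total_cost, true_pos, true_neg):
    """Total cost divided by the number of correctly classified students."""
    return total_cost / (true_pos + true_neg)

def cost_per_npp_point(total_cost, base_rate, negative_posttest_prob):
    """Total cost per percentage point of improvement over the base rate."""
    improvement_points = (base_rate - negative_posttest_prob) * 100
    if improvement_points <= 0:
        raise ValueError("screener does not improve on the base rate")
    return total_cost / improvement_points

# Hypothetical screener: US$2,000 total, 90 correct decisions,
# base rate .59, negative posttest probability .20.
per_correct = cost_per_correct_identification(2000.00, true_pos=50, true_neg=40)
per_point = cost_per_npp_point(2000.00, base_rate=0.59, negative_posttest_prob=0.20)
print(round(per_correct, 2), round(per_point, 2))
```

Raising an error when the screener does not improve on the base rate mirrors the point made earlier that such a measure should be questioned rather than assigned a ratio.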
Results
Descriptive Statistics
Complete data were available for 110 students. The 4 missing data points were missing at random, Little’s Missing Completely at Random test χ2(5) = 5.391, p = .370. The Benchmark Assessment System was left-skewed and leptokurtic (peaked) at Level N (beginning of Grade 3), with very few students scoring higher than Level N and the left tail of the distribution going down to Level C. Scores on the other screeners were approximately normally distributed. Scores were statistically significantly lower on the Acadience Composite Score, t(110) = −4.35, p < .001, relative to the norm sample, but statistically similar to the norm sample on Acadience ORF, Star Reading, the Benchmark Assessment System, and the M-STEP.
Classification Accuracy and Multivariate CART Analysis
Classification accuracies for the screeners are provided in Table 3. None of the single measures on their own met recommended values of 90% sensitivity and 80% specificity (Jenkins et al., 2007) using the publisher-recommended cut-scores for calculations. These are provided for reference because the use of publisher-recommended cut-scores is the typical practice in schools.
Classification Accuracy for Four Approaches to Screening Using Cut-scores Optimized for the Current Sample.
Note. The base rate of non-proficiency for this sample was 59%. Classification accuracies should be compared across measures using cut-scores derived in the same way (CART-derived). Classification accuracies are provided for the publisher-recommended cut-scores as this approach reflects the typical practice of screening in schools and is a useful reference point demonstrating that the publisher-recommended cut-scores generate fewer correct decisions than do cut-scores derived on the actual sample. ORF = Oral reading fluency; CART = Classification and regression tree.
See Figure 1 in the online supplementary materials for the decision tree illustrating the CART-derived cut-scores.
Next, to compare accuracies across the three individual measures and a multivariate combination of measures, classification analyses were conducted using cut-scores derived and tested on the study sample using CART analysis (see Table 3). For the multivariate screening, CART identified two groups of at-risk students and one group of not-at-risk students, and found that two scores best characterized student performance (see online supplementary materials for the CART decision tree). Students in the not-at-risk category were characterized by scoring ≥345 on Star Reading and reading ≥81 words correctly on the Acadience ORF (n = 45). For the at-risk categories, there was one group that scored <345 on Star Reading (n = 51) and a group that scored ≥345 on Star Reading, but read <81 words correctly on Acadience ORF (n = 14). This model fit well, with a low misclassification rate: 12 students (10.9%) were misclassified. Other fit statistics included average square error = 0.09, entropy = 0.45, Gini = 0.18, residual sum of squares = 19.96.
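The resulting decision rule is simple enough to express as a short function. The cut-scores (345 on Star Reading, 81 words correct on Acadience ORF) come from the text and are specific to this sample; the function name and group labels are hypothetical.

```python
def classify(star_scaled, orf_wcpm, star_cut=345, orf_cut=81):
    """Apply the two-step rule from the multivariate tree (sample-specific cuts)."""
    if star_scaled < star_cut:
        return "at risk (low Star)"
    if orf_wcpm < orf_cut:
        return "at risk (adequate Star, low ORF)"
    return "not at risk"

print(classify(300, 95))
print(classify(400, 70))
print(classify(400, 95))
```

Because the two at-risk groups are reached by different branches, they may point to different instructional needs, which is the sense in which the tree identifies instructionally useful groupings.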
Overall, the sensitivity and specificity values were stronger for Acadience and the CART-identified multivariate approach relative to Star and the Benchmark Assessment System. Sensitivities were comparable at roughly 85% (Acadience) and 91% (multivariate), and specificities were identical at roughly 87%. Notably, the derived cut-score for Acadience selected only the ORF score rather than the other components of the composite.
Cost Analysis and Cost–Accuracy Analysis
Tables 4 to 6 present the costs per ingredient for Acadience Reading, Star Reading, and the Benchmark Assessment System, respectively. Results indicated that the total cost to implement the Acadience Reading Composite for the fall benchmark was US$1,921.94 and the per student cost was US$17.47. The majority of the costs were related to personnel time for training and coaching (57.48%), and administration and scoring (34.74%), with very few out-of-pocket financial costs for the program and materials (7.78%). For Star Reading, results indicated that the total cost for the fall benchmark was US$2,258.11 and the per student cost was US$20.34. A smaller percentage of the costs was related to opportunity costs for personnel time for training, purchasing the programs, and administration (29.72%). However, opportunity costs for training may vary across administrations, as educators may not need the same level of support prior to each year of screening. The majority of the costs were for the Chromebooks (70.28%). Supplementary analyses indicated that excluding the costs for the Chromebooks, which may reflect technology-rich contexts in which student computers are readily available for other purposes, resulted in a total cost of US$671.14 for the fall and US$6.05 per student. Finally, for the Benchmark Assessment System, results indicated that the total cost for the fall benchmark was US$5,602.50 and the per student cost was US$50.93. Most of the costs were for personnel time to administer and score the assessments (78.35% for the fall).
Costs Per Ingredient for Acadience Reading Composite Score.
MTSS = Multi-Tiered System of Supports.
Costs Per Ingredient for Star Reading.
MTSS = Multi-Tiered System of Supports.
Costs Analysis for Benchmark Assessment System.
MTSS = Multi-Tiered System of Supports.
Table 7 presents the results of the cost–accuracy analyses, which calculated the costs to correctly identify one student out of the entire sample (overall classification accuracy), as well as the costs to improve negative posttest probability in this context. Results suggested that Acadience Reading and Star Reading were the most cost-accurate options for correctly identifying students. Out of the total costs that it took to administer the screeners in the fall, it cost the school US$19.78 to correctly identify 1 student using Acadience Oral Reading Fluency and US$24.28 to correctly identify 1 student using Star Reading as either at risk or not at risk. Supplementary analyses suggested that Star Reading cost US$7.22 to correctly identify 1 student when excluding the costs of Chromebooks, which may reflect technology-rich environments where student computers are readily available for other purposes. The Benchmark Assessment System cost US$64.40 to correctly identify 1 student. Finally, analyses examining the cost–accuracy of the multivariate approach (Star Reading and Acadience ORF) suggested that it cost US$42.01 to correctly identify 1 student (overall classification accuracy). Table 7 also presents the cost–accuracy results for negative posttest probability. Acadience Oral Reading Fluency required the fewest resources (US$124.78) to improve negative posttest probability by 1% (stated another way, to lower the probability that the screener would fail to detect a student who was going to fail the year-end test). The multivariate approach was the second most cost-effective, costing US$196.28 to improve negative posttest probability by 1%.
Cost–Accuracy Results for Each Approach Using Cut-Scores Optimized for the Current Sample.
Note. Base rate of non-proficiency for these analyses was 59%. ORF = Oral reading fluency; TP = True Positive; TN = True Negative; FP = False Positive; FN = False Negative.
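The two cost–accuracy ratios described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the class and method names are hypothetical, the counts and costs in the example are invented round numbers rather than values from Table 7, and the baseline against which improvement in negative posttest probability is measured is left as a parameter because this section does not restate it.

```python
from dataclasses import dataclass


@dataclass
class ScreeningResult:
    """Total cost of a screener plus its 2 x 2 classification counts."""
    total_cost: float
    tp: int  # flagged, and not proficient at year end
    tn: int  # not flagged, and proficient at year end
    fp: int  # flagged, but proficient at year end
    fn: int  # not flagged, but not proficient at year end

    @property
    def n_correct(self) -> int:
        return self.tp + self.tn

    @property
    def negative_posttest_probability(self) -> float:
        # Share of students who passed the screener but failed the outcome.
        return self.fn / (self.fn + self.tn)

    def cost_per_correct_identification(self) -> float:
        return self.total_cost / self.n_correct

    def cost_per_point_of_npp_improvement(self, baseline_npp: float) -> float:
        # baseline_npp is an assumption supplied by the caller.
        points = (baseline_npp - self.negative_posttest_probability) * 100
        if points <= 0:
            raise ValueError("screener does not improve on the baseline")
        return self.total_cost / points


# Hypothetical screener: US$2,000 total, 100 students, 80 classified correctly.
example = ScreeningResult(total_cost=2000.0, tp=50, tn=30, fp=10, fn=10)
print(example.cost_per_correct_identification())
print(example.negative_posttest_probability)
print(example.cost_per_point_of_npp_improvement(baseline_npp=0.50))
```

In this hypothetical case, the screener costs US$25 per correctly identified student, and one in four students who pass the screener is later not proficient.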
Discussion
This study provides several considerations for schools weighing the opportunity costs of screeners against the value they obtain from them. We provided quantitative information for schools to directly compare the costs associated with several aspects of screeners (e.g., training) and provided cost–accuracy ratios to inform decision-making. Our results provide further evidence that schools and researchers should check the classification accuracy of their screeners (Schatschneider et al., 2008). When using the publisher-recommended cut-scores in this study, sensitivity was remarkably low, detecting only about three to six out of every 10 children who needed remedial instruction. Negative posttest probability was near chance for the Benchmark Assessment System, only marginally better than chance for Acadience and Star, and about three to four times the recommended maximum threshold of 10% (VanDerHeyden, 2013). Our results suggest that schools should request locally derived cut-points from vendors to improve classification accuracy, because base rates of proficiency vary across schools.
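The dependence of negative posttest probability on the local base rate follows from Bayes' rule, and a short Python sketch makes the point concrete. The sensitivity and specificity values below are hypothetical and chosen only for illustration. The 59% base rate is the one reported for this sample, and the 20% base rate stands in for a hypothetical school with far fewer non-proficient students.

```python
def negative_posttest_probability(
    sensitivity: float, specificity: float, base_rate: float
) -> float:
    """Probability that a student who passes the screener is not proficient.

    base_rate is the proportion of students who are not proficient
    on the year-end outcome.
    """
    missed = (1 - sensitivity) * base_rate      # false negatives
    cleared = specificity * (1 - base_rate)     # true negatives
    return missed / (missed + cleared)


# Same hypothetical screener, two different schools.
high_base_rate = negative_posttest_probability(0.50, 0.90, base_rate=0.59)
low_base_rate = negative_posttest_probability(0.50, 0.90, base_rate=0.20)
print(f"{high_base_rate:.3f} versus {low_base_rate:.3f}")
```

With identical sensitivity and specificity, the same cut-score leaves a much larger share of undetected students in the school with the higher base rate, which is why a cut-score tuned elsewhere may not transfer.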
In the current study, accuracy improved when sample-dependent cut-scores were derived from CART analysis and classification accuracies were then reported on the same sample. This improvement was expected because deriving thresholds and testing accuracy on the same sample inflates accuracy estimates (Jenkins et al., 2007). Results indicated that classification accuracy in all categories (except specificity) was highest for the multivariate approach, followed by Acadience ORF, with Star and the Benchmark Assessment System demonstrating trade-offs between the classification accuracy metrics. The multivariate approach correctly identified four more students and had a negative posttest probability of 13.2%, which was much closer to the acceptable rate of 10% than the negative posttest probability of 20.4% for ORF alone. This replicates past findings that multiple measures add accuracy. However, the cost was twice as much per student. Altogether, this suggests there is room for innovation to more efficiently capture the domains of reading that are not captured by ORF in Grade 3, such as oral language skills (Adlof & Hogan, 2019; Foorman et al., 2017). These may be the constructs to measure instead of retell or Maze, because retell and Maze did not provide additional useful information. Some commercially available assessments are beginning to measure both decoding and oral language (e.g., Lexia’s RAPID, Learning Ovation’s A2i). The economic trade-off between the longer administration time needed to assess both decoding and oral language and the benefit of differentiating which students need decoding and/or language instruction is an area ripe for research.
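To illustrate how a tree-based analysis arrives at a sample-dependent cut-score, the Python sketch below finds the single split on one screener score that minimizes weighted Gini impurity, which is the criterion a classification tree applies at its first node. It is a simplified stand-in for a full CART analysis and not the authors' model: the function names are hypothetical, and the scores and outcomes are invented for illustration.

```python
from typing import List, Tuple


def gini(labels: List[int]) -> float:
    """Gini impurity for binary labels (1 = not proficient at year end)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_cut_score(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (cut_score, weighted_impurity) for the best single split.

    Students scoring below cut_score form the flagged group. Candidate
    cuts are midpoints between adjacent distinct scores.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    best = (float("nan"), float("inf"))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold can separate identical scores
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = ((low + high) / 2.0, impurity)
    return best


# Hypothetical fall ORF scores and year-end outcomes for ten students.
orf = [42, 55, 61, 68, 74, 80, 88, 95, 103, 120]
not_proficient = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]
cut, impurity = best_cut_score(orf, not_proficient)
print(cut, round(impurity, 2))
```

Because the cut-score is chosen to fit these ten students, accuracy computed on the same ten students will be optimistic. Evaluating the cut-score on a separate sample is the usual way to gauge how much of the gain would hold up.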
Acadience was the most cost-effective single measure, but the score with the greatest accuracy was ORF alone, not the composite. For this grade level at this school, the additional metrics did not improve accuracy. In fact, use of the additional measures and scores worsened accuracy. This is not surprising given the limited construct validity of retell and Maze. Although retell and Maze tasks have higher face validity as reading comprehension metrics, ORF has higher validity coefficients with criterion reading comprehension performance (Good et al., 2019). This may be due to the higher burden of decoding demands compared with language comprehension in Grade 3 text (Tortorelli, 2019).
The individual costs detailed in Tables 4 through 6 may be very helpful for schools’ decision-making. For example, the high training costs suggest that schools should be wary of frequently switching assessment programs, particularly within one overarching approach (e.g., CBM or computer adaptive tests). Data suggest that switching would require significant resources with minimal improvements in classification accuracy. Indeed, multiple assessment systems are quite similar. As such, more evidence is needed regarding which systems are most effective and efficient for school implementation.
For informal reading inventories like the Benchmark Assessment System, our results aligned with prior research suggesting that informal reading inventories increase misclassification when used as a screener (Klingbeil et al., 2015; VanDerHeyden et al., 2018). Therefore, the Benchmark Assessment System should not be used as a screener to determine who needs intervention. Rather, the value of informal reading inventories may lie in the opportunity for teachers to listen to students read aloud and formatively adjust their instruction by focusing on specific decoding errors, addressing fluency concerns, or monitoring comprehension. In other words, the Benchmark Assessment System may be used as an informal diagnostic assessment within a comprehensive assessment system, giving teachers an opportunity to hear students read orally and answer comprehension questions. However, CBM offers teachers the same opportunity to listen to students read aloud. Schools may wish to consider having teachers, rather than a school-wide assessment team as was used in this study, administer CBM measures instead of informal reading inventories. This might also free up resources to invest in professional learning on how to effectively use formative assessment of oral reading to guide instruction (Heritage, 2008).
One additional consideration is the domains of reading measured by the assessments. All three assessments measured some predictor of reading comprehension, but each approach offered different information about students’ development in the domains of reading. The Star is most like the M-STEP; however, the Star score alone was not enough to capture students who needed remediation. Multivariate CART analyses suggested there were two groups of at-risk students in this school: (a) a group that needed remedial instruction in decoding, fluency, vocabulary, and/or reading comprehension, and (b) a group that only needed additional decoding and/or fluency-building instruction. It is possible that Star Reading misidentified a subgroup of students whose oral language proficiency compensated for their more limited decoding and oral reading fluency skills well enough to exceed the cut-score on Star, but not well enough to meet proficiency on M-STEP. These false negative errors may be more likely to occur in elementary school, when the text is simple enough that students can infer meaning and correctly answer vocabulary and comprehension questions even when they cannot decode all of the words or decode slowly. It is important to identify this group of students in Grades 2 and 3 (Fletcher et al., 2021) before texts become more complex. In later grade levels, these students will be detected on larger-grained assessments like Star. Models of reading fluency within the context of other reading domains support this interpretation that reading fluency is essential to measure for most students before Grade 4, when oral language domains (i.e., vocabulary and syntax) become more predictive for most students (Foorman et al., 2017).
Experts in evidence-based reading assessment emphasize the difference between screening, progress monitoring, and diagnostic purposes (Fletcher et al., 2021). Brief, inexpensive, and highly predictive metrics like ORF were originally designed to inform instructional decisions and progress monitoring (Deno, 2003). Studies suggest that ORF can be used for both screening and progress monitoring. Carefully designed computer adaptive tests can be used to progress monitor reading comprehension on a less frequent basis (Petscher et al., 2017). Some screening measures have good psychometric properties to serve both screening and diagnostic purposes (e.g., A2i [Connor, 2019] and FCRR Reading Assessment [Foorman et al., 2015]). These assessments more reliably identify the skills that need to be addressed within remedial instruction, in the same amount of time as inventories or less. Research confirms that there are three types of poor readers who have different instructional needs: students who need decoding fluency instruction, students who need language instruction, and students who need intensive instruction in both (Adlof & Hogan, 2019; Foorman et al., 2017). Future research is needed to determine where this diagnostic purpose best fits within an MTSS approach.
Limitations
The conclusions drawn from this study should be tempered by several limitations. First, the study included only one school with a relatively small sample, examined only a subset of available screening systems or programs, and had no fidelity data available. However, the sample size was comparable to other studies providing validity evidence of reading screeners (e.g., N = 184 Grade 3 students and 102 Grade 6 students; Good et al., 2019). The generalizability of the cost analyses and the classification accuracy may also be limited. For example, other schools may use screeners from different vendors, require more or less training, and incur costs to facilitate buy-in, which would alter the cost results (Barrett et al., 2020). Classification accuracy statistics and CART analyses are nonparametric analyses that apply specifically to the base rate of proficiency on the specific outcome used by this school and cannot generalize to a broader population. Furthermore, sources of error in CART analysis are still being explored. The at-risk subgroups of students identified in this study need to be replicated and explored in more robust analyses (e.g., Foorman et al., 2017). However, our results generally aligned with several prior studies, suggesting that future studies may not yield substantively different results.
Suggestions for Future Research
Most teachers already know the approximate level of their students’ reading development (e.g., VanDerHeyden et al., 2018) and derive more value from assessment that has direct implications for instruction (Ford & Opitz, 2008). Even when accurate screeners are used, many teachers are not provided with the knowledge or support to select evidence-based instruction that meets their students’ needs (Connor, 2019). Future research is needed in the space between screening and the delivery of remedial instruction. Furthermore, none of the assessments provided a reliable score for essential reading domains (e.g., vocabulary, morphology, syntax, text structure) that are necessary to improve reading comprehension skills, in addition to decoding (Connor, 2019; Truckenmiller & Brehmer, 2021). Future research is needed to determine if the addition of these domains could improve the accuracy and value of screening systems with minimal increases in time. Finally, this study only examined the cost–accuracy of fall screeners from specific vendors, although most schools administer screeners three times per year. Future research may wish to examine the cost–accuracy ratios of winter or spring screeners, include other types of screeners, or explore multi-gate approaches in which the different assessments are administered in a sequential fashion to subsets of students determined to be at risk (e.g., VanMeveren et al., 2020). These future studies are necessary to further understand reading screening systems and their implications for resource allocation and the delivery of remedial instruction.
Supplemental Material
Supplemental material files sj-docx-1-rse-10.1177_07419325231190809, sj-docx-2-rse-10.1177_07419325231190809, and sj-docx-3-rse-10.1177_07419325231190809 for Comparing the Cost–Accuracy Ratios of Multiple Approaches to Reading Screening in Elementary Schools by Courtenay A. Barrett, Lindy J. Johnson, Adrea J. Truckenmiller and Amanda M. VanDerHeyden in Remedial and Special Education
Footnotes
Acknowledgements
The authors acknowledge the considerable efforts of the participating school and comments on the manuscript from Dr. Margaret Kuklinski.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
Support was provided by Grant H325H190003 for Lindy J. Johnson, from the Office of Special Education Programs (OSEP).
Supplemental Material
Supplemental material is available on the Remedial and Special Education webpage with the online version of the article.
References
