Abstract
We examined the diagnostic accuracy and efficiency of three approaches to universal screening for reading difficulties using retrospective data from 1,307 students in Grades 3 through 5. School staff collected screening data using the Measures of Academic Progress (MAP), a curriculum-based measure (CBM), and running records (RR). The criterion measure was a high-stakes state accountability test aligned with the Common Core State Standards. We examined the diagnostic accuracy of the tests in isolation, as multivariate batteries, and via a simulated gated-screening approach. CBM and RR data resulted in unacceptable diagnostic accuracy across all three grades. In the fourth grade, the MAP alone resulted in the best balance of sensitivity and specificity. Among third- and fifth-grade students, the multivariate combination of MAP and CBM demonstrated the best balance between diagnostic accuracy and efficiency. Gated-screening increased specificity but lowered sensitivity. Results highlight the need for population-specific considerations in universal screening.
Quick and accurate identification of students who need academic support is essential for the effectiveness of tiered service delivery systems (Glover & Albers, 2007). Accordingly, the process of universal screening has been widely adopted in schools (D. Fuchs, Fuchs, & Compton, 2012). Conducting accurate and efficient universal screening requires several important decisions. Educators must decide which skills to assess, how to assess those skills (e.g., paper-based or computerized assessments; individual or group administrations), and the number of measures to use. Then, educators must select from a number of assessments marketed as universal screening tools and make decisions on how to use the resulting data to identify students needing additional support. Given the number of procedural decisions associated with universal screening, it is unsurprising that screening practices vary across schools (Jenkins, Schiller, Blackorby, Thayer, & Tilly, 2013; Prewett et al., 2012). Some resources exist to help educators select screening tools (e.g., Center for Response to Intervention’s screening tool charts), but there is less empirical research to guide schools in making procedural decisions around universal screening, particularly in upper elementary and middle school grades. For schools interested in using multiple assessments for screening purposes, additional research may be useful to quantify the relative benefits of different combinations of screening tests (including a consideration of testing efficiency). Thus, the purpose of this study was to examine the diagnostic accuracy of multiple screening measures and procedures for students in upper elementary grades via a retrospective analysis of screening data collected within a district’s multitiered system of support.
Evaluating Screening Measures
There are four possible results when educators use universal screening data to predict a dichotomous outcome (i.e., proficient/not-proficient). A true positive (TP) occurs when a student scores below the screening threshold (i.e., identified as at risk) and later fails the criterion test. A false positive (FP) occurs when a student scores below the screening threshold but passes the criterion test. A true negative (TN) occurs when a student scores above the screening threshold (i.e., identified as not at risk) and later passes the criterion test. Finally, a false negative (FN) occurs when a student scores above the screening threshold but later fails the criterion test. The distribution of students in each of the four categories can be used to calculate multiple metrics to evaluate the accuracy of screening measures.
Sensitivity, specificity, positive predictive value, and negative predictive value are common metrics of diagnostic accuracy. Each index ranges from 0 to 1, with values closer to 1 indicating better performance. Sensitivity is the proportion of students correctly identified as at risk for future problems (TP / [TP + FN]). Specificity is the proportion of students correctly identified as not at risk for future problems (TN / [TN + FP]). Positive predictive values (TP / [TP + FP]) represent the proportion of students identified as at risk who were truly at risk, whereas negative predictive values (TN / [TN + FN]) represent the proportion of students identified as not at risk who were truly not at risk. Unlike sensitivity and specificity, the proportion of students who fail the criterion test (i.e., base rate) heavily influences positive and negative predictive values. Recommended values for each of the aforementioned indices vary. Given the cost of FNs (i.e., a student does not receive needed intervention), researchers have generally prioritized high sensitivity (e.g., .85–.90) at the cost of small decreases in specificity (Johnson, Jenkins, Petscher, & Catts, 2009). There is some consensus that specificity values above .70 are desired (Kilgus, Methe, Maggin, & Tomasula, 2014).
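The four indices above follow directly from the counts of TPs, FPs, TNs, and FNs. The sketch below is a minimal illustration in Python; the class name, function name, and example counts are hypothetical and are not data from this study.

```python
from dataclasses import dataclass


@dataclass
class ScreeningCounts:
    """Counts of students in each cell of the 2 x 2 screening table."""
    tp: int  # flagged at risk, failed the criterion test
    fp: int  # flagged at risk, passed the criterion test
    tn: int  # not flagged, passed the criterion test
    fn: int  # not flagged, failed the criterion test


def diagnostic_accuracy(c: ScreeningCounts) -> dict:
    """Return the four accuracy indices plus the base rate of failure."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
        "base_rate": (c.tp + c.fn) / total,
    }


# Hypothetical counts for illustration only (not data from this study).
example = diagnostic_accuracy(ScreeningCounts(tp=40, fp=30, tn=120, fn=10))
```

Holding sensitivity and specificity fixed while changing the proportion of students who fail the criterion test changes the predictive values, which illustrates their dependence on the base rate.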
Although useful, these four diagnostic accuracy statistics do not inform probabilistic statements of success or failure for individual students. Posttest probabilities may provide a more useful picture of the accuracy of screening decisions (VanDerHeyden, 2011, 2013). Posttest probabilities represent the probability of failing the criterion after the results of a screening test are known. A positive posttest probability indicates the probability that a student who fails the screener will fail the criterion test. A negative posttest probability indicates the probability that a student who passes the screener will fail the criterion test. Calculating posttest probabilities requires estimation of the pretest probability of failing the criterion test (based upon historical data or observed base rates) and likelihood ratios derived from the sensitivity and specificity of screening measures (Deeks & Altman, 2004). More specifically, the pretest probability of failure is converted to pretest odds (i.e., probability / [1 − probability]), and the pretest odds are multiplied by either the positive likelihood ratio (i.e., sensitivity / [1 − specificity]) or the negative likelihood ratio (i.e., [1 − sensitivity] / specificity) to obtain posttest odds. The posttest odds are then converted back to a probability (i.e., odds / [1 + odds]). Existing recommendations suggest that positive posttest probability values should exceed .50 and negative posttest probability values should fall below .10 (VanDerHeyden, 2013).
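The posttest probability calculation can be sketched in a few lines of Python. The computation runs on the odds scale: the base rate is converted to pretest odds, multiplied by the relevant likelihood ratio, and converted back to a probability. The function name and the example values are hypothetical and chosen only for illustration.

```python
def posttest_probabilities(base_rate, sensitivity, specificity):
    """Return (positive, negative) posttest probabilities of failure.

    Assumes 0 < base_rate < 1, specificity > 0, and specificity < 1;
    otherwise a division by zero occurs.
    """
    pretest_odds = base_rate / (1 - base_rate)
    pos_lr = sensitivity / (1 - specificity)   # positive likelihood ratio
    neg_lr = (1 - sensitivity) / specificity   # negative likelihood ratio
    pos_odds = pretest_odds * pos_lr           # odds of failure after failing the screener
    neg_odds = pretest_odds * neg_lr           # odds of failure after passing the screener
    # Convert odds back to probabilities.
    return pos_odds / (1 + pos_odds), neg_odds / (1 + neg_odds)


# Hypothetical inputs: 25% base rate, sensitivity and specificity of .80.
pos_pp, neg_pp = posttest_probabilities(0.25, 0.80, 0.80)
```

With these illustrative inputs, the positive posttest probability is 4/7 (about .57) and the negative posttest probability is 1/13 (about .08), so both values would satisfy the .50 and .10 benchmarks.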
Screening in Upper Elementary Grades
Most research on universal screening in education has focused on students in early elementary grades. However, predicting student proficiency in upper elementary and middle school grades (i.e., Grades 3 through 8) requires unique considerations. First, the focus in upper grades is often on remediation rather than prevention (L. S. Fuchs, Fuchs, & Compton, 2010). Educators in upper elementary grades may be acutely interested in predicting student proficiency on year-end statewide achievement tests (Espin, Wallace, Lembke, Campbell, & Long, 2010). More specifically, screening measures are used to predict performance in 6 to 9 months rather than to predict reading difficulties 1 or more years later. Second, reading instruction moves from providing instruction in foundational skills to applying comprehension skills to make sense of increasingly complex texts (National Governors Association Center for Best Practices & Council of Chief State School Officers, 2010). In turn, to predict proficiency on end-of-year tests, screening measures in upper elementary grades may need to assess a wider range of more complex skills than those assessed in earlier grades on universal screening measures.
We found nine studies that examined the diagnostic accuracy of reading screening measures in upper elementary or middle school grades (Baker et al., 2015; Decker, Hixson, Shaw, & Johnson, 2014; Denton et al., 2011; McGlinchey & Hixson, 2004; Nese, Park, Alonzo, & Tindal, 2011; Shapiro, Solari, & Petscher, 2008; Speece et al., 2010; Stage & Jacobsen, 2001; Stevenson, Reed, & Tighe, 2016). Several trends emerged from this body of literature. First, researchers used high-stakes state assessments as the criterion measure in all but Speece et al. (2010). The cut-score was generally set to predict proficient (compared with below proficient) performance on these tests. Notably, we found no studies that used the newer Common Core State Standards–aligned tests as a criterion measure. Second, paper-based screening measures administered in individual (e.g., measures of oral reading fluency [ORF], MAZE) or group formats (e.g., Multiple Choice Reading Comprehension test) were more common than computer adaptive tests or teacher ratings. Third, researchers used publisher-provided cut-scores to establish risk on screening tools more commonly than local norms or statistically derived cut-scores. This was consistent with research on screening practices in applied settings (Mellard, McKnight, & Woods, 2009). Finally, single screening measures did not result in recommended diagnostic accuracy values with one notable exception (Shapiro et al., 2008). That is, single screening measures—when interpreted using publisher-provided cut-scores—tended to produce diagnostic accuracy results below the quality recommended for universal screening.
Using Multiple Screening Measures
Due in part to the suboptimal diagnostic accuracy of single screening tests, researchers have also examined the relative benefits of using multiple screening measures (L. S. Fuchs & Vaughn, 2012). Interpretation of multiple screening measures can be difficult. The two primary approaches for deploying and interpreting multiple screening measures are the multivariate and gated-screening approaches.
Multivariate screening methods combine information from multiple measures collected at one time. In this approach, data from multiple screening measures are entered into either a multiple logistic regression model to predict proficiency (Pass/Fail) on the end-of-year test or a multiple regression model to predict performance on the end-of-year test. After selecting the most parsimonious model, the predicted probabilities of attaining proficiency on the end-of-year test (for multiple logistic regression models) or composite scores (for multiple regression models) for each student are entered into a receiver operating characteristic (ROC) curve analysis to determine a cut-score with optimal sensitivity and specificity values (Catts, Fey, Zhang, & Tomblin, 2001).
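The cut-score step of this approach can be illustrated with a short Python sketch. It assumes the logistic regression has already been fit and that each student has a predicted probability of failing the end-of-year test (fitting the model is not shown). The selection criterion used here, Youden's J (sensitivity + specificity − 1), is an assumption made for illustration; the approach described above does not specify a particular optimality criterion. The function name and data are hypothetical.

```python
def optimal_cutpoint(risk_probs, failed):
    """Pick the probability threshold that maximizes Youden's J.

    risk_probs: predicted probability of failing the criterion test
    failed:     booleans, True if the student actually failed
    Students with risk_probs >= cut are flagged as at risk.
    Assumes at least one student failed and at least one passed.
    """
    n_fail = sum(failed)
    n_pass = len(failed) - n_fail
    best = None
    for cut in sorted(set(risk_probs)):
        flagged = [p >= cut for p in risk_probs]
        tp = sum(f and y for f, y in zip(flagged, failed))
        tn = sum((not f) and (not y) for f, y in zip(flagged, failed))
        sens = tp / n_fail
        spec = tn / n_pass
        j = sens + spec - 1
        # Strict ">" keeps the lowest cut among ties, which favors sensitivity.
        if best is None or j > best[0]:
            best = (j, cut, sens, spec)
    return {"cut": best[1], "sensitivity": best[2], "specificity": best[3]}


# Hypothetical predicted probabilities and outcomes (not study data).
probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.75, 0.90]
outcomes = [False, False, False, True, False, True, True, True]
result = optimal_cutpoint(probs, outcomes)
```

For these illustrative data, the selected threshold is .35, which yields a sensitivity of 1.0 and a specificity of .75. A threshold of .60 produces the same J value, and the tie is resolved in favor of the lower threshold, consistent with the general priority placed on sensitivity in screening.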
In upper grades, screening batteries that included measures of fluency, vocabulary, and comprehension have outperformed single screening measures (Baker et al., 2015; Decker et al., 2014; Shapiro et al., 2008; Speece et al., 2010). Some researchers observed a point of diminishing returns when adding measures to multivariate screening frameworks. For example, Speece and colleagues (2010) found that batteries consisting of two or three measures outperformed batteries with more measures.
As an alternative to delivering multiple screening measures to all students at the same time and using analytic methods to derive estimated probabilities or composite scores, gated screening represents a decision tree that prompts educators to evaluate student performance on a single screening measure before engaging in additional testing. In the gated-screening model, the first test is used primarily as a “rule out” mechanism (i.e., to identify students with little risk of future difficulty) and additional tests are used to “rule in” students for additional support. We found no published studies of gated screening for upper elementary students. Research conducted in early elementary grades suggested that collecting progress-monitoring data for students scoring below criterion on the first measure can improve the accuracy of screening decisions, whereas static follow-up measures did not improve classification accuracy (Compton et al., 2010; Compton, Fuchs, Fuchs, & Bryant, 2006).
Efficiency
Researchers have primarily focused on the relative improvements in diagnostic accuracy for multivariate or gated-screening approaches. Yet, deploying additional universal screening measures may not always improve diagnostic accuracy to a level that justifies the use of additional instructional time (Clemens, Keller-Margulis, Scholten, & Yoon, 2016; VanDerHeyden, 2013). Therefore, when evaluating the relative value of using multiple measures to make universal screening decisions, one should also consider the impact of those procedures on the efficiency of universal screening. In practice, educators must carefully consider whether the collection of additional assessment data is worth the associated burden of expending additional resources for testing (Jenkins, Hudson, & Johnson, 2007). If multivariate batteries and gated-screening approaches result in similar diagnostic accuracy, a major benefit of gated screening is the reduction in the number of students assessed with a full screening battery (i.e., only students who are flagged by the first test participate in additional testing).
Purpose
The purpose of this study was to examine the diagnostic adequacy and efficiency of three commonly used screening measures (i.e., a computer adaptive test, a curriculum-based measure [CBM], and an informal reading inventory) across three approaches to collecting screening data: (a) using a single screener, (b) using a multivariate screening battery, and (c) using a gated-screening process. The criterion measure was a summative, end-of-year state test that is aligned with the Common Core State Standards. We conducted a retrospective analysis of extant data from students in Grades 3 through 5 to answer the following research questions:
Method
Participants
We retrospectively analyzed data collected during the 2014-2015 school year in a large suburban district (Institute of Education Sciences [IES], 2006) in Wisconsin. All five elementary schools in the district collected universal screening data as part of their tiered system of supports. Per district policy, educators collected universal screening data for all students unless the student had an Individualized Education Plan (IEP) that designated an alternative assessment. There were no other exclusion criteria.
A total of 1,413 students in Grades 3 to 5 were enrolled in the district at some point during the 2014-2015 school year. In this sample, 71.3% of the students identified as White, 15.9% identified as Asian, 6.3% identified as Latino, 3.7% identified as two or more races, 2.3% identified as Black, and less than 1% identified as American Indian/Alaska Native or Native Hawaiian or other Pacific Islander. Approximately 9% of the students received free or reduced-price lunch, 6% received limited English proficiency services, and 9% received special education services.
We excluded a total of 106 students (35 in Grade 3, 46 in Grade 4, and 25 in Grade 5) from the analytic sample due to missing data. There were 46 students who were missing data on the outcome measure. An additional 20 students were missing data on all three screening measures and 40 students were missing data on one screening measure. Student race/ethnicity (χ2 = 4.46, p = .486), sex (χ2 = .750, p = .386), free or reduced-price lunch status (χ2 = .333, p = .564), and limited English proficiency status (χ2 = 4.10, p = .129) were not significantly associated with missing data. Special education status was significantly associated with missing data (χ2 = 21.14, p < .001). Students who were missing data from one or more screening measures scored higher on the outcome measure in Grades 3 through 5; however, the difference was statistically significant in third grade only, F(1, 433) = 5.05, p = .025. The final analytic sample (n = 1,307) included 414 students in Grade 3, 406 students in Grade 4, and 487 students in Grade 5.
Measures
Smarter Balanced summative assessment
The criterion measure was the Wisconsin version of the Smarter Balanced Assessment Consortium (SBAC) summative assessment in English and Language Arts. The SBAC assessment is a computerized adaptive test aligned with the Common Core State Standards (SBAC, 2016). According to the test publisher, most students completed the untimed test in approximately 90 min. The English and Language Arts test measures reading of both literary and informational text, writing, listening skills, and researching topics to investigate, integrate, and present information. Examples of reading standards across the three grade levels are shown in Table 1. Wisconsin standards also include foundational literacy skills such as knowing and applying grade-level word analysis skills to decode words and reading grade-level texts with sufficient accuracy and fluency to support comprehension. Items include a combination of multiple choice, short answer, and performance tasks.
Table 1. Examples of Wisconsin Reading Standards.
The SBAC (2016) published evidence supporting the technical adequacy of the 2014-2015 version of the assessment. Specifically, the SBAC presented data to support the content, internal structure, response processes, criterion, and predictive validity of the tool. For example, the SBAC presented evidence documenting test design procedures, scaling and equating procedures, and validation of the vertical alignment scaling. The SBAC also provided multiple sources of data that suggest the test yields sufficiently reliable and precise estimates of student ability at each testing occasion. Marginal reliability coefficients for the English and Language Arts scaled score exceeded .90 for Grades 3 through 5. Reported overall standard errors of measurement values were 27.7, 26.4, and 26.0 for Grades 3 through 5, respectively. Finally, the thresholds derived to assess student proficiency demonstrated consequential validity in predicting future performance and later outcomes (SBAC, 2016).
Student performance on the SBAC summative assessment was measured using a continuous, scaled score that ranges from 2,000 to 3,000. The SBAC also reports achievement levels that indicate student understanding of grade-level concepts. Achievement levels include below basic (i.e., minimal understanding), basic (i.e., partial understanding), proficient (i.e., adequate understanding), and advanced (i.e., thorough understanding). The Wisconsin Department of Public Instruction adopted the SBAC achievement levels for the 2014-2015 school year. We used cut-scores that differentiated between basic (i.e., partial understanding) and proficient achievement (i.e., adequate understanding). Cut-scores were 2,432 in the third grade, 2,473 in the fourth grade, and 2,502 in the fifth grade.
Measures of Academic Progress
The Measures of Academic Progress (MAP; Northwest Evaluation Association [NWEA], 2011) are computer adaptive interim assessments designed to measure student achievement longitudinally. The test developer recommends administering the MAP 3 to 4 times per year as a universal screening tool. The MAP includes multiple choice and short answer tasks. According to the technical manual (NWEA, 2011), estimates for internal consistency for the MAP reading test range from .70 to .86 across grades. Concurrent validity estimates with various state tests range from .57 to .82. In this sample, fall to spring test–retest reliability was r = .81, r = .83, and r = .83 in Grades 3 through 5, respectively.
MAP reading assesses students’ ability to comprehend literature and informational text as well as students’ foundational reading skills and vocabulary knowledge. Student performance is measured on a continuous, equal-interval vertical scale (NWEA, 2011). Scores range from 100 to 350 and are comparable across grade levels. According to the 2015 national norms (Thum & Hauser, 2015), the average MAP reading performance in fall was 188.3 (SD = 15.9) in the third grade, 198.2 (SD = 15.5) in the fourth grade, and 205.7 (SD = 15.1) in the fifth grade.
In 2015, the NWEA published a technical document that identified MAP cut-scores that corresponded to the Smarter Balanced summative assessment achievement levels. We used the fall MAP cut-score that, according to the NWEA (2015), differentiated between basic and proficient performance on the SBAC summative assessment. Cut-scores were 194 in the third grade, 203 in the fourth grade, and 209 in the fifth grade.
ORF
ORF was measured using grade-level AIMSweb probes. Student performance was expressed as the median number of words read correctly per minute (WRCM) from three benchmark passages. The AIMSweb technical manual (Pearson Inc., 2012) indicates alternate-form reliability and test–retest reliability of r = .94. In this sample, fall to spring test–retest reliability was r = .84, r = .85, and r = .89 in Grades 3 through 5, respectively. There is evidence of moderate relationships between AIMSweb probes and state-test performance, norm-referenced reading assessments, and MAP scores (Pearson Inc., 2012). AIMSweb developed cut-scores using data from 32,000 students in 20 states (Pearson Inc., 2011). According to the test developer, student scores at the 45th percentile were generally associated with an 80% success rate of passing the state tests. The AIMSweb cut-score study did not include the more recent SBAC tests; therefore, we set the cut-score as the 45th percentile based on the AIMSweb national norms. The cut-scores were 82 WRCM in the third grade, 103 WRCM in the fourth grade, and 115 WRCM in the fifth grade.
Running records (RR)
The Teachers College Reading and Writing Project (2014) RR were also used as a screening measure. There are assessment systems corresponding with Levels A to K and Levels L to Z. According to published administration guidelines, evaluators assign independent reading levels (denoted by letters) based on students’ accuracy, fluency, expression, and comprehension (Teachers College Reading and Writing Project, 2014). Comprehension is estimated using oral retell and four open-ended comprehension questions. RR assessments took approximately 15 to 30 min. We converted RR levels to numerical values for all analyses (i.e., A = 1, B = 2, etc.).
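The letter-to-number conversion described above (A = 1, B = 2, etc.) can be sketched in Python as follows. The function name is hypothetical, and the input validation is an added assumption rather than part of the published procedure.

```python
def rr_level_to_number(level: str) -> int:
    """Map a running record level (A to Z) to 1 to 26, so A = 1 and B = 2."""
    level = level.strip().upper()
    if len(level) != 1 or not ("A" <= level <= "Z"):
        raise ValueError(f"Expected a single letter A-Z, got {level!r}")
    return ord(level) - ord("A") + 1


# Levels M, P, and S map to 13, 16, and 19.
converted = [rr_level_to_number(x) for x in ["M", "P", "S"]]
```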
We found no published reliability or validity evidence for RR scores. In this sample, test–retest reliability from fall to spring was r = .83, r = .90, and r = .82 in Grades 3 through 5, respectively. Students’ performance on the RR was strongly correlated with the MAP (r = .77), which provides some evidence of criterion-related validity. The publisher provided cut-scores based on the average RR levels of the majority of readers who achieved that score on the New York state accountability test (Teachers College Reading and Writing Project, 2012). In the absence of other guidance, we used the publisher-recommended cut-scores that differentiated between proficient (i.e., met grade-level standards) and partially proficient achievement (i.e., did not meet grade-level standards). Cut-scores were Level M (13) in the third grade, Level P (16) in the fourth grade, and Level S (19) in the fifth grade.
Procedure
School district staff administered the screening measures in September and October of 2014, and students completed the state test in April of 2015. Universal screening data were collected as part of the ongoing multitiered system of support used in the district. Educators used these data to identify students needing supplemental supports. School staff administered the MAP assessment following district protocols, with students typically completing the assessment in 30 to 90 min per class. A district assessment team administered three curriculum-based measurement of reading (CBM-R) benchmarking probes to each student using standardized AIMSweb procedures. The administration time for CBM-R assessments was approximately 5 min per student. Classroom teachers administered the RR to students in their classrooms with each assessment taking approximately 15 to 30 min per student. A reasonable estimate for the entire screening battery is 50 to 125 min. We provide estimates for the various combinations of measures in Table 2. No fidelity data or interobserver agreement data were available for any measure.
Table 2. Estimated Time Spent on Assessment for Each Screening Method for One Screening Period.
Note. Class size was based on the district average. The number of students needing additional assessment was based on the MAP proficiency rates for students in this study. MAP = Measures of Academic Progress; CBM-R = curriculum-based measurement of reading; RR = running records.
The district required that literacy instruction occur for 90 min per day. School staff identified students in need of supplemental interventions based on student performance on screening measures. Educators implemented reading interventions with 63 students (5% of total) including 30 third-grade students (7%), 20 fourth-grade students (5%), and 13 fifth-grade students (3%). Data regarding student performance during the intervention were not available. However, 50 of the 63 students who received interventions did not demonstrate proficiency on the state test. This suggests that existing supplemental supports were not highly influential on diagnostic accuracy outcomes (i.e., increased number of FPs).
Data Analysis
We used extant data to examine the accuracy of three types of screening processes. First, we examined the diagnostic accuracy for decisions based upon a single screening measure. Second, we examined the use of multivariate screening batteries (i.e., combinations of the three screening tests). Third, we examined decisions based on a simulated gated-screening procedure. In a gated-screening approach, students who fail the first screening measure receive further assessment. All students in this sample completed the three screening assessments during the fall. To simulate the gated process, we analyzed the results of the MAP first. If a student scored above the MAP benchmark, we identified the student as not at risk without considering any additional information. If a student failed the MAP, we used additional assessment information (e.g., ORF, RR) to determine the student’s risk status. We considered students at risk if they failed all additional screening measures in the model.
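The simulated gated-screening rule can be expressed as a short Python sketch. The function and variable names are hypothetical. The sketch assumes that a score at or above a cut-score counts as passing and a score below it counts as failing; the handling of scores exactly at the cut-score is an assumption. The cut-scores in the example are the third-grade values reported in the Measures section (MAP = 194, ORF = 82 WRCM, RR = Level M, coded 13); the student scores are invented for illustration.

```python
def gated_at_risk(map_score, map_cut, followup_scores, followup_cuts):
    """Return True if a student is classified as at risk under gated screening.

    Gate 1: a student at or above the MAP cut-score is not at risk,
            and no follow-up data are considered.
    Gate 2: a student below the MAP cut-score is at risk only if every
            follow-up measure in the model is also below its cut-score.
    """
    if map_score >= map_cut:
        return False
    return all(followup_scores[name] < cut for name, cut in followup_cuts.items())


# Third-grade cut-scores reported in the Measures section.
cuts = {"ORF": 82, "RR": 13}

# Hypothetical students (scores invented for illustration).
passed_map = gated_at_risk(200, 194, {"ORF": 60, "RR": 10}, cuts)     # ruled out at Gate 1
failed_all = gated_at_risk(185, 194, {"ORF": 70, "RR": 11}, cuts)     # fails every measure
failed_partial = gated_at_risk(185, 194, {"ORF": 90, "RR": 11}, cuts)  # passes ORF
```

Because a student must fail every measure in the model to be flagged, adding follow-up measures can only keep or reduce the number of students identified as at risk, which is consistent with a procedure that raises specificity at the expense of sensitivity.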
We conducted all statistical analyses using R (R Core Team, 2015). First, we obtained descriptive and correlational statistics for each measure across each grade. Second, we applied the publisher-recommended cut-scores for evaluating the performance of single measures and gated-screening models. We used logistic regression to analyze the multivariate batteries, which requires deriving cut-scores based on the predicted probability values for each student. To choose an optimal balance of sensitivity and specificity, we calculated cut-scores for the logistic regression models using the OptimalCutpoints package for R (Lopez-Raton, Rodriguez-Alvarez, Suarez, & Sampedro, 2014). Third, we calculated sensitivity, specificity, negative predictive values, and positive predictive values for each measure. We estimated approximate confidence intervals (CIs) for the sensitivity and specificity values using the formula for the standard error of a proportion (Harper & Reeves, 1999):
$$SE = \sqrt{\frac{pq}{n}}$$
where p is the sensitivity or specificity, q is 1 minus p, and n is the sample size. Fourth, we calculated the positive and negative posttest probabilities for each screening method. We estimated pretest odds based on the prevalence of failure on the outcome measure.
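As an illustration of the CI calculation, the Python sketch below applies the standard error of a proportion, the square root of pq / n, and builds an approximate 95% interval as p ± 1.96 × SE. The 1.96 multiplier (normal approximation) is an assumption, although it reproduces the intervals reported in the Results. The function name is hypothetical.

```python
import math


def proportion_ci(p, n, z=1.96):
    """Approximate CI for a proportion p estimated from n students.

    Uses SE = sqrt(p * q / n) with q = 1 - p; z = 1.96 gives a 95% interval
    under the normal approximation (an assumption of this sketch).
    """
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se


# Third-grade MAP sensitivity of .63 with n = 414 students.
low, high = proportion_ci(0.63, 414)
```

Rounded to two decimals, this yields [.58, .68], which matches the interval reported for third-grade MAP sensitivity.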
Results
Descriptive Statistics
Students in the district demonstrated high achievement on all measures in all three grades (see Table 3). Average student performance on the Smarter Balanced assessment in the present sample was above the 67th percentile for the state in all grades. Approximately 25% of students did not perform in the proficient range on the state test in Grades 3 and 4, and approximately 17% of students did not perform in the proficient range in Grade 5. On average, student performance was at or above the 69th percentile across all three grades for MAP when compared with published national norms (Thum & Hauser, 2015). Similarly, the average ORF score was at or above the 66th percentile based on the AIMSweb national norms. Published national norms for the Teachers College Running Records do not exist. However, the average student performance on the RR was above the publisher-recommended cut-score for each grade (Teachers College Reading and Writing Project, 2012). Visual analysis of histograms and the resulting skew and kurtosis values indicated that scores from each measure reasonably approximated a normal distribution.
Table 3. Descriptive and Correlational Statistics for Each Measure.
Note. All correlations significant at the p < .01 level. Sample size for each grade was as follows: third = 414, fourth = 406, and fifth = 487. SB = Smarter Balanced; MAP = Measures of Academic Progress; CBM-R = curriculum-based measurement of reading; RR = running records; ORF = oral reading fluency.
Correlations between MAP scores and Smarter Balanced assessment scores were strong in all three grades (see Table 3). There were moderate correlations between the ORF scores and the Smarter Balanced assessment with the lowest correlation found in Grade 5. Correlations between the RR scores and the Smarter Balanced assessment scores followed a similar pattern. Despite the differences in format, there were strong correlations between the three screening measures (r > .70) in the third and fourth grades. There were moderate correlations between the screening measures in the fifth grade.
Question 1: Diagnostic Accuracy of Single Measure and Multivariate Screening Procedures
Single-test screening
The sensitivity and specificity of MAP (.63, 95% CI = [.58, .68]; .83, 95% CI = [.79, .87]) and RR (.66, 95% CI = [.61, .71]; .84, 95% CI = [.80, .88]) scores were similar in the third grade (see Table 4). MAP scores had the highest sensitivity in the fourth (.86, 95% CI = [.83, .89]) and fifth (.75, 95% CI = [.71, .79]) grades. ORF scores had the lowest sensitivity values (range = .41–.61) but the highest specificity values in each grade (range = .88–.90). In the third grade, RR and MAP resulted in a similar balance of sensitivity and specificity. However, MAP had the best balance of sensitivity and specificity in the fourth and fifth grades. With the exception of MAP in the fourth grade, the single screening measures failed to demonstrate acceptable sensitivity and specificity.
Table 4. Diagnostic Accuracy Results for Reading Performance.
Note. Multivariate analyses were conducted using logistic regression. BR = base rate of failure; TP = true positive; FP = false positive; TN = true negative; FN = false negative; SE = sensitivity; SP = specificity; CI = confidence interval; PPV = positive predictive value; NPV = negative predictive value; MAP = Measures of Academic Progress; CBM-R = curriculum-based measurement of reading; RR = running records.
Estimated posttest probabilities are shown in Table 5. In the third grade, all three measures had positive posttest probabilities above the .50 benchmark, but negative posttest probabilities were larger than the recommended benchmark of .10. Only MAP scores in the fourth grade met the criteria for both positive and negative posttest probability. None of the measures in the fifth grade met the recommended criteria for positive posttest probability; however, RR and MAP scores met criteria for negative posttest probability. CBM generally had higher positive and negative posttest probabilities than RR scores.
Table 5. Posttest Probability Results by Screening Measure and Grade.
Note. Multivariate analyses were conducted using logistic regression. BR = base rate of failure; Pretest odds = pretest odds of failure; Pos LR = positive likelihood ratio (SE / [1 − SP]); Neg LR = negative likelihood ratio ([1 − SE] / SP); +PP = positive posttest probability; −PP = negative posttest probability; MAP = Measures of Academic Progress; CBM-R = curriculum-based measurement of reading; RR = running records; SE = sensitivity; SP = specificity.
Multivariate screening
We examined the added benefit to diagnostic accuracy associated with a multivariate battery using a series of sequential logistic regression models (see Tables 4 and 5). Because each model yields a predicted probability of failure rather than a test score, we derived cut-scores for the resulting probabilities. We evaluated three possible combinations: (a) MAP and CBM, (b) MAP and RR, and (c) MAP, CBM, and RR. The diagnostic accuracy of the multivariate combinations was similar across all grades; therefore, we focus on differences from the single measures.
In the third and fifth grades, the multivariate batteries resulted in substantial increases in sensitivity and minor decreases in specificity compared with the single screening measures (see Tables 4 and 5). For example, the combination of MAP and CBM in the third grade improved the sensitivity by .13 and .28, and decreased the specificity by .08 and .09, relative to MAP or CBM alone. Results slightly favored the combination of MAP and RR in the third grade (although none of the combinations demonstrated high sensitivity) and MAP and CBM in the fifth grade.
A different pattern emerged in the fourth grade. More specifically, the multivariate batteries improved the sensitivity from RR (.13 or .14) or CBM alone (.23 or .24) with minimal changes to specificity. It appears that the improvements were largely due to the inclusion of the MAP scores. In fact, the multivariate models resulted in a slight decrease in the sensitivity (.01 or .02) and a slight increase in the specificity (.01 or .02) from MAP scores alone.
When examining the posttest probabilities, there did not appear to be any meaningful differences between the multivariate batteries. In the third and fourth grades, the multivariate batteries met the criterion proposed by VanDerHeyden (2013) for positive and negative posttest probabilities. The combination of MAP and RR or MAP and CBM improved the negative posttest probability from MAP alone in the third grade. However, the addition of CBM or RR scores did not appreciably improve the posttest probabilities derived with the MAP scores in the fourth or fifth grade.
Question 2: Diagnostic Accuracy of Gated-Screening Procedures
Our second research question examined the diagnostic accuracy of a gated-screening approach. We used the publisher-recommended cut-scores throughout the analyses. To simulate a gated approach, we analyzed the MAP results first. For students who performed in the at-risk range on the MAP, we analyzed (a) CBM scores, (b) RR scores, or (c) CBM and RR scores. We identified students as at risk if they performed in the at-risk range on all the measures under consideration. Full results are shown in Tables 4 and 5.
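The gating rule described above (flag a student only if the student is in the at-risk range on the first measure and on every follow-up measure) can be sketched in Python as follows. The ten student records are hypothetical and exist only to show the mechanics; the function names are illustrative, not part of the study's analysis code.

```python
def gated_flags(first_gate, follow_ups):
    """Flag a student only if at risk on the first gate and on every follow-up."""
    flags = []
    for i, at_risk in enumerate(first_gate):
        # Students who pass the first gate receive no further testing.
        flags.append(at_risk and all(measure[i] for measure in follow_ups))
    return flags


def sensitivity_specificity(flags, failed):
    """Return (SE, SP) given screening flags and criterion outcomes."""
    tp = sum(1 for f, y in zip(flags, failed) if f and y)
    fn = sum(1 for f, y in zip(flags, failed) if not f and y)
    tn = sum(1 for f, y in zip(flags, failed) if not f and not y)
    fp = sum(1 for f, y in zip(flags, failed) if f and not y)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical data: True = failed the state test / scored in the at-risk range.
failed   = [True, True, True, True, False, False, False, False, False, False]
map_risk = [True, True, True, False, True, True, False, False, False, False]
rr_risk  = [True, True, False, True, True, False, False, False, True, False]

gated = gated_flags(map_risk, [rr_risk])
se_map, sp_map = sensitivity_specificity(map_risk, failed)
se_gated, sp_gated = sensitivity_specificity(gated, failed)
print(se_map, sp_map)
print(se_gated, sp_gated)
```

Because every student flagged by the gated rule was also flagged by the first measure, gating with fixed cut-scores cannot raise sensitivity above that of the first gate and cannot lower specificity below it.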
The impact of gated screening on sensitivity and specificity was similar across all grades. Gated-screening models led to larger numbers of FNs, which was reflected in substantial decreases in sensitivity relative to MAP or RR alone and to the multivariate screening batteries. A gated procedure of testing all students with MAP, followed by administering RR to students who failed the MAP, resulted in higher sensitivity than CBM alone (range = +.01 to +.13). However, the increased sensitivity failed to reach acceptable levels. Using MAP and RR had the highest sensitivity among the gated approaches in each grade. Gated approaches using MAP and CBM or all three measures resulted in unacceptable sensitivity in the third (.34 and .36) and fifth grades (.39 and .33).
The gated-screening models resulted in substantial increases in specificity relative to the single measures and the multivariate screening batteries (>.90; see Table 4). There were fewer FPs at the cost of increasing the numbers of FNs. Administering the MAP and then CBM and RR to students who failed the MAP resulted in the highest specificity. However, the differences between this model and the gated models with CBM or RR alone as the follow-up measure were relatively small.
A similar pattern emerged when considering the posttest probabilities. Across all three grades, the gated-screening procedures resulted in higher positive posttest probabilities than the single measure or multivariate models. Administering the CBM and RR to students who failed the MAP resulted in the highest positive posttest probabilities. In contrast, using any of the gated models resulted in inferior negative posttest probabilities relative to the single measures or the multivariate batteries. Only the use of MAP and then RR in the fifth grade resulted in acceptable positive and negative posttest probabilities based on VanDerHeyden’s (2013) criteria. Taken together, the probability of a student going on to fail the state test after being identified as at risk using the gated-screening procedures was high (i.e., .64–.84), but the probability of a student who was not flagged by the gated-screening procedures going on to fail the state test was generally above an acceptable threshold (.09–.19).
Discussion
We conducted a retrospective analysis of the diagnostic accuracy of three common screening tools used in upper elementary grades. We also examined the differential utility of three multivariate batteries and three gated-screening models. Results suggest that the most accurate method differed by grade. In the third grade, the multivariate battery led to the highest sensitivity while maintaining acceptable specificity; however, none of the multivariate batteries resulted in sensitivity levels that met previously recommended standards (i.e., .85–.90). In the fourth grade, MAP scores resulted in the best balance of sensitivity and specificity. In the fifth grade, MAP alone slightly underperformed relative to multivariate models.
Sufficiency of Single Screening Measures
Schools commonly employ one universal screening measure (Jenkins et al., 2013; Prewett et al., 2012), despite previous research indicating that single measures rarely result in acceptable diagnostic accuracy. Results of this study corroborate and extend previous research. CBM resulted in high specificity but unacceptable sensitivity for use in schools. Results for CBM scores are generally consistent with previous findings of suboptimal diagnostic accuracy values in upper elementary grades (e.g., Denton et al., 2011; Nese et al., 2011; Stage & Jacobsen, 2001). One explanation for this finding could be that the predictive validity of oral reading rate tends to decrease as students mature (Yovanoff, Duesbery, Alonzo, & Tindal, 2005). Another explanation could be that the publisher-recommended cut-scores were inappropriate for the high-achieving nature of this sample. Baker and colleagues (2015) found higher sensitivity (.76 and .79) and specificity (.79 and .85) when statistically deriving optimal cut-scores for students in Grades 7 and 8, respectively.
We found no studies examining the use of informal reading inventories (such as RR) as screening measures in upper elementary grades. Similar to CBM, RR scores alone did not meet recommended diagnostic accuracy values. However, RR did appear to outperform CBM scores when considering sensitivity and specificity together.
MAP scores demonstrated the most promise as a single screening measure in the fourth and fifth grades. Given the similar results observed for MAP and multivariate models in Grade 5, it is plausible that MAP alone would have been sufficient if we had statistically derived a cut-score that set sensitivity at an acceptable level (≥.90). This is somewhat unsurprising as the skills measured by the MAP had the highest overlap with the SBAC assessment, and NWEA conducted preliminary research to align MAP performance with the results from the field test of the SBAC (NWEA, 2015).
Important differences emerged when comparing the results in the third grade with the results in the fourth and fifth grades. None of the single measures approached acceptable diagnostic accuracy for third-grade students. A potential hypothesis for the observed differences between grades relates to the importance of foundational reading skills such as fluency. Previous research suggests that the relationship between fluency and comprehension wanes in upper elementary grades (Jenkins & Jewell, 1993; Silberglitt, Burns, Madyun, & Lail, 2006). More research is needed to compare the utility of computer adaptive tests with other measures in upper elementary and middle school grades.
Multivariate Versus Gated-Screening Approaches
In the third and fifth grades, combining CBM or RR scores with the MAP in a multivariate approach increased the sensitivity of the screening decisions relative to single measures in isolation. In particular, the combination of MAP and CBM appeared to be the most favorable screening approach after considering the similar diagnostic accuracy and the enhanced efficiency. The marked improvement in diagnostic accuracy associated with the multivariate batteries provides some evidence for the promise of that approach in schools. However, there are two important limitations of multivariate batteries. First, the use of multiple measures requires devoting additional time to assess all students. Second, interpreting information from multiple screening tests can be difficult and requires the use of complex statistical analyses. A more efficient approach might include a gated-screening process in which a student participates in additional assessments only after performing in the at-risk range on the first measure. We simulated this process using MAP as the first measure followed by one or both individually administered measures.
Consistent with previous studies of gated-screening procedures (e.g., Compton et al., 2010; Van Norman, Nelson, & Klingbeil, 2016), administering additional measures increased the specificity by reducing the number of FPs. Students identified as at risk by the gated-screening procedure had a high probability of going on to fail the state test. Yet, gated-screening approaches lowered sensitivity and resulted in a large number of FNs. Our use of publisher-provided cut-scores could have affected the observed sensitivity values. In previous studies of gated screening, researchers held sensitivity constant at an acceptable level by statistically deriving cut-scores (e.g., Compton et al., 2010). We used cut-scores provided by the publisher to reflect an approach that is more common in practice (Mellard et al., 2009). These findings suggest that the gated-screening procedure may result in unacceptably low sensitivity values if educators do not derive optimal cut-scores.
Considerations Regarding Efficiency
Each iteration of the multivariate batteries provided generally similar results. Thus, it seems reasonable to select among the screening measures evaluated in this study based on efficiency. We estimated the instructional time necessary to use each assessment based on the estimated time required to deliver each assessment and the district average for students per class (n = 18; see Table 2). The combination of MAP and CBM scores would provide similar diagnostic accuracy relative to the models that included RR, and would have saved between 270 and 540 min of instructional time in the average classroom. The collection of CBM or RR data would not have appreciably improved universal screening decisions for fourth-grade students.
Comparisons between the multivariate and the best performing gated-screening approach (MAP and then RR) favored the multivariate approach in terms of diagnostic accuracy and efficiency. Based on the MAP performance of students in this sample, we estimated that five of 18 students in an average classroom would need additional assessment. Collecting the additional RR scores would have taken a similar amount of time (60 to 150 min) as collecting CBM data for all 18 students (90 min) at the outset. As stated above, the minimal differences between the multivariate batteries and the MAP itself in the fourth grade call the need for additional measures into question; however, this finding requires further replication.
Limitations and Future Directions for Research
These results should be interpreted in light of several limitations. First, these data were collected in a high-performing, large suburban district in Wisconsin. The district served a small percentage of students from demographic groups commonly identified as at risk for reading difficulties (i.e., students identified as Black, Hispanic, eligible for free or reduced-price lunch, or eligible for limited English proficiency services). Given the homogeneity of the sample, we did not cross-validate our results on a random subset of students. A major purpose of cross-validation is to enhance the generalizability of the findings (Clemens et al., 2016; Jenkins et al., 2007). Cross-validation using the students from this sample would not support such inferences. Instead, further replication using diverse samples with varying base rates of proficiency is needed. Prospective studies, conducted with larger samples, are necessary to further examine whether student-level characteristics moderate the relationship between screening performance and criterion performance (e.g., Baker et al., 2015; Stevenson, Reed, & Tighe, 2016).
Second, we conducted a retrospective analysis using extant data. Retrospective analyses may better reflect actual practice but are subject to multiple threats to validity (Bossuyt et al., 2003). For example, information on administration fidelity was unavailable. Any deviations from standardized administration procedures may have influenced the quality of the screening data, which may have influenced these results. We were also unable to control for intervention status. Although district records allowed us to identify students who received a reading intervention, information about the type of intervention or intervention dosage was not available. The delivery of effective interventions based on screening results may influence obtained diagnostic accuracy estimates for screening measures (i.e., increasing the number of FPs). Future prospective research could better control for such threats to validity.
Third, our use of logistic regression to assess the multivariate batteries required us to (a) save the predicted probabilities derived from the regression model and (b) enter the predicted probabilities into an ROC analysis to select a cut-score. We selected a cut-score that simultaneously optimized sensitivity and specificity rather than a cut-score that maximized sensitivity (e.g., Compton et al., 2010). We thought this approach would better reflect the use of publisher-provided cut-scores; however, this assumption could be studied further. A gated-screening approach, using optimized cut-scores, may have resulted in diagnostic accuracy similar to that of the multivariate approach. In fact, there is evidence that locally derived cut-scores improve diagnostic accuracy relative to those specified by the publisher (e.g., Baker et al., 2015; Van Norman et al., 2016). Future research could compare the results of gated-screening and multivariate screening approaches when using optimized cut-scores for both procedures.
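To make the two cut-score strategies contrasted above concrete, the following Python sketch takes a set of predicted probabilities, evaluates every candidate cut-score, and then selects a cut in two ways. Using Youden's J (SE + SP − 1) to stand in for "simultaneously optimized" is an assumption, since the text does not name the index used. The predicted probabilities, the outcomes, and the .90 sensitivity floor are hypothetical.

```python
def roc_points(probs, failed):
    """Return (cut, SE, SP) for each candidate cut; flag a student if prob >= cut."""
    n_pos = sum(failed)
    n_neg = len(failed) - n_pos
    points = []
    for cut in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, failed) if p >= cut and y)
        fp = sum(1 for p, y in zip(probs, failed) if p >= cut and not y)
        points.append((cut, tp / n_pos, (n_neg - fp) / n_neg))
    return points


def youden_cut(points):
    """Cut that maximizes SE + SP - 1 (assumed stand-in for joint optimization)."""
    return max(points, key=lambda t: t[1] + t[2] - 1)


def sensitivity_first_cut(points, min_se=0.90):
    """Cut with the highest SP among those that keep SE at or above min_se."""
    eligible = [t for t in points if t[1] >= min_se]
    return max(eligible, key=lambda t: t[2])


# Hypothetical predicted probabilities of failure and observed outcomes.
probs  = [0.90, 0.80, 0.70, 0.35, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
failed = [True, True, True, True, False, False, False, False, False, False]

points = roc_points(probs, failed)
balanced = youden_cut(points)
sensitive = sensitivity_first_cut(points)
print(balanced)
print(sensitive)
```

In this toy example the jointly optimized cut misses one of the four students who failed, whereas the sensitivity-first cut catches all four at the cost of more false positives. This is the trade-off at issue when comparing the two strategies.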
Fourth, the district used three common methods for collecting universal screening data. These results do not generalize to other empirically supported universal screening assessments, such as teacher ratings of academic performance, progress-monitoring data, or previous performance on state tests (Denton et al., 2011; Nelson, Van Norman, & Lackner, 2016; Speece et al., 2010). In this study, the high-stakes state assessment (and administration window) changed for the 2014–2015 school year. Future research is needed to determine whether these and other sources of screening information can improve diagnostic accuracy when the criterion is a Common Core State Standards–aligned test. Future work in this area could also quantify instructional efficiency to determine whether observed improvements in diagnostic accuracy warrant losses of instructional time.
Conclusion
Educators should consider recommendations for assessment practices within the context in which they use the data. Rigid recommendations for using one screening approach over another may be harmful if educators do not consider the characteristics of the students they serve and the resources available within their system. Educators interested in evaluating their current screening procedures (e.g., single measure, multivariate, or gated screening) could use methods similar to those used in this study. Across all methods, it seems schools may benefit from the creation of optimized cut-scores, as screening measures may not demonstrate the same diagnostic accuracy reported in the literature when publisher-recommended cut-scores are adopted. Unfortunately, it may be difficult for schools to evaluate their local screening procedures without additional training. Developing methods to help educators evaluate their local screening procedures may provide greater benefit than striving to develop universal guidelines or promoting publisher-recommended cut-scores. By increasing the feasibility of conducting local analyses, educators may be more likely to select tests, cut-scores, and procedures that meet psychometric criteria and fit the contextual realities of their schools.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
