Evaluation of the DIBELS (Sixth Edition) Diagnostic System for the Selection of Native and Proficient English Speakers at Risk of Reading Difficulties

Abstract

This comprehensive evaluation of the Dynamic Indicators of Basic Early Literacy Skills Sixth Edition (DIBELS6) set of measures gives a practical illustration of signal detection methods, the methods used to determine the value of screening and diagnostic systems, and offers an updated set of cut scores (decision thresholds). Data were drawn from a sample of 13,507 English-proficient students in kindergarten through Grade 3, with more than 4,500 students per grade level. Results indicate that most DIBELS6 measures accurately predict comprehensive test performance and that previously published decision thresholds for DIBELS6 are generally appropriate with some key exceptions. For example, the performance of phoneme segmentation fluency did not always meet expectations. The revised DIBELS6 decision thresholds can satisfactorily identify students who may require additional supports.

Keywords

evaluation of diagnostic systems, signal detection theory, Dynamic Indicators of Basic Early Literacy Skills

Screening instruments have risen to prominence in education due to the need to identify students as being at risk of poor reading and other outcomes. The practical benefits of universal screening include efficient measurement and the opportunity to prevent more serious deficits. Screening systems can help teachers make more efficient and effective instructional decisions (e.g., Stecker, Fuchs, & Fuchs, 2005) and reduce disproportionality in special education referrals (Marston, Muyskens, Lau, & Canter, 2003). This article uses signal detection methods to evaluate and select optimal cut scores (decision thresholds) for Dynamic Indicators of Basic Early Literacy Skills (DIBELS) Sixth Edition (DIBELS6, Good & Kaminski, 2002) using a large sample of English-proficient students from kindergarten through Grade 3.

The complete evaluation of the DIBELS6 diagnostic system involves a description of the overall accuracy of the measures and a selection of specific scores that optimally identify students with reading difficulties. An evaluation of DIBELS6 measures is needed because schools continue to use the DIBELS6 materials, despite the release of DIBELS Next. The data system at the University of Oregon (https://dibels.uoregon.edu/), for example, included DIBELS6 scores for approximately 336,000 students in 2,100 schools from the 2013-2014 school year; many more have been assessed with DIBELS6 by schools that use a different or no data system. The wide use of DIBELS6 and its highly publicized benchmark goals for decision making have continued without published, peer-reviewed works that examine cut points for all measures from kindergarten to Grade 3. Moreover, since the release of DIBELS Next in 2010, no peer-reviewed research has evaluated the goal-setting procedures or thresholds. A search of the PsycINFO database from 2010 to 2015 produced four publications that reported on DIBELS Next measures, and none explored the topic of decision thresholds.

This article presents the first thorough evaluation of DIBELS6 as a diagnostic system and offers accuracy standards that competing screening systems would ideally exceed. With 13,507 students proficient in English, 4,600 to 5,600 per grade, our analysis offers high precision and reduced potential confounding related to English language ability.

Prevention-Oriented Universal Screening

Universal screening for risk in the domain of educational achievement was relatively novel at the turn of this century, although screeners were common in special education. The more widespread use of screening in regular education was catalyzed by federal legislation, based in part on research showing that achievement deficits became increasingly intractable after Grade 4 (Juel, 1988). Universal screening was codified in a practice guide distributed by the Institute of Education Sciences (Gersten et al., 2008) that lists universal screening as the first step in any Response to Intervention model. Key features include assessment of all students at the beginning and middle of each school year, measures that efficiently capture essential academic skills, and empirically derived cut scores that accurately identify students at risk of poor academic outcomes. The primary aim of cut scores is to improve the objectivity and consistency of decisions about potential deficits and allow for remediation as early as possible. Many screening systems, however, do not have published decision thresholds that were empirically evaluated with appropriate methods (Smolkowski & Cummings, 2015).

Evolution of DIBELS

Good, Simmons, and Kame’enui (2001) described the use of reading measures that would align to the critical areas in beginning reading (see also Kaminski & Good, 1996, 1998) and predict end-of-year state test performance. Their results underscored the importance of screening all students, an early call for the use of predictive cut scores in the fall and winter as a way to identify students in need of instructional support. The methods in their article were not new (cf. Swets, 1973) but transformed how teachers used assessment data for decisions.

The use of DIBELS in schools has since evolved to include, by conservative estimates (Cummings, Park, & Bauer Schaper, 2012), participation of nearly one in six U.S. public school students in kindergarten through Grade 3. Widespread use warrants an updated investigation of the optimal cut scores for DIBELS6. The basic procedures used to determine the initial decision thresholds in Good et al. (2001) and Good, Simmons, Kame’enui, Kaminski, and Wallin (2002) relied on multiple and sometimes ambiguous criteria. For example, Good et al. (2001) specified that all Grade 1 students should read 40 correct words per minute by the end of the year, a standard that served as the anchor for the system. The choice of this standard, however, was not empirically derived. The authors stated that “a second criterion of an effective goal is rigor or ambitiousness” (Good et al., 2001, p. 267), but the criteria used to translate this goal into cut scores were unclear.

Without knowing the precise criteria for “healthy” performance, it is impossible to gauge the quality of any screener or its cut points (Smolkowski & Cummings, 2015). Good et al. (2001) and Good et al. (2002) often used a later DIBELS6 assessment as a criterion-referenced tool to distinguish the two underlying populations: typically achieving students and students with reading difficulties. This dependence between the diagnostic system under evaluation and the criterion measure, however, likely inflates the appearance of accuracy and may therefore bias decision thresholds (Smolkowski & Cummings, 2015). Good et al. also characterized the overall accuracy of measures with correlations mostly between different DIBELS6 measures collected at different times. Finally, Good and colleagues’ (2001) evaluation of scatterplots essentially relied on predictive values to confirm chosen decision thresholds, which depend on the relative number of students in the samples used to represent each of the two populations of interest (i.e., base rate; Smolkowski & Cummings, 2015).

Research Aims

The present analyses evaluate the ability of DIBELS6 to predict reading problems determined by comprehensive, end-of-year reading assessments in kindergarten through Grade 3. These analyses extend prior research by using external reading tests as criterion measures, drawing on many students from 34 schools across multiple districts, examining only students proficient in English to reduce the potential influence of language proficiency, and providing new decision threshold recommendations. We also introduce the concept of target performance on DIBELS measures. Current thresholds for reading screeners determine levels of risk for reading difficulties. It is natural, however, to anchor expectations on the highest cutoff available (e.g., benchmark; Jacowitz & Kahneman, 1995)—even if it represents minimally acceptable performance. This may lead some teachers to believe that students who meet the highest threshold read at a satisfactory level. As we demonstrate below, some students just above the benchmark threshold may still achieve below the 40th percentile on comprehensive tests. We therefore offer teachers a target performance level that corresponds to a degree of proficiency at which students are likely to meet or exceed standards set by districts or states.

The article addresses four research questions. First, can DIBELS6 measures accurately predict risk levels on comprehensive reading measures administered in the spring of the same year? Second, what are optimal decision thresholds, and do they offer acceptable classification performance? Third, how do optimal thresholds compare with previously established cut scores? Finally, can we produce a target level of performance that teachers can use to motivate a higher level of overall reading skill?

Method

Each Oregon Reading First (ORRF) school collected DIBELS6 data from 2003-2004 to 2005-2006 and administered the Stanford Achievement Test–10th Edition (SAT10; Harcourt Educational Measurement, 2002) at the end of kindergarten and Grades 1 and 2 and the Oregon Assessment of Knowledge and Skills (OAKS; Oregon Department of Education [ODE], 2008) section on reading and literature at the end of Grade 3.

Participants and Setting

This study included students from 34 Oregon schools funded in the first cycle of Reading First (see Baker et al., 2008; Baker et al., 2011; Fien et al., 2008) from 16 independent school districts, half in urban areas and the rest approximately equally divided between mid-size cities and rural areas. Approximately 10% of the students received special education services, 69% of students qualified for free or reduced-price lunch, and 27% of third graders did not pass minimum proficiency standards on the OAKS when schools enrolled in Reading First. Across Oregon at the same time, 44% qualified for free or reduced-price lunch and 18% of the Grade 3 students did not pass the third-grade OAKS.

Our sample included 13,507 English-proficient students who provided scores on the high-stakes criterion tests. About 32% of the students in the sample were identified as English learners. The 20% of the students who received services for limited English proficiency were removed. As 6,544 students provided data in multiple years, the sample includes a total of 20,051 criterion test scores (see Table 1 for n by grade). Half of the students, 49%, were female, and 6.7% were eligible for special education. Students fell into the following racial-ethnic categories: 57% Caucasian, 22% Hispanic or Latino/Latina, 11% African American, 5% American Indian, 4% Asian, and less than 1% Alaskan Native, Hawaiian, Pacific Islander, or “Other.”

Table 1.

Descriptive Information for English-Language-Proficient Students.

	Fall			Winter			Spring
	n	M	SD	n	M	SD	n	M	SD
K
LNF	4,981	9.2	12.2	5,367	26.6	17.3	5,594	39.6	18.1
PSF				5,366	23.4	16.0	5,594	45.9	16.0
NWF				5,360	16.7	16.4	5,595	34.0	20.7
SAT10 percentile							5,634	31.2	26.1
Grade 1
LNF	4,387	32.9	17.5
PSF	4,387	31.3	17.7	4,701	47.8	15.2	4,883	51.8	12.3
NWF	4,387	24.5	21.9	4,702	52.4	27.5	4,885	70.9	33.3
ORF				4,701	27.8	29.6	4,885	51.0	34.9
SAT10 percentile							4,953	35.5	27.9
Grade 2
NWF	4,078	57.7	32.7
ORF	4,138	41.7	31.8	4,408	69.4	39.1	4,571	86.4	39.3
SAT10 percentile							4,635	35.5	27.3
Grade 3
ORF	4,393	67.4	35.5	4,637	85.8	38.7	4,734	103.1	37.7
OAKS raw score							4,828	210.7	11.2

Note. For the OAKS, raw scores are reported; percentiles were unavailable. LNF = letter naming fluency; PSF = phoneme segmentation fluency; NWF = nonsense word fluency; ORF = oral reading fluency; SAT10 = Stanford Achievement Test–10th Edition; OAKS = Oregon Assessment of Knowledge and Skills.

In the fall, winter, and spring, students were administered DIBELS6 measures (Good & Kaminski, 2002), and 88% to 99% of students in kindergarten through Grade 3 participated in the assessments. In the spring, students were administered comprehensive reading tests; about 3% to 4% were excluded due to absences.

Criterion Measures

Stanford Achievement Test, 10th Edition

The SAT10 (Harcourt Educational Measurement, 2002) is a group-administered, norm-referenced test of overall reading proficiency. The measure is not timed. Kuder–Richardson reliability coefficients for total reading scores were .97 at Grade 1 and .95 at Grade 2. Correlations between the total reading score and the Otis–Lennon School Ability Test ranged from .61 to .74. We used the total reading score as our criterion with 2007 norms based on a representative sample of the U.S. student population.

Oregon Assessment of Knowledge and Skills (OAKS)

The OAKS, developed by the ODE (2008), is an untimed, multiple-choice test administered yearly to all Grade 3 students in Oregon. Reading passages represented literary, informative, and practical selections that students might encounter in school settings and other reading activities. Individual subtests require students to understand word meanings in the context of a selection; locate information in common resources; answer literal, inferential, and evaluative comprehension questions; recognize common literary forms, such as novels, short stories, poetry, and folk tales; and analyze the use of literary elements and devices, such as plot, setting, personification, and metaphor. ODE reported criterion validity of .75 with the California Achievement Tests and .78 with the Iowa Tests of Basic Skills. The scores from the four alternate test forms used for the OAKS demonstrated Kuder–Richardson reliability of .95.

DIBELS6

Below, we describe each measure and present technical adequacy. Table 1 indicates administration times. Dynamic Measurement Group (DMG; 2008) summarized test–retest and alternate-form reliability and concurrent and predictive validity estimates for DIBELS6 measures from 26 studies with 29 criterion tests (please consult DMG, 2008, for details).

Letter naming fluency (LNF)

LNF measures the number of randomly ordered uppercase and lowercase letters students name in 1 min. Score reliabilities ranged from .86 to .98 and validity estimates from .31 to .74 (DMG, 2008).

Phoneme segmentation fluency (PSF)

PSF measures phonemic awareness. Students are scored on the number of correct individual phonemes segmented from words read aloud by the examiner in 1 min. Score reliabilities ranged from .74 to .90 and validity coefficients from .43 to .59 (DMG, 2008).

Nonsense word fluency (NWF)

NWF measures alphabetic understanding and phonological recoding ability (Cummings, Dewey, Latimer, & Good, 2011). Students are scored on the number of phonemes they correctly identify from consonant–vowel and consonant–vowel–consonant pseudowords (either individual sounds or whole pseudowords) in 1 min. Score reliabilities ranged from .84 to .98 and validity coefficients from .33 to .82 (DMG, 2008).

Oral reading fluency (ORF)

DIBELS ORF measures fluency with connected text. Students read sets of three passages, 1 min each, and are scored on the median number of correctly read words. Score reliabilities ranged from .89 to .99 and validity estimates from .31 to .97 (DMG, 2008).

Data Collection

DIBELS measures were administered to students by school-based assessment teams in the fall, winter, and spring. Teams received 1-day trainings on DIBELS6 administration and scoring with additional calibration sessions from reading coaches at each school. Test–retest reliabilities ranged from .60 to .83 for PSF scores, .83 to .90 for NWF scores, and .93 to .97 for ORF scores.

Teachers administered the SAT10 and the OAKS each spring. SAT10 testing was monitored by Reading First coaches trained by the ORRF Center. Coaches trained teaching staff in their building on test administration and monitoring. Coaches documented testing procedures with an 18-item implementation fidelity checklist; median fidelity was 98.3%. Teachers administered the OAKS according to procedures established by the school, district, and state.

Analysis Approach

These analyses followed the methods outlined in Smolkowski and Cummings (2015). We first generated receiver operating characteristic (ROC) curves and calculated the area under the curve, A, for each measure administered at each time point to evaluate overall accuracy with respect to end-of-year criterion tests. An excellent screener should produce values of A at or above .950; for a good screener, A should range from .850 to .949; and reasonable screeners yield moderate A values from .750 to .849 (Swets, 1988). Values below .750 represent relatively poor diagnostic utility. We believe teacher judgments may be more valuable than the results of a reading screener with A < .75 (Martin & Shapiro, 2011).
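As an illustration of the accuracy index described above, the following minimal Python sketch computes A as the probability that a randomly chosen student from the reading-difficulty population scores lower on the screener than a randomly chosen typically achieving student, and then labels A with the ranges given in the text. It is not the SAS procedure used for the reported analyses; the function names `auc_below` and `accuracy_band` are hypothetical, and the rank-comparison estimator is an assumption chosen for transparency.

```python
from typing import Sequence


def auc_below(at_risk_scores: Sequence[float],
              typical_scores: Sequence[float]) -> float:
    """Nonparametric estimate of the area under the ROC curve.

    Counts, over all pairs, how often an at-risk student scores lower on
    the screener than a typically achieving student (ties count one half).
    """
    wins = 0.0
    for r in at_risk_scores:
        for t in typical_scores:
            if r < t:
                wins += 1.0
            elif r == t:
                wins += 0.5
    return wins / (len(at_risk_scores) * len(typical_scores))


def accuracy_band(a: float) -> str:
    """Label A with the ranges cited in the text (Swets, 1988)."""
    if a >= 0.95:
        return "excellent"
    if a >= 0.85:
        return "good"
    if a >= 0.75:
        return "moderate"
    return "poor"
```

A call such as `accuracy_band(auc_below(at_risk, typical))` would label one measure at one administration; the pairwise loop is quadratic in the sample size, so a rank-based formula would be preferable for samples as large as the one analyzed here.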

The selection of a decision threshold for each level of risk should depend on the anticipated consequences of four potential outcomes: false and true positives and negatives. In situations such as reading, where useful approximations of the full costs and benefits associated with outcomes are unavailable, Swets, Dawes, and Monahan (2000) suggested setting decision thresholds based on sensitivity or specificity. We set decision thresholds based on the complement of sensitivity, the false-negative fraction, so no more than 20% of students from the reading-difficulty population were incorrectly identified as typically achieving (sensitivity ≥ .80). For most decision thresholds, this criterion produced greater sensitivity than specificity, which will allow teachers to generally capture more false positives than false negatives. We believe it is more ethical to provide supplemental instruction that some students may not require than to fail to offer such instruction to students who truly need help. At-risk students with false-negative scores will not likely receive intensive supports, but they will still likely have a true-positive score for the some-risk threshold and consequently receive some supplemental instruction. Teachers might also catch typically achieving students incorrectly assigned to small-group instruction (false positive), whereas a student with reading difficulties assigned to standard instruction (false negative) may go unnoticed (Smolkowski & Cummings, 2015).

Our choice of decision thresholds also hinges on the observation that most reading screeners are not highly accurate (i.e., A < .95). For thresholds chosen to yield a sensitivity of .80, specificity seldom exceeds .80. Finally, this approach to establishing decision thresholds allows for a consistent interpretation of the cut scores for all measures at all administrations, unlike more ambiguous approaches.
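The sensitivity-based rule described above can be sketched in a few lines of Python. The sketch assumes that a student screens positive when the screener score falls below the cut score and that the chosen cut score is the smallest one reaching the required sensitivity; the function name `choose_threshold` and the scores in the usage note are hypothetical and are not data from this study.

```python
from typing import Sequence, Tuple


def choose_threshold(at_risk_scores: Sequence[float],
                     typical_scores: Sequence[float],
                     min_sensitivity: float = 0.80) -> Tuple[float, float, float]:
    """Return (cut score, sensitivity, specificity).

    The cut score is the smallest candidate c for which the proportion of
    at-risk students scoring below c is at least min_sensitivity.
    """
    candidates = sorted(set(at_risk_scores) | set(typical_scores))
    # One extra candidate above the maximum guarantees every student can
    # screen positive, so a qualifying cut score always exists.
    candidates.append(candidates[-1] + 1)
    for c in candidates:
        sensitivity = sum(s < c for s in at_risk_scores) / len(at_risk_scores)
        if sensitivity >= min_sensitivity:
            specificity = sum(s >= c for s in typical_scores) / len(typical_scores)
            return c, sensitivity, specificity
    raise ValueError("min_sensitivity must not exceed 1.0")
```

For example, with hypothetical at-risk scores `[0, 1, 2, 3, 4, 5, 6, 7, 8, 20]` and typical scores `[5, 9, 10, 12, 15, 18, 22, 25, 30, 40]`, the function selects the lowest cut score at which at least 8 of the 10 at-risk students screen positive and reports the specificity that follows from that choice.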

All analyses were conducted with SAS (SAS Institute, 2009) PROC LOGISTIC to estimate A and PROC FREQ for other statistics. For reporting, we followed the STAndards for the Reporting of Diagnostic accuracy studies (STARD; http://www.stard-statement.org/).

Results

Table 1 provides descriptive information for the SAT10, OAKS, and DIBELS6 measures. Tables 2 through 4 present A, the decision threshold, classification statistics, base rates (ρ), and the proportion screened positive (τ) for each measure. The statistics were defined for students with criterion values at the 20th normative percentile (at risk), 40th percentile (benchmark), and 60th percentile (target) on comprehensive tests collected at the end of the same year each screener was administered. We describe the level of precision surrounding estimates of A, sensitivity, and specificity in table notes.

Table 2.

Optimal Letter Naming Fluency and Phoneme Segmentation Fluency Cut Scores.

Statistic	Letter naming fluency				Phoneme segmentation fluency
	Kindergarten			1st	Kindergarten		1st
	F	W	S	F	W	S	F	W	S
At risk
A	.77	.84	.84	.82	.79	.73	.73	.68	.60
Threshold	6	27	42	33	28	54	40	56	61
Sensitivity	.81	.81	.81	.82	.81	.80	.81	.80	.82
Specificity	.62	.71	.68	.65	.60	.46	.47	.40	.27
NPV	.81	.82	.82	.87	.80	.74	.82	.78	.72
PPV	.62	.69	.67	.56	.61	.55	.45	.43	.39
ρ	.43	.44	.45	.35	.44	.45	.35	.36	.37
τ	.56	.51	.54	.51	.58	.65	.63	.68	.77
Some risk
A	.79	.85	.85	.82	.79	.71	.71	.64	.56
Threshold	11	34	47	38	33	57	44	59	62
Sensitivity	.81	.81	.80	.81	.80	.81	.81	.81	.82
Specificity	.63	.71	.72	.64	.62	.40	.39	.32	.23
NPV	.59	.61	.60	.71	.57	.47	.60	.54	.46
PPV	.83	.87	.87	.75	.83	.76	.64	.63	.61
ρ	.70	.70	.71	.58	.70	.71	.58	.59	.59
τ	.68	.65	.65	.62	.68	.75	.73	.76	.80
Target
A	.82	.88	.86	.80	.80	.70	.69	.63	.56
Threshold	14	37	50	42	36	58	45	60	62
Sensitivity	.81	.81	.81	.81	.82	.81	.80	.82	.81
Specificity	.67	.77	.76	.60	.62	.39	.39	.30	.24
NPV	.44	.46	.45	.53	.42	.30	.40	.36	.29
PPV	.92	.94	.94	.86	.91	.86	.79	.78	.76
ρ	.82	.82	.83	.74	.82	.83	.74	.75	.75
τ	.73	.70	.71	.71	.74	.78	.75	.79	.80

Note. Thresholds based on SAT10 criterion values: 20th, 40th, and 60th percentile for at risk, some risk, and target. A represents the area under the ROC curve; NPV = negative predictive value; PPV = positive predictive value; ρ = base rate; τ = proportion screened positive (scored below threshold). Thresholds bolded if A ≥ .75. 95% confidence intervals: ±.01 for LNF A values, except for target threshold in the fall of kindergarten (±.02), and sensitivity and specificity values; ±.02 for PSF A values and ±.01 for sensitivity and specificity values, except the at-risk threshold in the fall of Grade 1 (±.02). SAT10 = Stanford Achievement Test–10th Edition; ROC = receiver operating characteristic; LNF = letter naming fluency; PSF = phoneme segmentation fluency.

Table 3.

Optimal Nonsense Word Fluency Thresholds.

Statistic	Kindergarten		1st			2nd
	W	S	F	W	S	F
At risk
A	.85	.84	.84	.87	.84	.82
Threshold	14	34	19	48	62	52
Sensitivity	.82	.81	.80	.81	.80	.81
Specificity	.72	.67	.71	.69	.71	.65
NPV	.83	.81	.87	.87	.86	.87
PPV	.69	.67	.60	.60	.61	.54
ρ	.44	.45	.35	.36	.37	.34
τ	.52	.54	.47	.49	.48	.51
Some risk
A	.87	.86	.84	.83	.82	.79
Threshold	19	39	25	54	71	62
Sensitivity	.81	.80	.81	.80	.80	.81
Specificity	.76	.73	.69	.68	.69	.61
NPV	.63	.61	.73	.71	.70	.70
PPV	.89	.88	.78	.78	.79	.74
ρ	.70	.71	.58	.59	.59	.58
τ	.64	.65	.60	.61	.60	.63
Target
A	.89	.88	.83	.82	.82	.79
Threshold	22	42	30	59	81	70
Sensitivity	.80	.80	.81	.81	.80	.80
Specificity	.81	.76	.67	.66	.67	.60
NPV	.47	.45	.55	.53	.53	.50
PPV	.95	.94	.88	.88	.88	.86
ρ	.82	.83	.74	.75	.75	.75
τ	.70	.71	.68	.69	.68	.70

Note. Thresholds based on SAT10 criterion values: 20th, 40th, and 60th percentile for at risk, some risk, and target. A represents the area under the ROC curve; NPV = negative predictive value; PPV = positive predictive value; ρ = base rate; τ = proportion screened positive. Thresholds bolded if A ≥ .75. 95% confidence intervals: ±.01 for A values, except target in the fall of Grade 2 (±.02), and sensitivity and specificity values, except for specificity for thresholds in the fall of Grade 2 (±.02). SAT10 = Stanford Achievement Test–10th Edition; ROC = receiver operating characteristic.

Table 4.

Optimal Oral Reading Fluency Thresholds.

Statistic	1st		2nd			3rd
	W	S	F	W	S	F	W	S
At risk
A	.92	.95	.89	.91	.91	.84	.85	.84
Threshold	13	31	28	55	75	57	76	97
Sensitivity	.82	.81	.80	.81	.80	.80	.81	.80
Specificity	.86	.91	.82	.85	.85	.71	.71	.71
NPV	.90	.90	.89	.89	.89	.92	.92	.92
PPV	.77	.84	.69	.74	.75	.46	.48	.48
ρ	.36	.37	.34	.35	.35	.24	.25	.25
τ	.39	.35	.40	.38	.38	.42	.42	.41
Some risk
A	.91	.93	.86	.88	.87	.80	.82	.81
Threshold	19	47	41	76	96	72	89	110
Sensitivity	.81	.80	.80	.81	.80	.81	.80	.80
Specificity	.86	.90	.74	.76	.75	.63	.67	.66
NPV	.76	.76	.73	.74	.72	.80	.80	.80
PPV	.89	.92	.81	.83	.83	.64	.67	.66
ρ	.59	.59	.58	.59	.60	.45	.46	.46
τ	.54	.52	.58	.58	.58	.57	.55	.56
Target
A	.90	.91	.85	.87	.86	.80	.81	.81
Threshold	26	59	50	86	105	80	100	118
Sensitivity	.81	.80	.80	.80	.81	.81	.81	.80
Specificity	.84	.86	.70	.75	.73	.63	.65	.65
NPV	.59	.59	.54	.55	.54	.62	.63	.62
PPV	.94	.95	.89	.91	.91	.81	.82	.82
ρ	.75	.75	.75	.76	.76	.66	.67	.67
τ	.65	.64	.68	.67	.68	.66	.66	.65

Results Example: LNF

In Table 2, we report the results for LNF; the remaining results (i.e., Tables 2–4) would be interpreted similarly. For the at-risk level, the accuracy for DIBELS LNF in the fall of kindergarten was low, A = .77 with a 95% confidence interval of [.76, .78], just above .75, the value chosen as minimally acceptable for a screener. Students who were truly at risk of reading failure on the SAT10 had an 81% chance (sensitivity, [.80, .82]) of being identified as at risk on the LNF screener if they scored below 6 (threshold). Specificity values were less than ideal; of students at or above the 20th percentile on the SAT10, 62% [.61, .63] were identified as true negatives. This implies that 38% of students who did not fall below the 20th percentile on the SAT10 were falsely identified as positive. A different threshold for LNF could improve specificity but only at the expense of reduced sensitivity. The winter administration of LNF in kindergarten had a higher level of accuracy for the at-risk decision threshold, A = .84 [.83, .85], and consequently a more acceptable specificity value, .71 [.70, .72], for our chosen level of sensitivity. Nonetheless, the overall accuracy of LNF rarely exceeded the moderate range, A from .75 to .85.

Predictive values suggest the “clinical” significance of the screener (Pepe, 2003). Among the 56% of students (τ) who scored below 6 on the fall assessment of LNF in kindergarten and thus screened positive, the positive predictive value (PPV) shows that 62% will likely fall below the 20th percentile on the SAT10 in the spring. The negative predictive value (NPV) indicates that 81% of students who screened negative will likely score at or above the 20th percentile on the SAT10 in the spring. For a school with a similar base rate, the predictive values quantify the clinical implications of leaving students unsupported. They depend, however, on the base rate (ρ): PPV ranges between ρ and 1, and NPV ranges between 1 – ρ and 1. For the at-risk threshold in the fall of kindergarten, PPV must lie between .43 and 1 and NPV between .57 and 1. Hence, NPV will always be fairly high for risk levels with a low base rate, and the PPV will be high for criteria with high base rates.
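The lower bounds on the predictive values can be checked directly from their definitions. The short derivation below is a sketch that assumes a screener performing no worse than chance, that is, sensitivity ≥ 1 − specificity; this assumption is ours and is not stated explicitly in the text.

\begin{array}{l} \mathrm{PPV} = \frac{\rho \cdot \mathrm{sensitivity}}{\rho \cdot \mathrm{sensitivity} + (1 - \rho)(1 - \mathrm{specificity})} \geq \frac{\rho \cdot \mathrm{sensitivity}}{\rho \cdot \mathrm{sensitivity} + (1 - \rho) \cdot \mathrm{sensitivity}} = \rho, \\ \mathrm{NPV} = \frac{(1 - \rho) \cdot \mathrm{specificity}}{(1 - \rho) \cdot \mathrm{specificity} + \rho (1 - \mathrm{sensitivity})} \geq \frac{(1 - \rho) \cdot \mathrm{specificity}}{(1 - \rho) \cdot \mathrm{specificity} + \rho \cdot \mathrm{specificity}} = 1 - \rho . \end{array}

With ρ = .43, as for the at-risk threshold in the fall of kindergarten, these inequalities give PPV ≥ .43 and NPV ≥ .57.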

Because most schools have different base rates from those reported here, the predictive values in Table 2 will not likely generalize. It is possible to recalculate predictive values and the proportion screened positive, τ, for different base rates using the sensitivity and specificity values from the tables (Pepe, 2003):

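\begin{array}{l} \mathrm{PPV} = \frac{\rho \cdot \mathrm{sensitivity}}{\rho \cdot \mathrm{sensitivity} + (1 - \rho)(1 - \mathrm{specificity})}, \\ \mathrm{NPV} = \frac{(1 - \rho) \cdot \mathrm{specificity}}{(1 - \rho) \cdot \mathrm{specificity} + \rho (1 - \mathrm{sensitivity})}, \\ \tau = \rho \cdot \mathrm{sensitivity} + (1 - \rho)(1 - \mathrm{specificity}) . \end{array}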

For the winter LNF assessment in kindergarten in a school with just 15% of its students below the 20th percentile, the PPV decreases to .33, the NPV increases to .95, and τ changes to .36 from the values reported in Table 2. Note that A, sensitivity, and specificity do not depend on base rates as predictive values do (Smolkowski & Cummings, 2015).
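For readers who wish to repeat this recalculation for their own base rate, the following Python sketch applies the three formulas to a sensitivity and specificity taken from the tables. The function name `predictive_values` is hypothetical. Because the inputs are the rounded table values, results may differ slightly in the second decimal from figures computed with unrounded estimates.

```python
from typing import Dict


def predictive_values(base_rate: float,
                      sensitivity: float,
                      specificity: float) -> Dict[str, float]:
    """Recompute PPV, NPV, and the proportion screened positive (tau)."""
    true_pos = base_rate * sensitivity
    false_pos = (1 - base_rate) * (1 - specificity)
    true_neg = (1 - base_rate) * specificity
    false_neg = base_rate * (1 - sensitivity)
    return {
        "ppv": true_pos / (true_pos + false_pos),
        "npv": true_neg / (true_neg + false_neg),
        "tau": true_pos + false_pos,
    }


# Winter kindergarten LNF, at-risk threshold (sensitivity .81, specificity .71),
# applied to a school with a base rate of .15.
winter_lnf = predictive_values(base_rate=0.15, sensitivity=0.81, specificity=0.71)
```

Calling the same function with the Table 2 base rate for the fall of kindergarten (ρ = .43, sensitivity .81, specificity .62) should approximately reproduce the tabled PPV, NPV, and τ, which offers a quick check on the formulas.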

Overall Accuracy of DIBELS

Figure 1 displays three representative ROC curves for each measure with their level of precision. LNF, NWF, and ORF demonstrated adequate accuracy (A > .75), with some administrations of ORF reaching accuracy levels above .90. Conversely, only the winter kindergarten administration of PSF achieved A ≥ .75, with some administrations barely surpassing chance (A = .50), such as the spring of Grade 1, where A = .56. Confidence bounds provide information about precision and uncertainty, and confidence intervals for all A values fell within ±.02 of the reported values (see notes to Tables 2–4).

Figure 1.

Representative ROC curves for LNF, PSF, NWF, and ORF with the 40th percentile (benchmark) on the SAT10 (OAKS in Grade 3) at the end of the school year as the criterion.

Decision Thresholds

Figures 2 and 3 depict the NWF and ORF decision thresholds from Tables 3 and 4 by grade level and administration time. The thresholds increased substantially across each school year and dropped during the summer breaks. The figures also present the thresholds originally recommended by Good et al. (2001) and Good et al. (2002), and some differed markedly. The original thresholds to identify students at risk with NWF (dashed lines in Figure 2) underestimated the performance that students require to reach criterion levels on the SAT10. Some of the original cut scores were maintained at a constant level, such as for NWF in the spring of Grade 1 and fall of Grade 2. Our results suggest that teachers should focus on improving student performance throughout Grade 1.

Figure 2.

NWF decision thresholds determined with ORRF data (solid lines) and those determined by Good, Simmons, Kame’enui, Kaminski, and Wallin (2002; dashed lines).

Figure 3.

ORF decision thresholds determined with ORRF data (solid lines) and those determined by Good, Simmons, Kame’enui, Kaminski, and Wallin (2002; dashed lines).

Discussion

The present article applied signal detection methods to DIBELS6 as a diagnostic system for students from kindergarten through Grade 3. The ability of DIBELS6 to identify unsuccessful students was tested within an effective, research-based, tiered model of reading instruction (Baker et al., 2011). Results demonstrated that most DIBELS6 measures were accurate with a notable exception: for PSF, the area under the ROC curve was insufficient to recommend that teachers base decisions on this measure after the winter of kindergarten. Low correlations between PSF and later measures have been reported previously (e.g., Good et al., 2001), but reports had not included diagnostic accuracy. The accuracy of the three other measures—LNF, NWF, and ORF—indicates that they likely improve teachers’ decision making.

The decision thresholds chosen for each DIBELS6 measure in this study provide optimal cut scores based on a specific likelihood of known outcomes. For the two risk classifications, decision thresholds were chosen to accurately identify at least 80% of students who fall below the 20th percentile on the criterion measure administered at the end of the school year as having substantial risk of failure and at least 80% of students who fall below the 40th percentile as having reduced but nonetheless some risk of failure. These decision thresholds improve the likelihood that students with similar deficiencies in reading skills will be treated consistently across classrooms, schools, and districts.

Some of the decision thresholds reported here differ from those previously recommended (Good et al., 2001; Good et al., 2002), as shown in Figures 2 and 3. Previous approaches to recommending cut scores used ambiguous methods and a different criterion. The results in Good et al. (2001) also relied on fewer cases: 302 to 378, depending on the grade level. Good et al. (2002) drew on a larger sample, but the report provided mostly descriptive information about performance in the various risk categories. The present analysis produced LNF cut scores considerably higher than past recommendations, especially after the fall of kindergarten. For NWF, the original cut scores remained constant after the winter of first grade. Gains in predictive utility, however, are available in the winter and spring of Grade 1, especially for students at risk. Finally, for ORF, the present results generally agree with those in Good et al. (2001) and Good et al. (2002) except when predicting at-risk students in Grades 1 and 3. Past cut scores, especially those markedly below the present thresholds, may offer false hope to students and teachers. Differences between the decision thresholds in Good et al. (2001) and Good et al. (2002) and those reported here indicate that an update in screener performance standards may improve the efficiency of supplemental instruction delivery.

Target Performance

The present investigation introduces a new concept: target performance. The target threshold was intended to help teachers focus on more ideal performance. It shows teachers how much better students must perform to minimize the likelihood of performing below standards, in this case, the 40th percentile, on comprehensive tests.

Limitations

The data for this study were generated from English-proficient students attending ORRF schools (Baker et al., 2011) assessed with their respective criterion measures. Decision thresholds presented here may not generalize to all children in all schools. Sensitivity, however, does not depend on base rates, so thresholds set with sensitivity, unlike those set with predictive values, should minimize differences across schools that aim to achieve the same criterion level of performance.
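
The contrast between sensitivity and predictive values can be illustrated with Bayes' rule. The sketch below is only an illustration: the sensitivity and specificity of .80 and the two base rates are hypothetical values chosen for the example, not results from this study.

```python
def positive_predictive_value(sens, spec, base_rate):
    """Probability that a flagged student is truly at risk (Bayes' rule)."""
    true_pos = sens * base_rate                    # at risk and flagged
    false_pos = (1.0 - spec) * (1.0 - base_rate)   # not at risk but flagged
    return true_pos / (true_pos + false_pos)


# Hypothetical screener with fixed sensitivity and specificity of .80,
# applied in two schools with different proportions of at-risk students.
for base_rate in (0.20, 0.40):
    ppv = positive_predictive_value(0.80, 0.80, base_rate)
    print(f"base rate {base_rate:.2f}: sensitivity 0.80, PPV {ppv:.2f}")
```

With these assumed inputs, the positive predictive value rises from about .50 to about .73 as the base rate doubles, while sensitivity stays at .80 by construction.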

Future Directions

Many published evaluations of screening systems use small samples that result in limited precision (e.g., Hintze, Ryan, & Stoner, 2003). Nelson (2008) presented the classification results for DIBELS6 with 177 kindergarten students but no estimates of precision. For a test of NWF that produced sensitivity of .62, we calculated a confidence interval of ±.14 or [.48, .76] (Harper & Reeves, 1999), which covers the range from very poor to moderate sensitivity. The broad confidence bounds demonstrate why reporting precision is important.
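
To show how sample size drives precision, the sketch below computes a normal-approximation (Wald) interval for a proportion such as sensitivity. The source does not state which formula produced the ±.14 figure, and the count of 46 at-risk students is a hypothetical value chosen only because it yields a similar half-width; neither should be read as a description of Nelson (2008).

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wald_half_width(p, n, z=Z_95):
    """Half-width of the normal-approximation interval for a proportion.

    For sensitivity, n is the number of truly at-risk students,
    not the size of the full sample.
    """
    return z * math.sqrt(p * (1.0 - p) / n)


def wald_interval(p, n, z=Z_95):
    """Lower and upper bounds, clipped to the [0, 1] range."""
    hw = wald_half_width(p, n, z)
    return max(0.0, p - hw), min(1.0, p + hw)


# Sensitivity of .62 with hypothetical counts of at-risk students.
for n in (46, 500, 2000):
    low, high = wald_interval(0.62, n)
    print(f"n = {n:4d}: [{low:.2f}, {high:.2f}]")
```

The half-width shrinks in proportion to the square root of the number of at-risk cases, which is why large samples give much narrower bounds than small ones.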

Researchers may combine multiple screeners to choose a decision threshold. In areas of literacy and numeracy, however, many of the available measures assess specific skills and identify important skill deficits (e.g., decoding versus phonemic awareness; number operations versus problem solving), suggesting that a combination score may not yield the most useful information for educators. Evaluations of combinations of tests may be conducted with a number of methods (Pepe, 2003). Gigerenzer and Goldstein (1996) suggested the Take the Best heuristic, but McGrath (2008) showed that “the best single predictor often can perform better than do multiple predictors when the predictors are combined using methods common in applied settings” (p. 195). The costs and benefits of combined or chained screeners and screener/comprehensive-test combinations have yet to be evaluated in education.

Research should also account for the scope and sequence of curricula, which could influence the validity of the decision thresholds. Although PSF was recommended for the winter of kindergarten, it assumes students received relevant instruction. Some curricula do not teach phoneme segmentation until after the winter assessment (e.g., Read Well Kindergarten; Sprick, Jones, Dunn, & Gunn, 2008). The validity of academic screeners depends in part on their alignment with instruction, an issue that requires further investigation.

Implications for Practice

DIBELS6 measures are generally accurate, and specific decision thresholds can identify students who require supports. Because the decision thresholds presented here were rigorously evaluated, schools may choose to update their standards based on these findings and possibly reduce the use of PSF. The present analysis relied on statistics not affected by base rates, and the large sample provided narrow confidence bounds. We thus believe the results are generalizable to many schools nationwide. The results of this study do not, however, prescribe the supports required by struggling students. Some students may need only minimal supports, whereas others may benefit from intensive instruction. Schools must determine the level of supports based on the needs of students and resources available.

Many other screening systems exist, with others sure to arrive soon, but new does not guarantee better. As with instructional fads (Slavin, 1989), schools, districts, and states are quick to adopt the next assessment system. Schools have begun to adopt DIBELS Next, for example, which does not yet have a peer reviewed evaluation. Publishers of newer systems should strive to best the results herein. Until then, many of the DIBELS6 measures can continue to serve students well.

Footnotes

Acknowledgements

The authors acknowledge Drs. Scott Baker, John Seeley, and Hank Fien for their suggestions for the general approach and comments on manuscript drafts. We also acknowledge Lisa Strycker for her comments, edits, and general help with the manuscript.

Authors’ Note

The data used in this report were previously published in Baker et al. (2011); Baker et al. (2008); and . The opinions expressed are those of the authors and do not represent views of Oregon Research Institute, the University of Maryland, or the U.S. Department of Education.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by an Oregon Reading First subcontract from the Oregon Department of Education to the University of Oregon (8948). The original Oregon Reading First grant was made from the U.S. Department of Education to the Oregon Department of Education (S357A0020038). This research was also supported by two grants (R324A090111, R324A090104) from the Institute of Education Sciences, U.S. Department of Education.

References

Baker, S. K., Smolkowski, Katz, Fien, Seeley, J. R., Kame’enui, E. J., & Thomas Beck (2008). Reading fluency as a predictor of reading proficiency in low performing high poverty schools. School Psychology Review, 37, 18-37.

Baker, S. K., Smolkowski, Smith, J. M., Fien, Kame’enui, E. J., & Thomas Beck (2011). The impact of Oregon Reading First on student reading outcomes. Elementary School Journal, 112, 307-331.

Cummings, K. D., Dewey, Latimer, & Good, R. H. (2011). Pathways to word reading and decoding: The roles of automaticity and accuracy. School Psychology Review, 40, 284-295.

Cummings, K. D., Park, & Bauer Schaper, H. A. (2012). Form effects on DIBELS Next oral reading fluency progress-monitoring passages. Assessment for Effective Intervention, 38, 91-104.

Dynamic Measurement Group. (2008). DIBELS 6th edition technical adequacy information (Technical Report No. 6). Eugene, OR: Author. Retrieved from http://dibels.org/pubs.html

Fien, Baker, S. K., Smolkowski, Smith, J. M., Kame’enui, E. J., & Thomas Beck (2008). Using nonsense word fluency to predict reading proficiency in K-2 for English learners and native English speakers. School Psychology Review, 37, 391-408.

Gersten, Compton, Connor, C. M., Dimino, Santoro, Linan-Thompson, & Tilly, W. D. (2008). Assisting students struggling with reading: Response to intervention and multi-tier intervention for reading in the primary grades: A practice guide (NCEE No. 2009-4045). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.

Gigerenzer & Goldstein (1996). Reasoning the fast and frugal way: Models of bounded rationality. Psychological Review, 103, 650-669.

Good, R. H., III, & Kaminski, R. A. (Eds.). (2002). Dynamic Indicators of Basic Early Literacy Skills (6th ed.). Eugene, OR: Institute for the Development of Educational Achievement. Available from http://dibels.uoregon.edu/

Good, R. H., III, Simmons, D. C., & Kame’enui, E. J. (2001). The importance and decision-making utility of a continuum of fluency-based indicators of foundational reading skills for third-grade high-stakes outcomes. Scientific Studies of Reading, 5, 257-288.

Good, R. H., III, Simmons, Kame’enui, Kaminski, R. A., & Wallin (2002). Summary of decision rules for intensive, strategic, and benchmark instructional recommendations in kindergarten through third grade (Technical Report No. 11). Eugene: University of Oregon, Center on Teaching and Learning.

Harcourt Educational Measurement. (2002). Stanford Achievement Test (SAT10). San Antonio, TX: Author.

Harper & Reeves (1999). Reporting of precision of estimates for diagnostic accuracy: A review. British Medical Journal, 318, 1322-1323.

Hintze, J. M., Ryan, A. L., & Stoner (2003). Concurrent validity and diagnostic accuracy of the dynamic indicators of basic early literacy skills and the comprehensive test of phonological processing. School Psychology Review, 32(4), 541-556.

Jacowitz, K. E., & Kahneman (1995). Measures of anchoring in estimation tasks. Personality and Social Psychology Bulletin, 21, 1161-1166.

Juel (1988). Learning to read and write: A longitudinal study of 54 children from first through fourth grades. Journal of Educational Psychology, 80, 437-447.

Kaminski, R. A., & Good, R. H. (1996). Toward a technology for assessing basic early literacy skills. School Psychology Review, 25, 215-227.

Kaminski, R. A., & Good, R. H. (1998). Assessing early literacy skills in a problem solving model: Dynamic Indicators of Basic Early Literacy Skills. In M. R. Shinn (Ed.), Advanced applications of curriculum-based measurement (pp. 113-142). New York, NY: Guilford Press.

Marston, Muyskens, Lau, & Canter (2003). Problem-solving model for decision making with high-incidence disabilities: The Minneapolis experience. Learning Disabilities Research & Practice, 18, 187-200. doi:10.1111/1540-5826.00074

Martin, S. D., & Shapiro, E. S. (2011). Examining the accuracy of teachers’ judgments of DIBELS performance. Psychology in the Schools, 48, 343-356.

McGrath, R. E. (2008). Predictor combination in binary decision-making situations. Psychological Assessment, 20, 195-205. doi:10.1037/a0013175

Nelson, J. M. (2008). Beyond correlational analysis of the Dynamic Indicators of Basic Early Literacy Skills (DIBELS): A classification validity study. School Psychology Quarterly, 23, 542-552.

Oregon Department of Education. (2008). OAKS—Test administration manual: 2008-2009 school year. Retrieved from http://www.ode.state.or.us/teachlearn/testing/manuals/2009/0809tam.pdf

Pepe, M. S. (2003). The statistical evaluation of medical tests for classification and prediction. New York, NY: Oxford University Press.

SAS Institute. (2009). Base SAS® 9.2 procedures guide: Statistical procedures (2nd ed.). Cary, NC: Author. Retrieved from http://support.sas.com/documentation/index.html

Slavin, R. E. (1989). PET and the pendulum: Faddism in education and how to stop it. Phi Delta Kappan, 70, 752-758.

Smolkowski & Cummings (2015). Evaluation of diagnostic systems: The selection of students at risk for reading difficulties. Assessment for Effective Intervention. Advance online publication. doi:10.1177/1534508415590386

Sprick, Jones, S. V., Dunn, & Gunn (2008). Read well kindergarten: Critical foundations in beginning reading (2nd ed.). Longmont, CO: Sopris West.

Stecker, P. M., Fuchs, L. S., & Fuchs (2005). Using curriculum-based measurement to improve student achievement: Review of research. Psychology in the Schools, 42, 795-819. doi:10.1002/pits.20113

Swets, J. A. (1973). The relative operating characteristic in psychology. Science, 182, 990-1000.

Swets, J. A. (1988). Measuring the accuracy of diagnostic systems. Science, 240, 1285-1293.

Swets, J. A., Dawes, R. M., & Monahan (2000). Psychological science can improve diagnostic decisions. Psychological Science in the Public Interest, 1, 1-26.