Assessment Fidelity in Reading Intervention Research

Abstract

Recent studies indicate that examiners make a number of intentional and unintentional errors when administering reading assessments to students. Because these errors introduce construct-irrelevant variance in scores, the fidelity of test administrations could influence the results of evaluation studies. To determine how assessment fidelity is being addressed in reading intervention research, we systematically reviewed 46 studies conducted with students in Grades K–8 identified as having a reading disability or at-risk for reading failure. Articles were coded for features such as the number and type of tests administered, experience and role of examiners, tester to student ratio, initial and follow-up training provided, monitoring procedures, testing environment, and scoring procedures. Findings suggest assessment integrity data are rarely reported. We discuss the results in a framework of potential threats to assessment fidelity and the implications of these threats for interpreting intervention study results.

Keywords

assessment fidelity, reading intervention, testing

In intervention studies, the importance of monitoring treatment integrity (operationalized here as the extent to which the intervention was implemented as intended) is well known, albeit often lacking from reports of the methods employed (Gearing et al., 2011; Swanson, Wanzek, Haring, Ciullo, & McCulley, 2011). Researchers are admonished to monitor treatment integrity, or implementation fidelity, to increase confidence in concluding that the outcomes are attributable to the independent variable (Century, Freeman, & Rudnick, 2008; Gersten et al., 2005). Examinations of data from individual studies suggest treatment integrity influences the effectiveness of early reading interventions (Foorman & Moats, 2004; Stein et al., 2008) as well as reading interventions for middle school students (e.g., Benner, Nelson, Stage, & Ralston, 2011; Vaughn et al., 2013). Moreover, simulated comparisons of effect sizes at different levels of fidelity indicate poor treatment integrity likely produces systematically biased results that minimize the true impact of the intervention (Stockard, 2010).

Recent work suggests assessment fidelity (operationalized here as the extent to which performance-based reading assessments are administered and scored as intended) also might play an important role in interpreting student outcomes. For example, significant differences in oral reading fluency and retell scores can result from delivering different directions and prompts (Reed & Petscher, 2012), as well as having a different assessor or using a different testing environment (Derr & Shapiro, 1989). Furthermore, estimates of student growth over time during progress monitoring have been shown to vary appreciably as the quality of the data set is compromised, and some of that loss of quality is likely due to examiner variability (Christ, Zopluoglu, Long, & Monaghen, 2012).

There is reason to believe unintended alterations of testing protocols might be occurring with regularity both in research and authentic school settings. A study utilizing hired data collectors in Grades 6 to 8 found as much as 8% of closely monitored test administrations demonstrated extreme lack of fidelity, and 91% of accurate administrations still had correctable errors identified in double scoring (Reed & Sturges, 2012). In research conducted with Head Start and kindergarten children, almost 40% of the variance in end-of-year scores was attributable to the assessor when the test was administered by the students’ own classroom teacher (Waterman, McDermott, Fantuzzo, & Gadsden, 2011). However, variance associated with examiners in the study who were outside of and independent from the classroom was only 5.3% or less. A separate study of archival data gathered from typical school administrations found that 16% of the variance in oral reading fluency scores of students in Grade 3 was attributable to examiners (Cummings, Biancarosa, Schaper, & Reed, 2013). This error effect was observed after accounting for student- and school-level variance and correcting computation errors on scoring forms. Hence, poor assessment fidelity can introduce systematic error, or construct-irrelevant variance, that compromises the measurement of the intended reading constructs.
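To illustrate what it means for a share of score variance to be "attributable to examiners," the following is a minimal sketch rather than a reproduction of any cited analysis. The models used in the cited studies are not described here, so the sketch substitutes a simple one-way random-effects ANOVA estimator (an intraclass correlation) and assumes a balanced design in which every examiner tests the same number of students. The function name and the example data are hypothetical.

```python
from statistics import mean


def examiner_variance_share(scores_by_examiner):
    """Estimate the proportion of score variance associated with examiners.

    Assumption: balanced design (equal number of scores per examiner).
    Uses the one-way random-effects intraclass correlation:
        (MS_between - MS_within) / (MS_between + (k - 1) * MS_within)
    """
    groups = list(scores_by_examiner.values())
    n_groups = len(groups)
    if n_groups < 2:
        raise ValueError("need at least two examiners")
    k = len(groups[0])
    if k < 2 or any(len(g) != k for g in groups):
        raise ValueError("need a balanced design with 2+ scores per examiner")

    grand_mean = mean([x for g in groups for x in g])
    group_means = [mean(g) for g in groups]

    # Mean square between examiners and mean square within examiners.
    ms_between = k * sum((m - grand_mean) ** 2 for m in group_means) / (n_groups - 1)
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    ) / (n_groups * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        return 0.0  # no variance at all in the scores
    # Negative estimates are truncated to zero, a common convention.
    return max(0.0, (ms_between - ms_within) / denominator)


# Hypothetical words-correct-per-minute scores recorded by two examiners.
share = examiner_variance_share({"A": [10, 12], "B": [20, 22]})
```

A value near 1 would indicate that most of the observed variation tracks who administered the test rather than differences among students; a value near 0 would indicate little examiner effect. Real data sets are rarely balanced and usually nest students within schools, which is why multilevel models are generally preferred over this simplified estimator.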

Construct Irrelevant Variance in Standardized Testing

Systematic error has been identified as an issue not only for individually administered reading tests, such as oral reading fluency and retell, but also for group administered, standardized tests. Chief among these concerns has been the presence of irregularities suggestive of cheating (U.S. Department of Education, 2013). Haladyna and Downing (2005) outlined a taxonomy of construct-irrelevant variance sources that threaten the interpretation of high-stakes tests in which cheating is only one category. The others are uniformity in and types of test preparation; test development, administration, and scoring; and students. In the sections that follow, we describe the particular sources of examiner error within these categories that might apply to reading tests used for intervention research purposes.

Unethical Test Preparation

Critics of practices associated with high-stakes tests have noted that educators are inclined to teach to the test (Pedulla et al., 2003; Tanner, 2013). This type of unethical preparation might include narrowing the curriculum to the specific content on which students will be assessed or having students practice items from current or older versions, depending on the accessibility of these forms. Although these behaviors have been exhibited by school personnel subject to accountability systems, archival data from state-mandated assessments are often used to screen participants for reading intervention studies (e.g., Faggella-Luby & Wardwell, 2011; Thompson & Davis, 2002) or to determine student improvements (e.g., Ritter & Saxon, 2011; Vaughn et al., 2010). In addition, the potential exists for classroom teachers or research staff to bias outcomes on study measures should recommendations for test security not be enacted (Wollack & Fremer, 2013).

Test Administration

During testing, student performance might be influenced by changing the wording of the directions (Colón & Kranzler, 2006; Reed & Petscher, 2012), timing (Derr-Minneci & Shapiro, 1992), or a number of environmental distractions (Christ, Zopluoglu, Long, & Monaghen, 2012). The impact of these factors has been more closely studied with individually administered reading tests, but such alterations have also been documented during group administered assessments (e.g., Nolen, Haladyna, & Haas, 1992; Sackes, 2000). The location of the test administration may be more problematic during research studies than during state- or district-mandated assessments. That is, space restrictions at school sites may limit researchers’ abilities to control aspects of the testing environment such as the number of students per room or the ambient noise present (Reed & Sturges, 2012).

Test Scoring

When tests are individually administered, pressure is placed on the examiner to record student responses accurately. In the case of some assessments, such as spelling and intelligence tests, there are basal or ceiling rules examiners must apply in real time. Other assessments may require transferring, converting, or calculating a score. All these scenarios introduce the possibility of mistakes being made (Charter, Walden, & Padilla, 2000; Loe, Kadlubek, & Marks, 2007; Ramos, Alfonso, & Schermerhorn, 2009). Many group administered tests are machine scored, which in theory should limit the kinds of examiner errors reportedly found when double scoring test documents by hand (Cummings et al., 2013; Reed & Sturges, 2012). Still, there have been a number of high-profile machine scoring errors that highlight the importance of verifying the accuracy of all test data (e.g., Baker, 2013; Metz, 2007; Romano, 2006).

Individual Cheating

Among the egregious errors observed by Reed and Sturges (2012) was an instance of coaching in which the examiner used hand motions to encourage a student to continue providing more information until he delivered the correct response. Although this was committed during an individually administered assessment, it is equally likely that students could be coached during standardized group testing. Providing answers, cueing students to incorrect answers, and leaving tested information or “cheat sheets” visible in the room have all been documented as irregularities on high-stakes assessments (Amrein-Beardsley, Berliner, & Rideau, 2010). As noted for unethical test preparation, data from these measures might be used in reading intervention studies for participant selection or estimating treatment effects. Moreover, the potential for examiner bias in intervention research has prompted recommendations for ensuring data collectors are blind to study condition (Gearing et al., 2011; Gersten et al., 2005).

Purpose

Given the variety of systematic errors that are possible when testing student participants in a reading intervention study, an accounting of assessment fidelity might be just as important as documentation of treatment integrity in establishing the quality of research results. Whereas the extant literature yields a rich history of reviewing implementation fidelity (Dane & Schneider, 1998; Gresham, MacMillan, Beebe-Frankenberger, & Bocian, 2000; McIntyre, Gresham, DiGennaro, & Reed, 2007; O’Donnell, 2008; Swanson et al., 2011), we could locate no synthesis of assessment fidelity. Therefore, we conducted a systematic review of studies to address the question: To what extent is assessment fidelity reported in reading intervention research conducted in elementary and middle schools?

Method

Search Procedures

We searched for relevant articles to include in this review in two ways. First, the ERIC, Academic Search Complete, and PsycINFO electronic databases were searched for peer-reviewed publications using combinations of the following terms: read*, learning disabilit*, reading disabilit*, literacy, dyslex*, and intervention. Second, we examined the references of recent reading intervention syntheses and meta-analyses (e.g., Flynn, Zheng, & Swanson, 2012; Tran, Sanchez, Arellano, & Swanson, 2011) to determine if any studies were missed in the electronic search. Over 9,000 articles published in the last decade (2001-July 2012), a time during which increased attention has been placed on implementation fidelity in educational research (National Research Council, 2002), were identified and evaluated against the following criteria for inclusion in this review:

Participating students were in Grades K–8 (ages 5–14). Studies including younger or older students were accepted if the majority of the participants were in Grades K–8.

At least some of the students were identified with a learning disability (LD), reading disability (RD), dyslexia, or at-risk for reading failure. If all students were identified on the basis of intellectual disability alone, the study was excluded (e.g., Allor, Mathes, Roberts, Cheatham, & Champlin, 2010).

The instructional intervention was focused on reading skills: word identification, fluency, vocabulary, or comprehension. Studies were excluded if a major focus of the intervention was oral or auditory language, attention therapy, math skills, or behavioral or affective factors (e.g., Nelson & Manset-Williamson, 2006).

The studies had an experimental or quasi-experimental design with a defined comparison group. These designs were chosen because they often use larger groups of students and, therefore, are more likely to exhibit factors with the potential for affecting assessment fidelity. Specifically, they are more likely to utilize multiple testers, have testers assessing multiple students, incorporate more than one instrument, and include treatment conditions to which the testers may or may not be blinded (Gersten et al., 2005).

Assessments used were standardized measures of reading with publicly available technical manuals in keeping with quality indicators for intervention research (Gersten et al., 2005; Towne, Wise, & Winters, 2005). Studies that employed only tests of listening comprehension or content learning were excluded (e.g., DiCecco & Gleason, 2002; Wilder & Williams, 2001). Studies that included informal measures of reading, such as researcher-developed measures without established technical adequacy or that are not publicly available, were included only if they also administered and reported standardized measures. Standardized measures have clearly delineated and disseminated protocols for their use, thus providing a known standard against which the fidelity of their use could be compared objectively.

The intervention had to target development of English reading skills. The students’ native language could have been used in some instruction or assessment as long as the goal was to teach students to read in English. For example, bilingual and transitional English instruction were accepted (e.g., Kamps et al., 2007) but two-way immersion was rejected (e.g., Calhoon, Otaiba, Cihak, King, & Avalos, 2007).

Consistent with recent syntheses of intervention fidelity (Gearing et al., 2011; Swanson et al., 2011), the study was reported in a peer-reviewed journal.
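The seven inclusion criteria above amount to a conjunction: a study is retained only if every criterion is satisfied. A minimal sketch of that screening logic follows. The record structure, field names, and helper functions are hypothetical conveniences for illustration and do not represent the instrument the investigators used.

```python
from dataclasses import dataclass, fields


@dataclass
class ScreeningRecord:
    """One abstract judged against the seven inclusion criteria (hypothetical)."""
    majority_grades_k_to_8: bool
    some_students_ld_rd_dyslexia_or_at_risk: bool
    intervention_focused_on_reading_skills: bool
    experimental_or_quasi_with_comparison_group: bool
    standardized_reading_measure_reported: bool
    targets_english_reading: bool
    peer_reviewed_journal: bool


def failed_criteria(record):
    """Return the names of the criteria a study did not meet."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


def meets_all_criteria(record):
    """A study is included only when no criterion fails."""
    return not failed_criteria(record)


# Hypothetical example: a study that meets every criterion except peer review.
example = ScreeningRecord(True, True, True, True, True, True, False)
excluded_because = failed_criteria(example)
```

Recording which criterion failed, rather than only an include/exclude flag, makes it easier to audit exclusion decisions later.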

Of the more than 9,000 abstracts evaluated, 46 met all seven criteria and were included in the analysis. These were drawn from 13 different journals, with two journals (Journal of Learning Disabilities and Learning Disabilities Research and Practice) accounting for more than half (n = 24) of the studies. All journals were representative of special education publications, which we believe was an artifact of the search terms employed. Although we acknowledge other journals dedicated to reading research (e.g., Reading Research Quarterly, Scientific Studies of Reading, Reading and Writing: An Interdisciplinary Journal) would likely contribute additional articles, we decided to maintain the special education focus of the corpus and not manually search other journals for two reasons. First, students in special education or those receiving other types of supplementary education (e.g., dyslexia services) arguably are the most affected by instructional, placement, and outcome decisions made based on reading assessment data. Second, assessment fidelity has not received as much attention as treatment integrity (Reed & Sturges, 2012), yet previous syntheses of the latter have found that little information on implementation fidelity is reported (Gearing et al., 2011; Swanson et al., 2011). Hence, we approached this review of assessment fidelity as exploratory and not exhaustive.

Data Analysis

Coding the 46 studies proceeded via a three-stage iterative process of revision and refinement. In the first stage, three of the four investigators refined the codes used to review the studies. In the second, two investigators preliminarily coded all articles in the corpus. And in the third stage, the four investigators double coded all articles and resolved coding discrepancies.

Stage 1 Coding Procedures

Given that no previous review of assessment fidelity was identified in the extant literature, an original coding scheme had to be devised. Using suggested threats to assessment fidelity (see introduction; Haladyna & Downing, 2005; Reed & Sturges, 2012) and assessment administration requirements described in the manuals of commonly used measures (e.g., Good & Kaminski, 2002; Torgesen, Wagner, & Rashotte, 1999; Woodcock, 1998), three investigators decided on preliminary codes: number of students tested, measures administered, training of testers, testing procedures, monitoring of test administrations, follow-up training, and verification of scoring. We then employed an iterative refinement process. That is, the investigators independently coded an article chosen at random, discussed their work, and made adjustments to the coding instrument before proceeding to a second and third study in succession.

Refinements of the coding scheme took into account complexities in the articles and characteristics not previously considered such as variability of testing purposes, functions, and procedures within each study; differing counts of treatment and control subjects; and the use of extant data. Our final code sheet consisted of 28 separate codes. Discrepancies in this phase were considered resolved when all three raters were in agreement with (a) the classification of the information in the articles and (b) the sufficiency and accuracy with which the coding scheme captured that information.

Stage 2 Coding Procedures

During the preliminary coding stage, the other 43 articles in the corpus of 46 studies were independently coded by two investigators. A list of the studies was ordered alphabetically by first author and divided. One investigator coded the first 22 articles, and another coded the second 21 articles in the list. The raters discussed their work with the other investigators in a weekly conference call and made further refinements to the coding schemes to address errors (e.g., the classification of testing procedures vs. scoring procedures), allow for hypothesized sources of administration errors not captured by current codes (e.g., increasing the complexity of testing by administering researcher-developed measures in addition to standardized assessments), or address unique issues of the studies (e.g., using extant data as the solitary pre- and posttest).
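The division of articles described above (alphabetical ordering by first author, then a 22/21 split) can be sketched as follows. The function name and the placeholder author labels are hypothetical, and tie-breaking among identical first authors is not specified in the text.

```python
def split_for_preliminary_coding(first_authors, first_share=22):
    """Order studies alphabetically by first author and divide them in two.

    Assumption: case-insensitive alphabetical order; the first `first_share`
    studies go to one coder and the remainder to the other.
    """
    ordered = sorted(first_authors, key=str.casefold)
    return ordered[:first_share], ordered[first_share:]


# Placeholder labels standing in for the 43 remaining studies.
studies = [f"Author{i:02d}" for i in range(43)]
coder_one, coder_two = split_for_preliminary_coding(studies)
```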

Stage 3 Coding Procedures

The initial double coding involved a third investigator independently coding a random sample of 10 articles, five from each of those assigned to the first two raters. A fourth investigator then highlighted all discrepancies among the raters. In the discussion among all investigators, it became apparent that double coding all studies in the corpus was necessary to ensure accuracy. The discrepancies between two coders exceeded acceptable standards (Krippendorff, 2004) for several reasons. First, there was no precedent for coding assessment fidelity information, so essentially the instrument had to be developed while it was being used. There also was not a standard for reporting assessment fidelity, as exists for treatment integrity (Gersten et al., 2005), so relevant information was more widely distributed across sections of the articles. This increased the likelihood that some things were missed by a single coder. Finally, there were many inconsistencies in the way information was reported within and between studies, including but not limited to (a) narrative descriptions of measures not matching what was included in data tables, (b) lack of clarity regarding whether the researchers administered a particular test or gathered scores from administration by school personnel, and (c) screening measures described in a different section of the article than where pre- and posttests were reported.

To overcome these challenges, all four investigators were assigned a random sample of the corpus such that the original raters’ sets of articles were distributed among three other investigators for double coding. The two raters of each article (i.e., the original and double coder) then conferred and resolved discrepancies until they achieved 100% agreement. Coded information was checked against the source article a final time when being organized into tables.
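The text does not state which agreement index was computed before discrepancies were resolved. As an illustration only, the sketch below computes simple percent agreement and Cohen's kappa for two coders assigning nominal codes to the same items. Cohen's kappa is swapped in here for simplicity; it is not Krippendorff's alpha, and nothing in the text indicates that the investigators used it. The function names and the example codes are hypothetical.

```python
from collections import Counter


def percent_agreement(codes_a, codes_b):
    """Share of items on which two coders assigned the same code."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must rate the same non-empty set of items")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


def cohens_kappa(codes_a, codes_b):
    """Chance-corrected agreement for two coders and nominal codes."""
    observed = percent_agreement(codes_a, codes_b)
    n = len(codes_a)
    counts_a, counts_b = Counter(codes_a), Counter(codes_b)
    # Expected agreement if each coder assigned codes independently
    # at their own observed rates.
    expected = sum(
        counts_a[c] * counts_b[c] for c in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if expected == 1:
        return 1.0  # both coders used a single identical code throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes for whether four articles reported tester training.
original_coder = ["yes", "yes", "no", "no"]
double_coder = ["yes", "no", "no", "no"]
agreement = percent_agreement(original_coder, double_coder)
kappa = cohens_kappa(original_coder, double_coder)
```

Percent agreement is easy to interpret but does not correct for agreement expected by chance, which is why a chance-corrected index is usually reported alongside it.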

Results

Study Features

This study sought to determine the extent to which assessment fidelity is reported in reading intervention research conducted with students in Grades K–8 identified with LD, RD, or reading difficulties. We first present descriptive information (i.e., student characteristics, eligibility criteria related to screening measures administered, sample size, number and names of standardized measures, and number of informal measures) on the 46 studies meeting inclusion criteria in Table 1 and in the sections that follow.

Table 1

Description of studies included in the assessment fidelity literature synthesis

	Participant information					Measure information^a
Study	Grade level(s) and language background	No. of screening measures and eligibility criteria (related to screening measures)	n	Treatment n-size	Comparison n-size	Number of standardized measures	Names of standardized measures	Number of informal measures (not including surveys)
Aaron, Joshi, Gooden, & Bentum (2008) ^b	Grades: 2 to 5	Screening measures: 0	330	171	159	11	WLPB: Listening Comprehension, Passage Comprehension, Word Attack, Oral Vocabulary, Reading Vocabulary	4
	Language: All used English at home	Eligibility criteria: N/A					SDRT: Reading Comprehension
							TOWRE: Phonemic Decoding Efficiency, Sight Word Efficiency
							CTOPP: Elision, Rapid Color Naming, Rapid Letter Naming
Allinder, Dunse, Brunken, & Obermiller-Krolikowski (2001)	Grade: 7	Screening measures: 0	49	33	16	8	MAT: Reading Comprehension^e and Total Test Score^e	0
	Language: NR	Eligibility criteria: N/A					WRMT-R: Word Attack, Word Identification, Passage Comprehension
							CELF-3: Listening, Paragraphs
							Monitoring Basic Skills Progress-Reading: Maze^e
Allor & McCathren (2004)	Grade: 1	Screening measures: 2	243	137	106	8	CTOPP: Phoneme Elision	1
	Language: NR	Eligibility criteria: Reading just a few words per minute and lowest performing on CTOPP Phoneme Elision subtest					WJ-R: Word Identification, Word Attack, Passage Comprehension
							TOWRE: Phonemic Decoding Efficiency, Sight Word Efficiency
							DIBELS: Phoneme Segmentation Fluency, Nonsense Word Fluency
Berninger, Abbott, Vermeulen, & Fulton (2006) ^b	Grade: 2	Screening measures: 1	93	47	46	5	WRMT-R: Word Identification, Word Attack	3
	Language: 18 students spoke multiple languages in the home; 5 students spoke no English in the home	Eligibility criteria: Failed DRA at school district’s standards for Grade 2 reading					WISC-3: Verbal IQ
							DRA
							GMRT
Bhat, Griffin, & Sindelar (2003) ^c	Grades: 6 to 8	Screening measures: 1	40	20	20^c	9	LAC	0
	Language: NR	Eligibility criteria: Scored below 93 (minimum normative score for middle school) on the Lindamood Auditory Conceptualization Test					WRMT-R: Word Identification
							CTOPP: Elision, Blending Words, Nonword Repetition, Phoneme Reversal, Blending Nonwords, Segmenting Words, Segmenting Nonwords
Calhoon (2005)	Grades: 6 to 8	Screening measures: 4	38	18	20	4	WJ-III: Letter-Word Identification, Word Attack, Reading Fluency, Passage Comprehension	0
	Language: NR	Eligibility criteria: On an IEP with reading goals and reading three grade levels below current grade placement on WJ-III reading subtests
Calhoon, Sandow, & Hunter (2010)	Grades: 6 to 8	Screening measures: 6	90	61	29	7	WJ-III: Letter-Word Identification, Word Attack, Spelling, Fluency, Passage Comprehension	0
	Language: no students were receiving English language support	Eligibility criteria: Not on IEP for reading. Scored at or below 3.5 grade level on Gray Silent Reading Comprehension Test and two WJ-III subtests (Word Attack and Word Identification). Had IQ of 75 or above.					GSRT^e
							AIMSweb: Oral Reading Fluency
Cartledge, Yurick, Singh, Keyes, & Kourea (2011)	Grades: K to 2	Screening measures: 5	155	89^d	66	5	DIBELS: Letter Naming Fluency, Phoneme Segmentation Fluency, Nonsense Word Fluency, Oral Reading Fluency	0
	Language: 21.4% to 38.5% ELL (varied across years and treatment group; variety of first languages); none in comparison group ELL	Eligibility criteria: Treatment students: below benchmark on DIBELS Letter Naming Fluency, Phoneme Segmentation Fluency, and Nonsense Word Fluency; and low achieving on WJ-III Letter Word Identification and Word Attack. Comparison students were at or above benchmark.					WJ-III: Letter-Word Identification, Word Attack
Case et al. (2010)	Grade: 1	Screening measures: 2	30	15	15	5	CTOPP: Phoneme Elision, Rapid Letter Naming	4
	Language: NR	Eligibility criteria: Scored 4 or lower on Developmental Reading Assessment and lowest performing on informal measure of word identification fluency. Excluded if already receiving reading pullout.					WRMT-R: Word Attack and Word Identification
							DRA
Denton et al. (2010)	Grade: 1	Screening measures: 5	422	182	240	13	TPRI: Letter Sounds, Blending Phonemes, Word Reading	1
	Language: 25 ELL but receiving instruction in English	Eligibility criteria: Failed Texas Primary Reading Inventory measures (Letter Sounds, Blending Phonemes, Word Reading), scored 8 or less on task derived from WJ-III Letter Word Identification subtest, and read 8 or fewer words correct per minute on Grade 1 oral reading fluency passage.					WJ-III: Letter/Word Identification, Word Attack, Passage Comprehension, Spelling
							CTOPP: Blending Words, Blending Nonwords, Segmenting Words
							TOWRE: Sight Word Efficiency, Phonemic Decoding Efficiency
							CMERS: Oral Reading Fluency
Denton, Wexler, Vaughn, & Bryan (2008)	Grades: 6 to 8	Screening measures: 1	38	20	18	7	DIBELS: Oral Reading Fluency	0
	Language: 22 ELL	Eligibility criteria: Unable to read 80 words correct per minute on DIBELS ORF					WJ-III: Passage Comprehension, Letter-Word Identification, Word Attack, Spelling
							TOWRE: Sight Word Efficiency
							PPVT-III
Faggella-Luby & Wardwell (2011)	Grades: 5 to 6	Screening measures: 1	81	44	37	3	AIMSweb: Maze^e	1
	Language: 21 ELL	Eligibility criteria: Students selected for study if scores on Degrees of Reading Progress were below 48 for 5th graders and 52 for 6th graders.					GMRT-4^e
							CRMT: Degrees of Reading Progress
Gerber et al. (2004)	Grades: K–1	Screening measures: 1	43	28	15	2	WJ-III (English): Word Attack, Letter/Word Identification	6
	Language: All students began school as ELLs (native Spanish speaking)	Eligibility criteria: Spoke Spanish as a first language and performed in bottom 20% on researcher developed test of bilingual phonological skills
Guthrie et al. (2009)	Grade: 5	Screening measures: 0	156	94	62	2	GMRT: Comprehension Section	4
	Language: 9 ELL	Eligibility criteria: N/A					WJ-III: Reading Fluency Test
Helf, Cooke, & Flowers (2009) ^c	Grade: 1	Screening measures: 1	49	25	24	3	DIBELS: Phoneme Segmentation Fluency, Nonsense Word Fluency, Oral Reading Fluency	2
	Language: NR	Eligibility criteria: In the “strategic-additional intervention” level based on DIBELS fall benchmark
Hook, Macaruso, & Jones (2001)	Ages 7–12	Screening measures: 4	31	20	11	10	WRMT-R: Word Attack, Word Identification, Passage Comprehension	0
	Language: NR	Eligibility criteria: Had full scale IQ ≥80 and verbal IQ ≥90 as measured by WISC-3. Treatment students scored below the 16th percentile on WRMT Word Attack and/or Word Identification. Comparison group was low on phonological awareness as measured by the Lindamood Auditory Conceptualization Test.					LAC
							TOLD-P:3
							TOLD-I:2
							TAALD
							TOWS-3
							RAN-RAS
							WJ: Numbers Reversed
							WISC-3
Hudson, Isakson, Richman, Lane, & Arriaza-Allen (2011) ^c	Grade: 2	Screening measures: 2	56	27	29	5	DIBELS: ORF	1
	Language: 4 ELL	Eligibility criteria: Median score at or below 35th percentile on DIBELS ORF and at or above 45th percentile on WJ-III Picture Vocabulary					WJ-III: Reading Comprehension, Picture Vocabulary
							KTEA-II: Decoding Subtest
							TOWRE: Phonemic Decoding Efficiency
Joshi, Dahlgren, & Boulware-Gooden (2002)	Grade: 1	Screening measures: 0	56	24	32	3	TOPA	0
	Language: NR	Eligibility criteria: N/A					WRMT-R: Word Attack
							GMRT: Reading Comprehension Section
Kamps et al. (2007)	Grades: 1 to 2	Screening measures: 0	318	176	142	5	WRMT: Word Attack, Word Identification, Passage Comprehension	0
	Language: 170 ELL (99 Spanish first language; other 71 Somalian, Sudanese, Vietnamese)	Eligibility criteria: N/A					DIBELS: Nonsense Word Fluency, Oral Reading Fluency
Kim et al. (2006)	Grades: 6 to 8	Screening measures: 5	34	16	18	5	WRMT-R: Passage Comprehension, Word Attack, Word Identification	2
	Language: NR	Eligibility criteria: At the 2.5 grade level or above on WRMT Word Identification or Word Attack, and 1+ years below grade level on WRMT Passage Comprehension or Gates-MacGinitie					GMRT: Vocabulary, Passage Comprehension
Leafstedt, Richards, & Gerber (2004)	Grade: K	Screening measures: 0	62	16	46	5	WJ-III: Word Identification, Word Attack	3
	Language: 94% treatment group ELL (all Spanish first language); 90% comparison group ELL (50% Spanish first language)	Eligibility criteria: N/A					PPVT
							DIBELS: Nonsense Word Fluency, Phoneme Segmentation Fluency
Lovett et al. (2008)	Grades: 2 to 8	Screening measures: 3	166	122	44	18	PPVT	4
	Language: All ELL or EFL (at least nine different first languages); excluded ELLs with less than 2 years in residency	Eligibility criteria: Scored 1+ standard deviation below age level norm (standard score < 85) on WRAT reading, and WRMT Word Identification and Word Attack					CELF-3: Word Classes, Formulated Sentences, Recalling Sentences, Concepts & Directions
							WISC-3: Verbal IQ, Performance IQ
							RAN-RAS: Numbers, Letters
							CTOPP: Blending Words, Elision
							WRMT-R: Word Identification, Word Attack, Passage Comprehension
							WRAT-3: Reading, Spelling, Dictation, Arithmetic
Manset-Williamson & Nelson (2005) ^c	Grades: 4 to 8	Screening measures: 8	20	11	9	8	WJ-III: Word Attack, Word Identification, Fluency, Passage Comprehension	4
	Language: none ELL	Eligibility criteria: Had grade equivalent scores on WJ-III fluency and/or passage comprehension, and 75+ standard score on Reynolds Intellectual Screening Test. Scored 1+ standard deviations below the mean on CTOPP subtests (Phonological Awareness, Phonological Memory, Rapid Naming).					CTOPP: Phonological Awareness, Phonological Memory, Rapid Naming
							RIST
Mathes & Babyak (2001)	Grade: 1	Screening measures: 2	130	81	49	4	WRMT-R: Word Identification, Word Attack, Passage Comprehension	1
	Language: NR	Eligibility criteria: Highest achieving read 30+ words correct per minute (WCPM) on ORF, average achieving read 8–14 WCPM, and low achieving read 4 or fewer WCPM. Low achieving also scored poorly on DIBELS Phoneme Segmentation Fluency.					DIBELS: Phoneme Segmentation Fluency
Morris et al. (2012) ^d	Grades: 2 to 3	Screening measures: 6	279	279	N/A^f	11	WRMT-R: Word Identification, Word Attack, Passage Comprehension	0
	Language: none ELL	Eligibility criteria: Had KBIT composite score above 70. Scored at or below 16th percentile on WRAT reading or one of WRMT subtests (Passage Comprehension, Word Identification, Word Attack)					WRAT-3: Reading, Spelling, Arithmetic
							GORT-III
							KBIT: Matrices, Verbal Memory
							TOWRE: Sight Word Efficiency, Phonemic Decoding Efficiency
Nelson, Benner, & Gonzalez (2005)	Grade: K	Screening measures: 3	36	18	18	9	DIBELS: Letter Naming Fluency, Phoneme Segmentation Fluency, Initial Sound Fluency, Nonsense Word Fluency	1
	Language: 2 ELL	Eligibility criteria: T-scores of 60+ on teacher-completed scales of maladaptive behavior, DIBELS Phoneme Segmentation Fluency of 18 or fewer, and DIBELS Letter Naming Fluency of <27					CTOPP: Elision, Blending Words, Sound Matching, Rapid Color Naming, Rapid Object Naming
O’Connor, Fulmer, Harty, & Bell (2005)	Grades: K to 3	Screening measures: 0	409	206	203	4	WRMT-R: Word Identification, Word Attack, Passage Comprehension	3
	Language: NR	Eligibility criteria: N/A					PPVT-III
Osborn et al. (2007)	Grade: 2	Screening measures: 0	359	180	179	5	DIBELS: ORF	0
	Language: NR	Eligibility criteria: N/A					WJ-III: Letter/Word Identification, Reading Fluency, Passage Comprehension, Word Attack
Oudeans (2003) ^c	Grade: K	Screening measures: 1	55	28	27	6	DIBELS: Letter Naming Fluency, Onset Recognition Fluency, Phoneme Segmentation Fluency, Nonsense Word Fluency	2
	Language: NR	Eligibility criteria: Read 5 or fewer words on WRMT Word Identification					PPVT-R
							WRMT-R: Word ID
Ritter & Saxon (2011)	Grade: 1	Screening measures: 0	59	30	29	1	TPRI	1
	Language: None ELL	Eligibility criteria: N/A
Santoro, Coyne, & Simmons (2006)	Grade: K	Screening measures: 2	116	NR	NR	5	DIBELS: Letter Naming Fluency, Initial Sound Fluency	4
	Language: None significantly limited in English	Eligibility criteria: Scored lower than 25th percentile in district on both DIBELS fall Letter Naming Fluency and Initial Sound Fluency					PPVT-R
							WRMT-R: Word ID, Word Attack
Simmons et al. (2007)	Grade: K	Screening measures: 2	96	66	30	8	DIBELS: Letter Naming Fluency, Onset Recognition Fluency, Phoneme Segmentation Fluency, Nonsense Word Fluency	2
	Language: None ELL	Eligibility criteria: Scored at or below 25th percentile on DIBELS Letter Naming Fluency and Onset Recognition Fluency					WRMT-R: Word Attack and Word Identification
							PPVT-R
							YSTPS
Spencer & Manis (2010)	Grades: 6 to 8	Screening measures: 2	60	34	26	10	WRMT-R/NU: Word ID, Word Attack, Passage Comprehension	0
	Language: 24 ELL	Eligibility criteria: Scored below 63 (raw score of fourth grade equivalent) on WRMT-R Word Identification. If scored 63–78 on WRMT, administered GORT-III to determine those not able to read fluently at fifth-grade level					GORT-III
							WISC-IV: Similarities, Vocabulary
							TOWRE: Sight Word Efficiency, Phonemic Decoding Efficiency
							CTOPP: Rapid Letter Naming, Phoneme Elision
Therrien, Wickstrom, & Jones (2006)	Grades: 4, 5, 7 and 8	Screening measures: 1+ (unknown test to determine reading level)	29	15	14	2	DIBELS: Oral Reading Fluency	2
	Language: NR	Eligibility criteria: Identified as LD in reading via discrepancy model or identified at-risk if reading at least two grade levels below current placement. Students with reading levels below 1st or above 4th were excluded.					WJ-III: Broad Reading
Thompson & Davis (2002) ^c	Grade: 2	Screening measures: 1+ (unknown number of subtests on TPRI)	84	NR	NR	6+ (unknown number of subtests on TPRI)	TPRI	4
	Language: 40 ELL with English and Spanish proficiency ranging from negligible to fluent	Eligibility criteria: Students who received a “still developing” rating on the Texas Primary Reading Inventory					WLS
							WRMT-R: Word Attack, Passage Comprehension
							TORF
							DIBELS: Phoneme Segmentation Fluency
Torgesen et al. (2001) ^c	Ages 8 to 10	Screening measures: 3	60	30	30	16	CTOPP: Phoneme Elision, Nonword Repetition, Memory for Digits, Rapid Digit Naming, Rapid Letter Naming	0
	Language: None ELL	Eligibility criteria: 1.5+ standard deviations below mean on WRMT-R and below minimum grade performance on Lindamood Auditory Conceptualization Test. Full scale WISC > 75.					WRMT-R: Word ID, Word Attack, Passage Comprehension
							TOWRE: Phonemic Decoding Efficiency, Sight Word Efficiency
							GORT-III
							KTEA: Spelling
							WJ-R: Calculation
							CELF-3: Expressive and Receptive Language Skills
							LAC
							WISC-R
Torgesen, Wagner, Rashotte, Herron, & Lindamood (2010)	Grade: 1	Screening measures: 5+ (unknown test to determine verbal IQ)	112	72	40	15+ (did not identify tests used to determine verbal IQ)	SBIT-4: Vocabulary Subtest	2
	Language: NR	Eligibility criteria: Performed in bottom 35% on letter-sound knowledge as determined by Stanford-Binet Vocabulary and CTOPP subtests (Phoneme Elision, Rapid Digit Naming, and Rapid Letter Naming). Had highest probability of reading difficulty based on combined score of CTOPP subtests. Verbal IQ > 75.					CTOPP: Phoneme Elision, Blending Words, Segmenting Words, Rapid Digit Naming, Rapid Letter Naming
							WRMT-R: Word Identification, Word Attack, Passage Comprehension
							TOWRE: Word Reading Efficiency, Phonemic Decoding Efficiency
							GORT-3: Text Reading Accuracy, Text Reading Fluency, Reading Comprehension
							WRAT-R: Spelling
Ukrainetz, Ross, & Harm (2009)	Grade: K	Screening measures: 2	41	28	13	8	PAT: First Phoneme Isolating, Last Phoneme Isolating, Phoneme Segmenting, Phoneme Blending	1
	Language: 22 ELL	Eligibility criteria: Below grade-level on both total DIBELS and DIBELS Initial Sound Fluency					TOPA—2nd Edition Plus
							TOSS
							DIBELS: Letter Naming Fluency
Vadasy, Sanders, & Peyton (2005)	Grade: 1	Screening measures: 1+	57	38	19	9	PPVT-III	5
	Language: 34 ELL	Eligibility criteria: WRAT Reading score at or below 25th percentile; triads of students matched on the basis of “a pretest composite score calculated by averaging the z scores of all pretest scores” (p. 367)					CTOPP: Nonword Repetition
							WRAT-R: Reading, Spelling
							WRMT-R/NU: Word Attack, Word Identification
							TOWRE: Phonemic Decoding, Sight Word, Passage Comprehension
Vadasy, Sanders, Peyton, & Jenkins (2002)	Grades: 1 to 2	Screening measures: 1	65	49	16	7	WRAT-R: Reading, Spelling	5
	Language: 24 ELL	Eligibility criteria: 90 or lower standard score (25th percentile) on WRAT Reading					PPVT-R
							WRMT-R: Word Identification, Word Attack
							TOWRE: Sight Word Efficiency, Phonemic Decoding
Vaughn et al. (2010)	Grade: 6	Screening measures: 1+ (unknown test to determine reading level)	576 (327 struggling readers + 249 typically achieving students)	212	115 comparison struggling readers + 249 typically achieving students	11	TAKS^e	2
	Language: NR	Eligibility criteria: Struggling reader as defined by TAKS scale score ≤ 2,150 or reading on a second grade level or lower.					WJ-III: Letter/word ID, Word Attack, and Passage Comprehension
							TOWRE: Phonemic Decoding Efficiency, Sight Word Efficiency
							AIMSweb: Maze^e
							TOSRE
							GRADE: Passage Comprehension Subtest^e
							KBIT-2: Matrices, Verbal Knowledge
Vaughn et al. (2006)	Grade: 1	Screening measures: 2	64	31	33	29	CTOPP: Phoneme Elision, Blending Words, Blending Nonwords, Segmenting Words, Sound Matching, Nonword Repetition, Rapid Letter Naming	5
	Language: all ELL (Spanish first language)	Eligibility criteria: Spanish speaking EL, scored below 25th percentile on WLPB-R or WLPB-R Spanish Letter Name Identification, and inability to read more than 1 word of 5 on experimental word reading list in Spanish					TOPP-S: Phoneme Elision, Blending Words, Blending Nonwords, Segmenting Words, Sound Matching, Nonword Repetition, Rapid Letter Naming
							WLPB-R: Word Attack, Passage Comprehension, Listening Comprehension, Picture Vocabulary, Verbal Analogies, Memory for Sentences, Letter Name Identification
							WLPB-R Spanish: Word Attack, Passage Comprehension, Listening Comprehension, Picture Vocabulary, Verbal Analogies, Memory for Sentences, Letter Name Identification
							DIBELS: Oral Reading Fluency
							IDEL: Oral Reading Fluency
Vaughn et al. (2011)	Grades: 7 to 8	Screening measures: 1	133	97	36	7	WJ-III: Letter Word ID, Word Attack, Spelling, Passage Comprehension	0
	Language: 21% treatment group ELL; 20% comparison group ELL	Eligibility criteria: TAKS scale score < 2,150					TOWRE: Phonemic Decoding Efficiency, Sight Word Efficiency
							TAKS^e
Vernon-Feagans et al. (2010)	Grades: K to 1	Screening measures: 2	183	76	107	3	WJ-III: Word Attack, Letter Word Identification	2
	Language: NR	Eligibility criteria: Below grade level on state mandated tests (phonological awareness, phonics, print awareness, fluency) and researcher developed screening instrument					PPVT-III
Wanzek, Vaughn, Roberts, & Fletcher (2011)	Grades: 6 to 8	Screening measures: 1	120	65	55	6	WJ-III: Letter/Word Identification, Word Attack, Passage Comprehension	0
	Language: NR	Eligibility criteria: Took state developed alternative assessment or had TAKS scale score below cut point or within one-half SEM above the passing score					TOWRE: Sight Word Efficiency, Phonemic Decoding
							TAKS^e
Wanzek & Roberts (2012)	Grade: 4	Screening measures: 1	87	64	23	6	GMRT: Comprehension, Vocabulary	0
	Language: 54 ELL	Eligibility criteria: At or below 25th percentile on Gates MacGinitie Reading Comprehension					WJ-III: Letter Word Identification, Word Attack, Oral/Listening Comprehension, Passage Comprehension

Note. N/A = not applicable. NR = not reported. CELF-3 = Clinical Evaluation of Language Fundamentals–3; CMERS = Comprehensive Monitoring of Early Reading Skills; CRMT = Connecticut Reading Mastery Test; CTOPP = Comprehensive Test of Phonological Processing; DIBELS = Dynamic Indicators of Basic Early Literacy Skills; DRA = Developmental Reading Assessment; EFL = English as a foreign language; ELL = English language learner; GMRT = Gates-MacGinitie Reading Comprehension Test; GORT-III = Gray Oral Reading Test 3rd Edition; GSRT = Gray Silent Reading Test; IDEL = Indicadores Dinámicos del Éxito en la Lectura; KBIT = Kaufman Brief Intelligence Test; KBIT-2 = Kaufman Brief Intelligence Test 2nd Edition; KTEA = Kaufman Test of Educational Achievement; KTEA-II = Kaufman Test of Educational Achievement 2nd Edition; LAC = Lindamood Auditory Conceptualization Test; MAT = Metropolitan Achievement Test; MBSPR = Monitoring Basic Skills Progress Reading; PAT = Phonological Awareness Test; PPVT = Peabody Picture Vocabulary Test; PPVT-R = Peabody Picture Vocabulary Test Revised; PPVT-III = Peabody Picture Vocabulary Test 3rd Edition; RAN-RAS = Test of Rapid Automatic Naming and Rapid Alternating Stimulus; RIST = Reynolds Intellectual Screening Test; SBIT-4 = Stanford Binet Intelligence Test 4th Edition; SDRT = Stanford Diagnostic Reading Test; TAKS = Texas Assessment of Knowledge and Skills; TAALD = Test of Adolescent and Adult Language Development; TOLD-P:3 = Test of Language Development Primary 3rd Edition; TOLD-I:2 = Test of Language Development Intermediate 2nd Edition; TOPA = Test of Phonemic Awareness; TOPP-S = Test of Phonological Processing in Spanish; TORF = Test of Oral Reading Fluency; TOSS = Test of Semantic Skills-Primary; TOSRE = Test of Sentence Reading Efficiency; TOWRE = Test of Word Reading Efficiency; TOWS-3 = Test of Written Spelling 3rd Edition; TPRI = Texas Primary Reading Inventory; WISC-3 = Wechsler Intelligence Scale for Children 3rd Edition; WISC-4 = Wechsler Intelligence Scale for Children 4th Edition; WJ = Woodcock Johnson Tests of Cognitive Abilities; WJ-R = Woodcock Johnson Tests of Cognitive Abilities Revised; WJ-III = Woodcock Johnson Test of Cognitive Abilities 3rd Edition; WLPB = Woodcock Language Proficiency Battery; WLPB-R = Woodcock Language Proficiency Battery Revised; WLPB-R Spanish = Woodcock Language Proficiency Battery Revised Spanish Edition; WLS = Woodcock Language Survey; WRAT-3 = Wide Range Achievement Test 3rd Edition; WRMT = Woodcock Reading Mastery Test; WRMT-R = Woodcock Reading Mastery Test Revised; WRMT-R/NU = Woodcock Reading Mastery Test Revised/Normative Update; YSTPS = Yopp-Singer Test of Phoneme Segmentation.

^a Measures include screeners, pretests, posttests, interim/PM tests, and informal measures. Among the screeners, pre, post, interim/PM, there could be extant data. Informal measures refer to researcher-developed measures or those without reliability and validity data.

^b Study 2 coded here.

^c Lagged treatment design (control group received treatment after serving as control for first treatment group).

^d Longitudinal design.

^e Group administered measure.

^f Used growth curve modeling, and subsequently subjects served as their own control.

Student Characteristics

The search criteria allowed for studies conducted in Grades K–8, and every grade in that range, or its age equivalent (e.g., ages 5–13), was represented in the final corpus. As shown in Figure 1, Grades K–2 and 6–8 were included slightly more often than Grades 3–5. Note that a single study may have had participants from multiple grade levels.

Figure 1.

Number of studies in which each grade level was included.

To better define the students with reading difficulties participating in the studies, we coded for participants’ first language. Eight studies did not include students who were English language learners (ELLs; Aaron, Joshi, Gooden, & Bentum, 2008; Calhoon, Sandow, & Hunter, 2010; Manset-Williamson & Nelson, 2005; Morris et al., 2012; Ritter & Saxon, 2011; Santoro, Coyne, & Simmons, 2006; Simmons et al., 2007; Torgesen et al., 2001), and three studies had only ELL participants. In two of those studies, students’ first language was Spanish (Gerber et al., 2004; Vaughn et al., 2006). In the third study, the students had a variety of first languages (Lovett et al., 2008). Another 17 studies included some ELLs. In one study, students’ first language was Spanish (Thompson & Davis, 2002), four studies had multiple first languages (Berninger, Abbott, Vermeulen, & Fulton, 2006; Cartledge, Yurick, Singh, Keyes, & Kourea, 2011; Kamps et al., 2007; Leafstedt, Richards, & Gerber, 2004), and 12 studies did not report the first languages of the ELLs. The remaining 18 studies did not provide any information on the language backgrounds of their participants.

Eligibility Criteria Related to Screening Measures

Individual subtests were counted separately when tabulating screening measures administered to determine student eligibility for participation in the study or placement into treatment/comparison groups. Nine studies did not include this use of screening measures for participant eligibility (Aaron et al., 2008; Allinder, Dunse, Brunken, & Obermiller-Krolikowski, 2001; Guthrie et al., 2009; Joshi, Dahlgren, & Boulware-Gooden, 2002; Kamps et al., 2007; Leafstedt et al., 2004; O’Connor, Fulmer, Harty, & Bell, 2005; Osborn et al., 2007; Ritter & Saxon, 2011). Among the remaining 37 studies, the number of screening measures described ranged from one (Berninger et al., 2006; Bhat, Griffin, & Sindelar, 2003; Denton, Wexler, Vaughn, & Bryan, 2008; Faggella-Luby & Wardwell, 2011; Gerber et al., 2004; Helf, Cooke, & Flowers, 2009; Oudeans, 2003; Vadasy, Sanders, Peyton, & Jenkins, 2002; Vaughn et al., 2011; Wanzek, Vaughn, Roberts, & Fletcher, 2011; Wanzek & Roberts, 2012) to eight (Manset-Williamson & Nelson, 2005). However, six studies did not clearly identify all tests or subtests administered, so their total number of screening measures is likely larger than the number listed in Table 1 (Therrien, Wickstrom, & Jones, 2006; Thompson & Davis, 2002; Torgesen, Wagner, Rashotte, Herron, & Lindamood, 2010; Vadasy, Sanders, & Peyton, 2005; Vaughn et al., 2010). Six studies relied on extant state assessment data obtained from the schools for eligibility determination (Berninger et al., 2006; Denton et al., 2010; Thompson & Davis, 2002; Vaughn et al., 2010; Wanzek et al., 2011).

Sample Size

The total number of student participants as well as the number of students assigned to the treatment and comparison conditions were determined by the final sample sizes reported for the end of the study. The total number ranged from a low of 20 (Manset-Williamson & Nelson, 2005) to a high of 576 (Vaughn et al., 2010), with a median of 86 students. Those assigned to the treatment condition often were administered more tests, such as additional pretests (e.g., Hook, Macaruso, & Jones, 2001) or progress monitoring (e.g., Leafstedt et al., 2004), that students in the comparison condition did not take. The number of treatment students ranged from 11 (Manset-Williamson & Nelson, 2005) to 279 (Morris et al., 2012), with a median of 47. Two studies did not clearly report the treatment and comparison sample sizes (Santoro et al., 2006; Thompson & Davis, 2002). One study employed growth curve modeling with an unreported number of students serving as their own controls (Morris et al., 2012).

Standardized Measures

Standardized measures were defined as publicly or commercially available instruments with norming data and an established technical adequacy. Each subtest of an assessment was counted separately when determining the total number of standardized measures administered as a screener, pretest, progress monitor, interim assessment, or posttest. Even if a measure was administered at more than one point during the study, it was only counted once in the tabulation of the overall total. The number of standardized measures administered ranged from one (Ritter & Saxon, 2011) to 29 (Vaughn et al., 2006), with a median of seven. Two studies did not clearly identify all subtests of a given measure (Thompson & Davis, 2002) or the particular verbal intelligence test administered (Torgesen et al., 2010), so the total number of standardized measures for these studies could be greater than that reported in Table 1.
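The counting rule described above (each subtest counted separately, and a measure counted only once even when it was given at several testing points) can be illustrated with a short Python sketch. This is not the authors' coding tool: the function name `count_standardized_measures` and the example records are hypothetical and serve only to show the rule.

```python
def count_standardized_measures(administrations):
    """Count unique (test, subtest) pairs across all testing points.

    `administrations` is an iterable of (testing_point, test, subtest)
    tuples. Each subtest counts separately, and a subtest administered at
    more than one testing point is counted only once.
    """
    unique = {(test, subtest) for _point, test, subtest in administrations}
    return len(unique)


# Hypothetical administration records, for illustration only.
example = [
    ("pretest", "WRMT-R", "Word Identification"),
    ("pretest", "WRMT-R", "Word Attack"),
    ("posttest", "WRMT-R", "Word Identification"),  # repeat, not recounted
    ("posttest", "WRMT-R", "Word Attack"),          # repeat, not recounted
    ("pretest", "DIBELS", "Phoneme Segmentation Fluency"),
]

print(count_standardized_measures(example))  # prints 3
```

Using a set of (test, subtest) pairs makes the "counted once" rule automatic, because the testing point is dropped before duplicates are removed.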

One or more subtests of approximately 40 different standardized measures were administered across the 46 studies in the corpus. Assessments that were used most frequently (i.e., in 10 or more studies) included Comprehensive Test of Phonological Processing, Dynamic Indicators of Basic Early Literacy Skills, Peabody Picture Vocabulary Test, Test of Word Reading Efficiency, Woodcock Johnson Test of Achievement, and Woodcock Reading Mastery Test.

Informal Measures

Informal measures were defined as those that were researcher-developed, did not have established reliability and validity, or were not publicly available. Although studies were excluded if they only administered informal measures, 30 studies in the corpus administered one or more informal measures in addition to standardized measures. Therefore, we tabulated these measures to better define the breadth of testing conducted in studies, but we did not include the informal measures in the other analyses. The number of informal measures ranged from one (Allor & McCathren, 2004; Denton et al., 2010; Faggella-Luby & Wardwell, 2011; Hudson, Isakson, Richman, Lane, & Arriaza-Allen, 2011; Mathes & Babyak, 2001; Nelson, Benner, & Gonzalez, 2005; Ritter & Saxon, 2011; Ukrainetz, Ross, & Harm, 2009) to six (Gerber et al., 2004), with a median of two. Whenever a test was not clearly identified, it was counted as an informal measure. For example, Ritter and Saxon (2011) obtained an unspecified “reading fluency score” (p. 6) at posttest only. Although this could have been one of the 10 subtests of the Texas Primary Reading Inventory (TPRI) administered, the task was not clearly identified and was described as being used in addition to TPRI at posttest. Hence, it was counted as an informal measure.

Features of Assessment Fidelity

Based on previous work (Cummings et al., 2013; Reed & Sturges, 2012), we analyzed the 46 studies for features hypothesized to influence the fidelity with which assessments might be administered (e.g., tester expertise, training of testers, monitoring of administrations, whether the testers were blind to condition, alternating test forms, number of testing days, distractions in the environment, and verification of scoring). Table 2 presents the coding results using symbols to represent the completeness of the information provided. A solid circle (●) was used to indicate the study provided all (i.e., 100%) of the details indicated in the parentheses of the column label. A partially completed circle (◉) was used for studies that provided some but not all of the details for that feature, and an open circle (○) was used when none of the information on that feature of assessment fidelity was reported. In the sections below, we describe the results for each feature.
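The three-symbol scheme just described amounts to a simple decision rule, sketched below in Python. The function name `completeness_symbol` and its arguments are hypothetical; the sketch assumes each feature is coded as a count of details reported out of the details named in the column label, which is one plausible reading of the scheme rather than the authors' documented procedure.

```python
def completeness_symbol(details_reported, details_coded):
    """Map reporting completeness for one feature to a Table 2 symbol.

    details_coded: number of details named in the column label (assumed > 0).
    details_reported: how many of those details the study reported.
    """
    if details_coded <= 0:
        raise ValueError("details_coded must be positive")
    if not 0 <= details_reported <= details_coded:
        raise ValueError("details_reported must be between 0 and details_coded")
    if details_reported == details_coded:
        return "●"  # all details provided
    if details_reported == 0:
        return "○"  # none of the details provided
    return "◉"      # some but not all details provided


# Illustration: a feature with two coded details, such as the role and
# number of testers.
print(completeness_symbol(2, 2))  # prints ●
print(completeness_symbol(1, 2))  # prints ◉
print(completeness_symbol(0, 2))  # prints ○
```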

Table 2

Completeness in reporting assessment fidelity factors

Study	Test administrations conducted by research team (100% of measures)	Description of testers (role and number)	Initial training of testers (how trained, completed practice administration, met reliability standard)	Follow-up training of testers (calibration during testing window)	Monitoring of testers (100% of administrations)	Description of testing procedures (blind to condition, order of tests, form used, how many sessions for all tests)	Description of testing environment (where tested, distractions or others in room)	Verification of scoring (100% of data double scored; high interrater reliability)
Aaron, Joshi, Gooden, & Bentum (2008)	◉	◉	○	○	○	○	○	○
Allinder, Dunse, Brunken, & Obermiller-Krolikowski (2001)	◉	○	○	○	○	◉	○	○
Allor & McCathren (2004)	●	○	○	○	○	◉	○	○
Berninger, Abbott, Vermeulen, & Fulton (2006)	●	○	○	○	○	○	○	○
Bhat, Griffin, & Sindelar (2003)	●	○	○	○	○	○	○	○
Calhoon (2005)	◉	◉	○	○	○	◉	◉	◉
Calhoon, Sandow, & Hunter (2010)	◉	◉	○	○	○	◉	◉	◉
Cartledge, Yurick, Singh, Keyes, & Kourea (2011)	●	○	○	○	○	◉	◉	○
Case et al. (2010)	◉	◉	○	○	○	◉	○	○
Denton et al. (2010)	●	○	○	○	○	○	○	○
Denton, Wexler, Vaughn, & Bryan (2008)	●	◉	◉	○	○	◉	○	○
Faggella-Luby & Wardwell (2011)	◉	◉	○	○	○	◉	◉	◉
Gerber et al. (2004)	●	◉	◉	○	●	◉	◉	◉
Guthrie et al. (2009)	●	◉	○	○	◉	◉	◉	○
Helf, Cooke, & Flowers (2009)	●	○	○	○	○	◉	◉	○
Hook, Macaruso, & Jones (2001)	◉	○	○	○	○	◉	○	○
Hudson et al. (2011)	●	○	○	○	○	○	○	◉
Joshi, Dahlgren, & Boulware-Gooden (2002)	●	◉	○	○	○	◉	○	○
Kamps et al. (2007)	●	○	○	○	○	○	○	○
Kim et al. (2006)	●	◉	○	○	○	○	○	○
Leafstedt, Richards, & Gerber (2004)	●	◉	○	○	○	◉	◉	○
Lovett et al. (2008)	●	○	○	○	○	◉	○	○
Manset-Williamson & Nelson (2005)	◉	◉	○	○	○	◉	◉	○
Mathes & Babyak (2001)	●	○	○	○	○	◉	○	○
Morris et al. (2012)	●	○	○	○	○	◉	○	○
Nelson, Benner, & Gonzalez (2005)	●	○	○	○	○	○	○	○
O’Connor, Fulmer, Harty, & Bell (2005)	●	◉	◉	○	◉	○	○	○
Osborn et al. (2007)	●	◉	○	○	○	○	◉	○
Oudeans (2003)	●	○	○	○	○	○	○	○
Ritter & Saxon (2011)	○	◉	○	○	○	○	○	○
Santoro, Coyne, & Simmons (2006)	●	○	○	○	○	○	○	○
Simmons et al. (2007)	●	○	○	○	○	○	○	○
Spencer & Manis (2010)	●	◉	◉	○	○	◉	○	○
Therrien, Wickstrom, & Jones (2006)	●	●	◉	○	○	◉	○	○
Thompson & Davis (2002)	◉	○	○	○	○	○	○	○
Torgesen et al. (2001)	◉	○	○	○	○	○	○	○
Torgesen, Wagner, Rashotte, Herron, & Lindamood (2010)	●	○	○	○	○	○	○	○
Ukrainetz, Ross, & Harm (2009)	◉	◉	○	○	●	◉	◉	◉
Vadasy, Sanders, & Peyton (2005)	◉	○	◉	○	●	◉	○	○
Vadasy, Sanders, Peyton, & Jenkins (2002)	●	○	○	○	○	○	○	○
Vaughn et al. (2010)	◉	○	○	○	○	○	○	○
Vaughn et al. (2006)	●	○	○	○	○	◉	○	○
Vaughn et al. (2011)	◉	◉	○	○	○	○	○	○
Vernon-Feagans et al. (2010)	●	◉	◉	○	○	◉	○	○
Wanzek, Vaughn, Roberts, & Fletcher (2011)	◉	○	○	○	○	○	○	○
Wanzek & Roberts (2012)	●	◉	◉	○	○	◉	○	○

Note. ●= the study provided all (i.e., 100%) of the details indicated in the parentheses of the column label; ◉ = the study provided some but not all of the details indicated in the parentheses of the column label; ○ = the study provided none of the details indicated in the parentheses of the column label.

Test Administrations Conducted by Research Team

Study authors were more likely to report whether the research team administered the tests than any other aspect of assessment fidelity coded. Of the 46 studies, 30 specified collecting all data. Another 15 indicated some of the data were archival from state assessments (Allinder et al., 2001; Faggella-Luby & Wardwell, 2011; Thompson & Davis, 2002; Vaughn et al., 2010; Vaughn et al., 2011; Wanzek et al., 2011), locally mandated tests (Case et al., 2010), special education records (Aaron et al., 2008), intelligence tests (Calhoon, 2005; Calhoon et al., 2010; Hook et al., 2001; Torgesen et al., 2001), or an unspecified testing requirement resulting in accessible records (Manset-Williamson & Nelson, 2005; Vadasy et al., 2005; Ukrainetz et al., 2009). In one study, the only data reportedly collected from a standardized measure were from a state-required assessment administered by classroom teachers under typical conditions (Ritter & Saxon, 2011).

Description of Testers

The majority of studies (n = 26) did not provide any information on the individuals who administered the assessments to study participants. Only one study reported all coded details of this assessment fidelity feature. Therrien et al. (2006) described having two graduate assistants administer all pre- and posttests, yielding a tester–student ratio of approximately 1:15. The information derived from the 28 studies with a partial description of the testers is provided by the type of detail coded.

Role of testers

When clearly reported for all measures administered (excluding the gathering of extant data), testers were described as playing various roles in the study. In six studies, the testers were all graduate students or graduate research assistants (Calhoon, 2005; Calhoon et al., 2010; Case et al., 2010; Denton et al., 2008; Osborn et al., 2007; Vernon-Feagans et al., 2010). Testers also were identified as being exclusively bilingual undergraduates (Gerber et al., 2004), the study authors (O’Connor et al., 2005), or classroom teachers (Guthrie et al., 2009; Ritter & Saxon, 2011). Unspecified research assistants or research team members were employed in three studies (Faggella-Luby & Wardwell, 2011; Spencer & Manis, 2010; Wanzek & Roberts, 2012). Another five studies had testers who served in a combination of roles: bilingual researcher or bilingual undergraduate assistant (Leafstedt et al., 2004); tutors, graduate students, or the principal investigators (Manset-Williamson & Nelson, 2005); study authors and research assistants (Kim et al., 2006; Ukrainetz et al., 2009); and graduate students or school psychologists (Aaron et al., 2008).

The remaining six studies in the corpus providing some information about this detail inconsistently described the role of the testers they employed. This included what may have only been a semantic change from researchers to educators on an individual posttest (Vaughn et al., 2011). Other differences seemed more substantive such as having the authors administer all pretests and one interim assessment, but not describing the testers for another interim measure and employing “two education majors and a field supervisor in teacher preparation at the university” for posttests (O’Connor et al., 2005, p. 445). In three studies, the testers were not reported for a particular screening measure (Denton et al., 2008), or for the pre- and posttests (Denton et al., 2010; Lovett et al., 2008). The testers that were identified in these studies included “research psychometrists” from a clinic (Lovett et al., 2008, p. 336) administering the screening measures or graduate students administering the pre- and posttests (Denton et al., 2008) and research assistants (Denton et al., 2010).

Number of testers

Rarely within the corpus was the number of testers employed clearly identified such that a calculation of the tester–student ratio could be made. With the exception of the extant data gathered, Case et al. (2010) maintained a very low tester–student ratio of approximately 1:8 by consistently using four testers. Another article described having the four teachers administer all assessments to about 15 students in each of their classes (Ritter & Saxon, 2011). Finally, one study employed different testers for different instruments and testing points, resulting in a variable tester–student ratio (O’Connor et al., 2005). For the pretests and one interim measure in this study, there were four testers or about 1 tester per 102 students. Information was not provided about the testers of the other interim measure. At posttest, there were three different testers, yielding a tester–student ratio of approximately 1:136.

In two studies providing some information about the examiners, the number administering tests could not be determined definitively. For example, different sections of one article seemed to indicate there were six, eight, or nine teachers who might have been involved in testing their students (Guthrie et al., 2009). Another article noted, “Testing was carried out by the three authors and three research assistants” (p. 90), but did not specify whether or not each person administered all tests (Ukrainetz et al., 2009). Therefore, the tester–student ratio for individually administered assessments is unknown, but the authors did indicate the group administered measures were delivered to three to nine students at a time. When the group size exceeded three children, two testers were present in the room. Based on these figures, each tester would have monitored no more than five students at a time. Similarly, a final study providing partial information on the testers did not identify the total number of examiners for the individually administered measures, but the authors described conducting the group administered tests with five to seven of the 56 student participants at a time (Joshi et al., 2002).

Initial Training of Testers

Most studies (n = 38) did not report whether the testers were trained, completed practice administrations, or met a reliability standard. However, eight studies provided at least some information about the coded details. Half of these stated the testers were trained (Denton et al., 2008), “trained by the lead researcher” (Spencer & Manis, 2010, p. 78), trained “by graduate research assistants” (Gerber et al., 2004, p. 243), or trained “as part of their graduate training” (Therrien et al., 2006, p. 93). However, no other information about the training was provided. Somewhat more description was given in the other four studies, which all reported holding practice administrations. The practice might have been supervised (Vadasy et al., 2005), conducted with a partner following a script (Wanzek & Roberts, 2012), enacted in totality with nonparticipating students (Vernon-Feagans et al., 2010), or conducted on researchers role playing student behaviors that trigger basal and ceiling rules (O’Connor et al., 2005). Only Wanzek and Roberts (2012) reported the reliability standard (90%) the testers had to meet in their practice session.

Two studies provided an indication of the amount of time spent training examiners. Although the exact number of hours was not stated, the testers participated in two sessions (O’Connor et al., 2005) or 2 days of training (Vernon-Feagans et al., 2010). In the latter, the authors also described the examiners as having previous experience administering assessments. This was noted for the researchers in the O’Connor et al. (2005) study who conducted the midyear testing, but the posttesting was done by college students who were trained by the researchers. Other than offering practice scenarios, little other information was provided about the content of the training. Testers in one study were provided stopwatches and sample passages to practice scoring (O’Connor et al., 2005), and in another the trainers reportedly followed “the instructions in the user’s manual” and included a combination of explanation and modeling (Vadasy et al., 2005, p. 372).

Follow-Up Training of Testers

In none of the 46 studies identified for inclusion in this narrative synthesis was there an indication that testers received follow-up training during a given testing period. This would have included efforts to recheck reliability and calibrate testers as needed.

Monitoring of Testers

After follow-up training, this feature of assessment fidelity was least often reported. One study monitored testers’ first two administrations but not the remaining testing, which involved more than 400 students (O’Connor et al., 2005). Three studies with smaller sample sizes (n = 41–57) reported monitoring all administrations. This was done through the supervision of graduate research assistants (Gerber et al., 2004) or research staff (Vadasy et al., 2005), or the observations of other testers (Ukrainetz et al., 2009). In one study, teachers who were testing their students were assisted by researchers, but monitoring for fidelity was not reported (Guthrie et al., 2009). Although O’Connor et al. (2005) noted that no feedback was needed on the two administrations that were observed, no authors offered information on whether or not any anomalies occurred during an entire assessment window.

Description of Testing Procedures

No studies in the corpus provided information on all coded details about the testing procedures: whether testers were blind to participants’ conditions, the form used, or how many minutes or sessions were needed to administer all tests. However, 25 studies reported some of the details. Of these, nine studies described the procedures for all tests administered (Cartledge et al., 2011; Gerber et al., 2004; Guthrie et al., 2009; Helf et al., 2009; Joshi et al., 2002; Leafstedt et al., 2004; Lovett et al., 2008; Spencer & Manis, 2010; Vernon-Feagans et al., 2010). The remaining 16 studies described the procedures for some but not all of the administrations resulting in data used to identify participants, establish baseline performance, monitor students’ progress, or determine treatment effectiveness. Information reported is described in the sections below.

Blind to condition

Rarely did authors directly state whether or not the assessors were aware of the students’ assignment to treatment or comparison conditions. In two studies, it was noted that research assistants did not test students to whom they were delivering the intervention (Case et al., 2010) or were otherwise blind to students’ conditions (Vernon-Feagans et al., 2010). Another author identified all testers as being blind to condition at pretest, but not at mid- or posttest (Ukrainetz et al., 2009). However, researchers described precautions taken to detect bias such as having observers in situ and reviewing audiotaped administrations. Manset-Williamson and Nelson (2005) referred to blinding the condition when scoring students’ retell responses, but the tutors and principal investigators administered measures. Those individuals would have been aware of the instruction delivered to each student. Similarly, students in four studies were assessed by their classroom teachers (Guthrie et al., 2009; Ritter & Saxon, 2011), the study authors and research assistants (Kim et al., 2006), or study authors (O’Connor et al., 2005).

Testing forms

A variety of information was provided about the testing documents used in 12 studies. Most often, authors reported counterbalancing alternate forms, passages, or the English and Spanish versions of tests (Allinder et al., 2001; Allor & McCathren, 2004; Calhoon, 2005; Faggella-Luby & Wardwell, 2011; Gerber et al., 2004; Mathes & Babyak, 2001; Wanzek & Roberts, 2012). In three studies, different forms were used at different testing points (Allor & McCathren, 2004; Cartledge et al., 2011; Hook et al., 2001), and one study reported using a set order in which one measure was always administered first and the other second in pre- and posttesting (Guthrie et al., 2009). When progress monitoring, Helf et al. (2009) described administering parallel forms of the subtests. The particular form used (e.g., Blue, A/B, G, short) was identified in five studies (Faggella-Luby & Wardwell, 2011; Lovett et al., 2008; Manset-Williamson & Nelson, 2005; Morris et al., 2012; Vaughn et al., 2006).

Students who were bilingual or limited English proficient might have been tested in both languages (Gerber et al., 2004; Vaughn et al., 2006) or only in English for standardized measures but in both English and Spanish with informal measures (Leafstedt et al., 2004). Gerber et al. (2004) also described delivering instruction in the students’ dominant language or in both English and Spanish if a dominant language could not be determined.

Other information provided by authors concerned the difficulty level of the form or passages administered. This was determined by teacher judgment in two studies (Allinder et al., 2001; Guthrie et al., 2009) and by the current grade placement of the student in two other studies (Calhoon et al., 2010; Faggella-Luby & Wardwell, 2011). Therrien et al. (2006) used passages at a slightly lower grade level than the students’ current placement. Occasionally, discontinue or branching rules were used such that student performance on one or more measures determined whether they received additional tests or test items (Leafstedt et al., 2004; Vadasy et al., 2005; Vaughn et al., 2006).

Testing time

Seven studies clearly identified the number of minutes or days it took to administer all measures. The time was described as mostly completed in three 20-minute sessions (Gerber et al., 2004), occurring over a 2-day period (Guthrie et al., 2009), separated by 1-day intervals in between each of three assessments (Joshi et al., 2002), and completed in one 90-minute session (Faggella-Luby & Wardwell, 2011). Sometimes the testing time varied for different points or conditions. For example, Faggella-Luby and Wardwell (2011) noted that a schedule issue resulted in the second treatment group being tested on “consecutive days within five days of post-testing” the first treatment and comparison groups (p. 41). This description suggests the 90 minutes might have been distributed for this treatment group. Other variations included a 3-hour total administration time for the full test battery but about 45 minutes for the partial battery used with different treatment conditions (Hook et al., 2001); pretesting for 95 to 120 minutes spread over 2 to 3 days per student, and posttesting for 60 minutes in one sitting per student (Spencer & Manis, 2010); or giving individually administered tests in the same sitting, but holding separate sessions for the group administered tests lasting about 30 minutes each (Ukrainetz et al., 2009).

Although the amount of testing time in other studies might be estimated from the standard administration prescribed in the technical manuals of the standardized measures, some researchers employed a large enough battery of assessments that it is reasonable to expect the administrations would have taken more than one session (e.g., Aaron et al., 2008; Denton et al., 2010; Hook et al., 2001; Lovett et al., 2008; Morris et al., 2012; Torgesen et al., 2001; Torgesen et al., 2010; Vaughn et al., 2006; Vaughn et al., 2011). In addition, some authors described making adaptations of published testing procedures or materials, which would render independent calculations of time imprecise (e.g., Denton et al., 2010; Guthrie et al., 2009; Ukrainetz et al., 2009).

Description of Testing Environment

Details coded for this feature of assessment fidelity included the location of testing and the potential for distractions during the administrations. Of the 11 studies providing at least some information, most indicated in general terms where the testing occurred: at the students’ school (Gerber et al., 2004; Leafstedt et al., 2004; Osborn et al., 2007), in the regular classroom (Guthrie et al., 2009), in the school library (Faggella-Luby & Wardwell, 2011), in a separate room or hallway (Cartledge et al., 2011), or in partitioned classrooms or the school library (Manset-Williamson & Nelson, 2005). In one study, the location of the progress monitoring was identified, a small tutoring classroom at the school, but not the setting for pre- and posttesting (Helf et al., 2009). These general descriptions do not allow for inferring the potential distractions in the environment, such as might be caused by having other students present who were not being tested or other activities occurring in the vicinity. Two studies attempted to account for distractions by utilizing the quietest available space (Gerber et al., 2004) or a secluded area somewhere in the school (Ukrainetz et al., 2009). An additional two studies described the testing environment as a quiet or secluded, distraction-free room (Calhoon, 2005; Calhoon et al., 2010).

Verification of Scoring

Four of the 46 studies in the corpus provided information on double scoring or interrater reliability of assessment scoring. All protocols or answer documents were checked and rechecked by different individuals during both the scoring and data entry stages of two studies (Gerber et al., 2004; Hudson et al., 2011). However, interrater agreement was not reported in either article. Randomly selected samples of testing documents (11% to 20%) were double scored in two other studies with interrater reliabilities reported as 99.7% (Faggella-Luby & Wardwell, 2011) and 99.6% (Ukrainetz et al., 2009). Ukrainetz et al. were the only authors of the 45 studies utilizing individually administered, standardized measures to report interrater agreement determined in double scoring assessment packets from those measures (as opposed to training protocols as done in the Wanzek et al., 2012, study). This rating does not include studies reporting interrater agreement of informal measures, which were not the primary focus of this narrative synthesis (e.g., Kim et al., 2006; Manset-Williamson & Nelson, 2005).
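As an illustration only, the double-scoring procedure described above (drawing a random share of testing documents and comparing two raters' scores) could be sketched as follows. The function names, the 20% default, and the use of simple item-level percent agreement are assumptions made for this sketch; the reviewed studies did not report how their agreement figures were computed.

```python
import random


def select_double_scoring_sample(protocol_ids, fraction=0.20, seed=0):
    """Randomly pick a share of protocols to be rescored by a second rater.

    `fraction` and `seed` are hypothetical parameters for this sketch.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ids = list(protocol_ids)
    rng = random.Random(seed)  # fixed seed so the draw can be documented
    n = max(1, round(len(ids) * fraction))
    return sorted(rng.sample(ids, n))


def percent_agreement(first_scores, second_scores):
    """Item-level percent agreement between two raters' scores."""
    if len(first_scores) != len(second_scores) or not first_scores:
        raise ValueError("score lists must be non-empty and equal in length")
    matches = sum(a == b for a, b in zip(first_scores, second_scores))
    return 100.0 * matches / len(first_scores)
```

Percent agreement is used here because it is the simplest index to compute and report; a chance-corrected statistic could be substituted without changing the sampling step.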

Despite the paucity of information on score verification, a number of studies described scoring procedures that might be prone to error such as generating composite scores from a number of tests or subtests (e.g., Bhat et al., 2003; Morris et al., 2012; Osborn et al., 2007; Vaughn et al., 2006). In addition, Spencer and Manis (2010) described removing an outlier from their analysis because “due to a scoring anomaly, his pretest data was [sic] found to grossly underestimate his reading ability, causing inflated gain scores at posttesting” (p. 80). Two studies reported using a computer program to sum raw scores and convert the scores from individual subtests to age-based standard scores (Calhoon, 2005; Calhoon et al., 2010).

Discussion

In examining the extent to which assessment fidelity is reported in K–8 reading intervention research, we iteratively coded 46 studies meeting inclusion criteria and employed a systematic process for extracting and categorizing relevant information. Unfortunately, none of the studies we reviewed reported sufficient information for determining whether all screening, pre-, interim, and postmeasures were administered and scored as intended. We wish to emphasize this does not mean that insufficient training, monitoring, blinding, calibrating, or double scoring occurred. Rather, our coding revealed little assessment fidelity data were described by the study authors, thus precluding us from drawing conclusions about whether tests were administered as intended consistently. As has been previously suggested, the absence of integrity data may be related to journals placing a lower priority on such information or having limited space to publish it (Moncher & Prinz, 1991; Perepletchikova, Treat, & Kazdin, 2007).

It is interesting to note that only 11 of the 46 studies (24%) included in this review did not overtly state whether implementation fidelity was monitored (Aaron et al., 2008; Allinder et al., 2001; Berninger et al., 2006; Bhat et al., 2003; Hook et al., 2001; Leafstedt et al., 2004; O’Connor et al., 2005; Osborn et al., 2007; Ritter & Saxon, 2011; Santoro et al., 2006; Torgesen et al., 2001). In the decade since the passing of the No Child Left Behind Act of 2001, the field seems to have acknowledged the importance of accounting for the delivery of treatments as intended; however, that concern has not extended to the tests administered to evaluate the effects of those treatments. Rather, there appears to be an assumption that the measures are always used in an error-free fashion. Unfortunately, several characteristics of the studies reviewed would suggest this might not be a safe supposition. Because our findings indicate there is a potential for construct-irrelevant variance in the measurement of study participants, we discuss the results within a framework derived from Haladyna and Downing’s (2005) taxonomy of construct-irrelevant sources of variance.

Unethical Test Preparation

Fifteen studies in our corpus utilized at least some extant data from state assessments, and one study (Ritter & Saxon, 2011) only used state-required assessments for determining treatment effects. This practice is likely efficient for both the school sites and researchers in that it leverages available data from valid and reliable instruments and reduces the amount of lost instructional time for student participants. But this practice has the unfortunate side effect of rendering the researchers unable to account for the conditions under which those data were collected. If the business-as-usual condition in the schools involved teaching to the test (Pedulla et al., 2003; Tanner, 2013) or practices that otherwise compromised test security (Wollack & Fremer, 2013), the researchers would unknowingly be making flawed decisions about participant eligibility or outcomes.

Test Administration

Few data were reported about the individuals who were administering the measures to students, their training and expertise, how many were involved in testing, how they may have been monitored, the duration of all test administrations, or whether examiners’ training was refreshed in an extended testing timeframe. Only five studies did not use measures (e.g., Kaufman Brief Intelligence Test; Wechsler Intelligence Scale for Children; Woodcock Johnson Reading Mastery Test; Woodcock Johnson Tests of Achievement) that require documentation of professional preparation (Faggella-Luby & Wardwell, 2011; Helf et al., 2009; Nelson et al., 2005; Ritter & Saxon, 2011; Ukrainetz et al., 2009). That is not to suggest the study authors were not capable of adequately preparing their examiners, but that more onus might be placed on the rigor of training provided. After all, previous research has found students in graduate psychology, diagnostician, and counseling programs have difficulty achieving precision in administering these measures (Loe et al., 2007; Ramos et al., 2009).

Moreover, most studies reviewed here administered a large number of standardized tests and subtests (range = 1–29; median = 7) and may have also included informal measures (range = 1–6; median = 2). Nine studies used alternate forms, passages, or language versions in some sequence or with some students (Allinder et al., 2001; Allor & McCathren, 2004; Calhoon, 2005; Cartledge et al., 2011; Faggella-Luby & Wardwell, 2011; Gerber et al., 2004; Hook et al., 2001; Mathes & Babyak, 2001; Wanzek & Roberts, 2012), and three studies included discontinue or branching rules that would have altered the testing battery from student to student (Leafstedt et al., 2004; Vadasy et al., 2005; Vaughn et al., 2006). These procedures would have added to the complexity of learning and accurately administering the measures employed.

Despite how many rules and procedures testers would have needed to learn, only four studies specifically stated that practice sessions were held (O’Connor et al., 2005; Vadasy et al., 2005; Vernon-Feagans et al., 2010; Wanzek & Roberts, 2012), and only Wanzek and Roberts (2012) offered the reliability standard (90%) that testers had to meet in their preparation. In no studies was there an indication that the reliability of testers was checked during testing to ensure they had not deviated from protocol. O’Connor et al. (2005) verified examiners’ first two administrations, but not again during the testing of more than 400 students. Three studies acknowledged supervising or observing the test administrations (Gerber et al., 2004; Ukrainetz et al., 2009; Vadasy et al., 2005) but not for the purposes of documenting integrity as might be done with implementation of interventions (Gersten et al., 2005). The potential for deviations to occur over the duration of implementation has been noted for intervention research, even when having demonstrated initial proficiency in the treatment (O’Donnell, 2008; Perepletchikova & Kazdin, 2005); thus, it is also reasonable to assume deviations might happen when administering assessments to many students over hours, days, and weeks. In fact, Waterman et al. (2011) found an increase in assessor variance over successive administrations of the same tests.

The total number of students in a given study was as high as 576 (Vaughn et al., 2010), with a median of 89 students. Rarely was it possible to determine how many students each examiner tested over what duration. From what was reported, the lowest tester–student ratio was 1:8 for five individually administered standardized measures (Case et al., 2010). The highest calculable ratio was 1:102 at pretest and 1:136 at posttest with three individually administered standardized measures, three individually administered informal measures, and one group administered standardized measure (O’Connor et al., 2005). Hence, an examiner who was reliable at the start of the testing might have made small lapses or in-the-moment decisions that gradually and cumulatively affected the way the measures were used. Significant variance in scores has been found with even small changes to how directions and prompts are delivered (Colón & Kranzler, 2006; Reed & Petscher, 2012) or how much time is provided to students (Derr-Minneci & Shapiro, 1992), so ongoing fidelity monitoring and training refreshers might help to reduce systematic error.
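One way the ongoing fidelity monitoring suggested above might be operationalized is sketched below. It assumes, hypothetically, that an observer completes a checklist of administration steps during selected sessions and that the 90% standard reported for tester preparation by Wanzek and Roberts (2012) is reused as the in-window criterion. The record layout and function name are illustrative and do not come from any of the reviewed studies.

```python
CRITERION = 0.90  # assumption: reuse of the 90% training standard as a monitoring cut point


def flag_examiners_for_refresher(observations, criterion=CRITERION):
    """Return examiners with at least one observed session below the criterion.

    `observations` is an iterable of (examiner, steps_correct, steps_total)
    tuples, one tuple per observed test administration (hypothetical layout).
    """
    flagged = set()
    for examiner, steps_correct, steps_total in observations:
        if steps_total <= 0 or not 0 <= steps_correct <= steps_total:
            raise ValueError("invalid checklist counts")
        # Judge each session separately so that later drift is not masked
        # by strong performance early in the testing window.
        if steps_correct / steps_total < criterion:
            flagged.add(examiner)
    return sorted(flagged)
```

Each session is evaluated on its own rather than pooled, because the concern raised in the text is gradual deviation over many administrations.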

Recommendations for ensuring the quality of a data set have not only included adherence to standardized procedures, but also the elimination of environmental distractions (Christ et al., 2012). Three of the 46 studies referred to using a secluded or distraction-free location (Calhoon, 2005; Calhoon et al., 2010; Ukrainetz et al., 2009), and one described the environment as the quietest space available (Gerber et al., 2004). This choice of words reflects the reality of conducting research in natural school settings where testing has to occur in the best of what might be less than ideal location options. All but one study reviewed here utilized individually administered assessments that would necessitate hearing the student clearly in order to score responses (Faggella-Luby & Wardwell, 2011). Having a quiet space would be particularly important but may not be within the researchers’ ability to control. As an issue affecting the quality of the data to be analyzed, it warrants more attention when preparing for a study and negotiating the logistics with the school personnel.

Test Scoring

The nearly universal employment of individually administered assessments across the corpus also has implications for scoring errors, which have reportedly affected as much as 91% of a data set (Reed & Sturges, 2012). Although four studies explicitly described double scoring all (Gerber et al., 2004; Hudson et al., 2011) or a subset of testing documents (Faggella-Luby & Wardwell, 2011; Ukrainetz et al., 2009), none stated that raters were calibrated within the timeframe they were completing their work. Scoring in situ has been problematic for the types of measures commonly used in the studies reviewed due, in part, to the complexity of the scoring rules to be followed (Charter et al., 2000; Loe et al., 2007; Ramos et al., 2009). Similarly, significant within-rater variability has been found when scoring essays over time (Lamprianou, 2006; Myford & Wolf, 2009) and for raters with different levels of experience (Eckes, 2008; Leckie & Baird, 2011). Periodic calibration of scorers and double scoring 100% of testing documents would better protect the quality of the data used to determine participant eligibility and evaluate treatment effects.

Individual Cheating

A quality indicator for intervention research that is directly related to assessment fidelity is utilizing testers who are blind to the study participants’ conditions (Gearing et al., 2011; Gersten et al., 2005). This practice is to safeguard against bias, particularly when individually administering measures and hand scoring documents. Based on the role of testers in the studies, it was apparent they were often aware which students were in which treatment groups. This does not necessarily mean the testers were biased, but it would place greater importance on monitoring assessment fidelity.

Implications

The low reporting of assessment fidelity in the extant literature was not unexpected given the general lack of integrity data in social science research (Gearing et al., 2011; O’Donnell, 2008). Additionally, assessment fidelity has not been emphasized in efforts since 2001 to improve the rigor of educational research (e.g., National Research Council, 2002), in spite of awareness that reliability of tests in the social sciences is situational rather than an invariant property. Within our corpus, testing integrity (or testing variability) might offer an explanation for, or possible insights into, the scoring anomaly found by Spencer and Manis (2010) on one of the eight individually administered subtests of a student’s oral reading.

Variability in measure administration also might present an alternative explanation for why Vaughn et al. (2010) detected only small effects that were more apparent in particular subgroups (e.g., implementation site, participant age, or participant pretest levels) after their year-long, comprehensive intervention. This study included the largest number of participants (n = 576) and administered 11 standardized and two informal measures. The authors attributed the small effects to issues such as the provision of a primary intervention or the time it takes for adolescents to improve their reading abilities. However, the complexity of using so many outcome measures, with the potential for deviations to testing procedures over time, multiplied across the sheer number of students each examiner assessed could have threatened the reliability of the assessments.

Stockard (2010) argued that lower and higher levels of fidelity do not average out in the aggregate but, rather, “produce results that are systematically biased” (p. 9). Thus, a lack of assessment fidelity obscures the true value of the effect size such that “with poor implementation, the good programs would be less good and bad programs would be less bad” (Stockard, 2010, p. 10). Although it is possible that tester variance merely affected findings for the studies in this corpus, results from a recent study indicate that a portion (i.e., 16%) of the observed variability around students’ true scores is attributable to examiners (Cummings, Biancarosa, Schaper, & Reed, 2013). Hence, it is our position that data on assessment fidelity should be reported and considered when interpreting results of research. Doing so would document evidence of experimental control and add to the credibility of findings.
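To make the direction of this bias concrete, the following minimal sketch shows how examiner-related noise would shrink a standardized mean difference. It rests on strong assumptions that are ours, not the cited authors': the examiner share is read as a proportion of total observed score variance, examiner error is random and unrelated to treatment condition, and the treatment shifts only the mean. Under those assumptions the observed standard deviation is inflated by a factor of 1 / sqrt(1 - share), so the effect size is multiplied by sqrt(1 - share). The function name is hypothetical.

```python
import math


def attenuated_effect_size(effect_size, examiner_variance_share):
    """Effect size expected once examiner noise inflates the observed SD.

    effect_size: standardized mean difference with no examiner variance.
    examiner_variance_share: assumed proportion (0 <= share < 1) of total
        observed score variance that is attributable to examiners.
    """
    if not 0.0 <= examiner_variance_share < 1.0:
        raise ValueError("share must be in [0, 1)")
    # Removing the examiner share leaves (1 - share) of the observed variance,
    # so d_observed = d_clean * sqrt(1 - share).
    return effect_size * math.sqrt(1.0 - examiner_variance_share)
```

For example, under these assumptions an effect of 0.50 with a 16% examiner share would be observed as roughly 0.46, which is consistent with the claim that good programs would appear less good.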

Limitations and Directions for Future Research

To conduct our review, we had to develop a coding system based on features that reasonably could be expected to influence the accurate administration and scoring of measures. As the first instance of doing this type of synthesis, there was not an existing standard for assessment fidelity we could apply. Moreover, information in which we were interested was not well contained in one section of the articles or well labeled when present. These realities made coding difficult, so we proceeded iteratively and systematically. That noted, developing clear standards for reporting testing integrity will improve the reliability of future attempts to examine the presence, quality, and impact of the information.

This synthesis relied on what was reported in published findings from K–8 reading intervention research, which does not necessarily capture the universe of assessment fidelity safeguards actually used. Because implementation fidelity is often not considered when determining treatment effectiveness (Stockard, 2010; Swanson et al., 2011), there likely is little impetus for consuming limited space in an article with the minutiae of testing procedures. Future research might directly query researchers to determine what kinds of training, monitoring, calibrating, and data checking or double scoring were enacted as well as what kinds of problems, if any, were experienced. Studies might also explore whether aspects of training and administration can be manipulated to improve assessment fidelity in both highly controlled research studies and more naturalistic environments where teachers test their own students for instructional purposes. Ultimately, protecting testing integrity should help us better understand students’ reading performance and make more accurate and precise decisions about their responsiveness to reading intervention.

Authors

DEBORAH K. REED, PhD, holds a joint appointment with the School of Teacher Education and the Florida Center for Reading Research at Florida State University, 2010 Levy Ave., Ste 100, Tallahassee, FL 32308; e-mail: dkreed@fcrr.org. Her research interests include the use of reading assessments in data-based decision making and effective reading instruction for vulnerable populations such as adolescent English language learners and those in the juvenile justice system.

KELLI D. CUMMINGS, PhD, NCSP, is a faculty research associate at the University of Oregon, Center on Teaching and Learning, 5292 University of Oregon, Eugene, OR 97403; e-mail: kellic@uoregon.edu. Her research focuses on projects that link assessment and intervention technologies to improve student outcomes. She has also provided technical assistance and training on problem-solving, response-to-intervention, and the use of curriculum-based measurement for schools in the United States, Canada, and Great Britain.

ANDREW SCHAPER began his career in education as a middle and high school English teacher in San Francisco, which influences his applied research interests in school improvement. His methodological research interests center on multilevel statistical modeling techniques and incorporating Bayesian estimation into the modeling of school predictors on student outcomes. He is a graduate research fellow at the Center on Teaching and Learning, 5292 University of Oregon, Eugene, OR 97403; e-mail: schaper@uoregon.edu.

GINA BIANCAROSA, EdD, is an assistant professor in the Department of Educational Methodology, Policy, and Leadership at the University of Oregon’s College of Education, Lokey Education Building, Room 102R, Eugene, OR 97403; e-mail: ginab@uoregon.edu. Her research focuses on measurement in reading, including oral reading fluency.

References

*Aaron, P. G., Joshi, M. R., Gooden, & Bentum, K. E. (2008). Diagnosis and treatment of reading disabilities based on the component model of reading: An alternative to the discrepancy model of LD. Journal of Learning Disabilities, 41, 67–84. doi:10.1177/0022219407310838

*Allinder, R. M., Dunse, Brunken, C. D., & Obermiller-Krolikowski, H. J. (2001). Improving fluency in at-risk readers and students with learning disabilities. Remedial and Special Education, 22, 48–54.

Allor, J. H., Mathes, P. G., Roberts, J. K., Cheatham, J. P., & Champlin, T. M. (2010). Comprehensive reading instruction for students with intellectual disabilities: Findings from the first three years of a longitudinal study. Psychology in the Schools, 47, 445–466. doi:10.1002/pits.20482

*Allor, J. H., & McCathren (2004). The efficacy of an early literacy tutoring program implemented by college students. Learning Disabilities Research & Practice, 19, 116–129.

Amrein-Beardsley, Berliner, D. C., & Rideau (2010). Cheating in the first, second, and third degree: Educators’ responses to high-stakes testing. Education Policy Analysis Archives, 18(14), 1–36. Retrieved from http://www.eric.ed.gov/PDFS/EJ895618.pdf

Baker (2013, May 10). New error found in scoring of test for gifted programs. New York Times. Retrieved from http://www.nytimes.com/2013/05/11/education/new-error-found-in-test-scoring-for-gifted-programs.html

Benner, G. J., Nelson, J. R., Stage, S. A., & Ralston, N. C. (2011). The influence of fidelity of implementation on the reading outcomes of middle school students experiencing reading difficulties. Remedial and Special Education, 32, 79–88.

*Berninger, V. W., Abbott, R. D., Vermeulen, & Fulton, C. M. (2006). Paths to reading comprehension in at-risk second-grade readers. Journal of Learning Disabilities, 39, 334–351.

*Bhat, Griffin, C. C., & Sindelar, P. T. (2003). Phonological awareness instruction for middle school students with learning disabilities. Learning Disability Quarterly, 26, 73–87.

*Calhoon, M. B. (2005). Effects of a peer-mediated phonological skill and reading comprehension program on reading skill acquisition for middle school students with reading disabilities. Journal of Learning Disabilities, 38, 424–433.

Calhoon, M. B., Otaiba, S. A., Cihak, King, & Avalos (2007). Effects of a peer-mediated program on reading skill acquisition for two-way bilingual first-grade classrooms. Learning Disability Quarterly, 30, 169–184.

*Calhoon, M. B., Sandow, & Hunter, C. V. (2010). Reorganizing the instructional reading components: Could there be a better way to design remedial reading programs to maximize middle school students with reading disabilities’ response to treatment? Annals of Dyslexia, 60, 57–85. doi:10.1007/s11881-009-0033-x

*Cartledge, Yurick, Singh, A. H., Keyes, S. E., & Kourea (2011). Follow-up study of the effects of a supplemental early reading intervention on the reading/disability risk of urban primary learners. Exceptionality: A Special Education Journal, 19, 140–159. doi:10.1080/09362835.2011.562095

*Case, L. P., Speece, D. L., Silverman, Ritchey, K. D., Schatschneider, Cooper, D. H., . . . Jacobs (2010). Validation of a supplemental reading intervention for first-grade children. Journal of Learning Disabilities, 43, 402–417. doi:10.1177/0022219409355475

Century, Freeman, & Rudnick (2008, March). A framework for measuring and accumulating knowledge about fidelity of implementation (FOI) of science instructional materials. Chicago, IL: University of Chicago. Retrieved from http://cemse.uchicago.edu/research-and-evaluation/research/foi/narst-framework.pdf

Charter, R. A., Walden, D. K., & Padilla, S. P. (2000). Too many simple scoring errors: The Rey figure as an example. Journal of Clinical Psychology, 56, 571–574. doi:10.1002/(SICI)1097-4679(200004)56:4<571::aid-jclp10>3.0.co;2-6

Christ, T. J., Zopluoglu, Long, & Monaghen (2012). Curriculum-based measurement of oral reading: Quality of progress monitoring outcomes. Exceptional Children, 78, 356–373.

Colón, E. P., & Kranzler, J. H. (2006). Effect of instructions on curriculum-based measurement of reading. Journal of Psychoeducational Assessment, 24, 318–328. doi:10.1177/0734282906287830

Cummings, Biancarosa, Schaper, & Reed, D. K. (2013). Examiner error in curriculum-based measurement of oral reading. Manuscript submitted for publication.

Dane, A. V., & Schneider, B. H. (1998). Program integrity in primary and early secondary prevention: Are implementation effects out of control? Clinical Psychology Review, 18, 23–35.

*Denton, C. A., Nimon, Mathes, P. G., Swanson, E. A., Ketheley, Kurz, T. B., & Shih (2010). Effectiveness of a supplemental early reading intervention scaled up in multiple schools. Exceptional Children, 76, 394–416.

*Denton, C. A., Wexler, Vaughn, & Bryan (2008). Intervention provided to linguistically diverse middle school students with severe reading difficulties. Learning Disabilities Research & Practice, 23, 79–89.

Derr, T. F., & Shapiro, E. S. (1989). A behavioral evaluation of curriculum-based assessment of reading. Journal of Psychoeducational Assessment, 7, 148–160. doi:10.1177/073428298900700205

Derr-Minneci, T. F., & Shapiro, E. S. (1992). Validating curriculum-based measurement in reading from a behavioral perspective. School Psychology Quarterly, 7, 2–16. doi:10.1037/h0088244

DiCecco, V. M., & Gleason, M. M. (2002). Using graphic organizers to attain relational knowledge from expository texts. Journal of Learning Disabilities, 35, 306–310.

Eckes (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25, 155–185. doi:10.1177/0265532207086780

*Faggella-Luby, & Wardwell (2011). RTI in a middle school: Findings and practical implications of a tier 2 reading comprehension study. Learning Disability Quarterly, 34, 35–49.

Flynn, L. J., Zheng, & Swanson, H. L. (2012). Instructing struggling older readers: A selective meta-analysis of intervention research. Learning Disabilities Research & Practice, 27, 21–32. doi:10.1111/j.1540-5826.2011.00347.x

Foorman, B. R., & Moats, L. C. (2004). Conditions for sustaining research-based practices in early reading instruction. Remedial and Special Education, 25(1), 51–60.

Gearing, R. E., El-Bassel, Ghesquiere, Baldwin, Gillies, & Ngeow (2011). Major ingredients of fidelity: A review and scientific guide to improving quality of intervention research implementation. Clinical Psychology Review, 31, 79–88. doi:10.1016/j.cpr.2010.09.007

*Gerber, Jimenez, Leafstedt, Villacruz, Richards, & English (2004). English reading effects of a small-group intensive intervention in Spanish for K–1 English learners. Learning Disabilities Research & Practice, 19, 239–251.

32. Gersten, Fuchs, L. S., Compton, Coyne, Greenwood, & Innocenti, M. S. (2005). Quality indicators for group experimental and quasi-experimental research in special education. Exceptional Children, 71, 149–164.

33. Good, R. H., & Kaminski, R. A. (2002). Dynamic indicators of basic early literacy skills (6th ed.). Eugene, OR: Institute for the Development of Educational Achievement.

34. Gresham, F. M., MacMillan, D. L., Beebe-Frankenberger, M. E., & Bocian, K. M. (2000). Treatment integrity in learning disabilities intervention research: Do we really know how treatments are implemented? Learning Disabilities Research & Practice, 15, 198–205.

35. *Guthrie, J. T., McRae, Coddington, C. S., Klauda, S. L., Wigfield, & Barbosa (2009). Impacts of comprehensive reading instruction on diverse outcomes of low- and high-achieving readers. Journal of Learning Disabilities, 42, 195–214. doi:10.1177/0022219408331039

36. Haladyna, T. M., & Downing, S. M. (2005). Construct-irrelevant variance in high-stakes testing. Educational Measurement: Issues and Practice, 23, 17–27. doi:10.1111/j.1745-3992.2004.tb00149.x

37. *Helf, Cooke, N. L., & Flowers, C. P. (2009). Effects of two grouping conditions on students who are at risk for reading failure. Preventing School Failure, 53, 113–126.

38. *Hook, P. E., Macaruso, & Jones (2001). Efficacy of Fast ForWord training on facilitating acquisition of reading skills by children with reading difficulties: A longitudinal study. Annals of Dyslexia, 51, 75–96.

39. *Hudson, R. F., Isakson, Richman, Lane, H. B., & Arriaza-Allen (2011). An examination of a small-group decoding intervention for struggling readers: Comparing accuracy and automaticity criteria. Learning Disabilities Research & Practice, 26, 15–27.

40. *Joshi, M. R., Dahlgren, & Boulware-Gooden (2002). Teaching reading in an inner city school through a multisensory teaching approach. Annals of Dyslexia, 52, 229–242.

41. *Kamps, Abbott, Greenwood, Arreaga-Mayer, Wills, Longstaff, . . . Walton (2007). Use of evidence-based, small group reading instruction for English language learners in elementary grades: Secondary-tier intervention. Learning Disability Quarterly, 30, 153–168.

42. *Kim, Vaughn, Klingner, J. K., Woodruff, A. L., Reutebuch, C. K., & Kouzekanani (2006). Improving the reading comprehension of middle school students with disabilities through computer-assisted collaborative strategic reading. Remedial and Special Education, 27, 235–249.

43. Krippendorff (2004). Content analysis: An introduction to its methodology (2nd ed.). Thousand Oaks, CA: Sage.

44. Lamprianou (2006). The stability of marker characteristics across tests of the same subject and across subjects. Journal of Applied Measurement, 7, 192–205.

45. *Leafstedt, J. M., Richards, C. R., & Gerber, M. M. (2004). Effectiveness of explicit phonological-awareness instruction for at-risk English learners. Learning Disabilities Research & Practice, 19, 252–261.

46. Leckie & Baird, J. A. (2011). Rater effects on essay scoring: A multilevel analysis of severity drift, central tendency, and rater experience. Journal of Educational Measurement, 48, 399–418. doi:10.1111/j.1745-3984.2011.00152.x

47. Loe, S. A., Kadlubek, R. M., & Marks, W. J. (2007). Administration and scoring errors on the WISC-IV among graduate student examiners. Journal of Psychoeducational Assessment, 25, 237–247. doi:10.1177/0734282906296505

48. *Lovett, M. W., De Palma, Frijters, Steinbach, Temple, Benson, & Lacerenza (2008). Interventions for reading difficulties: A comparison of response to intervention by ELL and EFL struggling readers. Journal of Learning Disabilities, 41, 333–352. doi:10.1177/0022219408317859

49. *Manset-Williamson & Nelson, J. M. (2005). Balanced, strategic reading instruction for upper-elementary and middle school students with reading disabilities: A comparative study of two approaches. Learning Disability Quarterly, 28, 59–74.

50. *Mathes, P. G., & Babyak, A. E. (2001). The effect of peer-assisted literacy strategies for first-grade readers with and without additional mini-skills lessons. Learning Disabilities Research & Practice, 16, 28–44.

51. McIntyre, L. L., Gresham, F. M., DiGennaro, F. D., & Reed, D. D. (2007). Treatment integrity of school-based interventions with children in the Journal of Applied Behavior Analysis 1991–2005. Journal of Applied Behavior Analysis, 40, 659–672.

52. Metz (2007, September 1). Florida will omit vital NCLB information due to scoring error. Heartland Institute Newsletter. Retrieved from http://news.heartland.org/newspaper-article/2007/09/01/florida-will-omit-vital-nclb-information-due-scoring-error

53. Moncher, F. J., & Prinz, R. J. (1991). Treatment fidelity in outcome studies. Clinical Psychology Review, 11, 247–266.

54. *Morris, R. D., Lovett, M. W., Wolf, Sevcik, R. A., Steinbach, K. A., Frijters, J. C., & Shapiro, M. B. (2012). Multiple-component remediation for developmental reading disabilities: IQ, socioeconomic status, and race as factors in remedial outcome. Journal of Learning Disabilities, 45, 99–127. doi:10.1177/0022219409355472

55. Myford, C. M., & Wolfe, E. W. (2009). Monitoring rater performance over time: A framework for detecting differential accuracy and differential scale category use. Journal of Educational Measurement, 46, 371–389. doi:10.1111/j.1745-3984.2009.00088.x

56. National Research Council. (2002). Scientific research in education: Committee on scientific principles for education research (Shavelson, R. J., & Towne, Eds.). Washington, DC: National Academies Press.

57. *Nelson, J. R., Benner, G. J., & Gonzalez (2005). An investigation of the effects of a pre-reading intervention on the early literacy skills of children at risk of emotional disturbances and reading problems. Journal of Emotional and Behavioral Disorders, 13, 3–12.

58. Nelson, J. M., & Manset-Williamson (2006). The impact of explicit, self-regulatory reading comprehension strategy instruction on the reading-specific self-efficacy, attributions, and affect of students with reading disabilities. Learning Disability Quarterly, 29, 213–230.

59. No Child Left Behind Act of 2001, 20 U.S.C. § 6319 (2008).

60. Nolen, S. B., Haladyna, T. M., & Haas, N. S. (1992). Uses and abuses of achievement test scores. Educational Measurement: Issues and Practice, 11, 9–15. doi:10.1111/j.1745-3992.1992.tb00234.x

61. *O’Connor, R. E., Fulmer, Harty, K. R., & Bell, K. M. (2005). Layers of reading intervention in kindergarten through third grade: Changes in teaching and student outcomes. Journal of Learning Disabilities, 38, 440–455.

62. O’Donnell, C. L. (2008). Defining, conceptualizing, and measuring fidelity of implementation and its relationship to outcomes in K–12 curriculum intervention research. Review of Educational Research, 78, 33–84. doi:10.3102/0034654307313793

63. *Osborn, Freeman, Burley, Wilson, Jones, & Rychener (2007). Effect of tutoring on reading achievement for students with cognitive disabilities, specific learning disabilities, and students receiving title 1 services. Education and Training in Developmental Disabilities, 42, 467–474.

64. *Oudeans, M. K. (2003). Integration of letter-sound correspondences and phonological awareness skills of blending and segmenting: A pilot study examining the effects of instructional sequence on word reading for kindergarten children with low phonological awareness. Learning Disability Quarterly, 26, 258–280.

65. Pedulla, J. J., Abrams, L. M., Madaus, G. F., Russell, M. K., Ramos, M. A., & Miao (2003, March). Perceived effects of state-mandated testing programs on teaching and learning: Findings from a national survey of teachers. Boston, MA: Boston College, National Board on Educational Testing and Public Policy. Retrieved from http://www.bc.edu/research/nbetpp/statements/nbr2.pdf

66. Perepletchikova & Kazdin, A. E. (2005). Treatment integrity and therapeutic change: Issues and research recommendations. Clinical Psychology: Science and Practice, 12, 365–383. doi:10.1093/clipsy.bpi045

67. Perepletchikova, Treat, T. A., & Kazdin, A. E. (2007). Treatment integrity in psychotherapy research: Analysis of studies and examination of associated factors. Journal of Consulting and Clinical Psychology, 75, 829–841.

68. Ramos, Alfonso, V. C., & Schermerhorn, S. M. (2009). Graduate students’ administration and scoring errors on the Woodcock-Johnson III tests of cognitive abilities. Psychology in the Schools, 46, 650–657. doi:10.1002/pits.20405

69. Reed, D. K., & Petscher (2012). The influence of testing prompt and condition on middle school students’ retell performance. Reading Psychology, 33, 562–585. doi:10.1080/02702711.2011.557333

70. Reed, D. K., & Sturges, K. M. (2012). An examination of assessment fidelity in the administration and interpretation of reading tests. Remedial and Special Education, 34, 259–268. doi:10.1177/0741932512464580

71. *Ritter, M. J., & Saxon, T. F. (2011). Classroom-based phonological sensitivity intervention (PSI) using a narrative platform: An experimental study of first graders at risk for a reading disability. Communication Disorders Quarterly, 33, 3–12. doi:10.1177/1525740109356800

72. Romano (2006, March 24). College Board acknowledges more SAT scoring errors. The Washington Post. Retrieved from http://www.washingtonpost.com/wp-dyn/content/article/2006/03/23/AR2006032301655.html

73. Sacks (2000). Standardized minds: The high price of America’s testing culture and what we can do to change it. Cambridge, MA: Perseus.

74. *Santoro, L. E., Coyne, M. D., & Simmons, D. C. (2006). The reading-spelling connection: Developing and evaluating a beginning spelling intervention for children at risk of reading disability. Learning Disabilities Research & Practice, 21, 122–133.

75. *Simmons, D. C., Kame’enui, E. J., Harn, Coyne, M. D., Stoolmiller, Santoro, L. E., . . . Kaufman, N. K. (2007). Attributes of effective and efficient kindergarten reading intervention: An examination of instructional time and design specificity. Journal of Learning Disabilities, 40, 331–347.

76. *Spencer, S. A., & Manis, F. R. (2010). The effects of a fluency intervention program on the fluency and comprehension outcomes of middle school students with severe reading deficits. Learning Disabilities Research & Practice, 25, 76–86.

77. Stein, Berends, Fuchs, McMaster, Saenz, Yen, & Compton (2008). Scaling up an early reading program: Relationships among teacher support, fidelity of implementation, and student performance across different sites and years. Educational Evaluation and Policy Analysis, 30, 368–388.

78. Stockard (2010). An analysis of the fidelity implementation policies of the What Works Clearinghouse. Current Issues in Education, 13(4), 1–24. Retrieved from http://cie.asu.edu/

79. Swanson, Wanzek, Haring, Ciullo, & McCulley (2011). Intervention fidelity in special and general education research journals. Journal of Special Education, 47, 3–13. doi:10.1177/0022466911419516

80. Tanner (2013). Race to the top and leave the children behind. Journal of Curriculum Studies, 45, 4–15. doi:10.1080/00220272.2012.754946

81. *Therrien, W. J., Wickstrom, & Jones (2006). Effect of a combined repeated reading and question generation intervention on reading achievement. Learning Disabilities Research & Practice, 21, 89–97.

82. *Thompson, S. L., & Davis, P. H. (2002). Supplemental reading instruction for students at risk for reading disabilities: Improve reading 30 minutes at a time. Learning Disabilities Research & Practice, 17, 242–251.

83. *Torgesen, J. K., Alexander, A. W., Wagner, R. K., Rashotte, C. A., Voeller, K. S., & Conway (2001). Intensive remedial instruction for children with severe reading disabilities. Journal of Learning Disabilities, 34, 33–58.

84. Torgesen, J. K., Wagner, R. K., & Rashotte, C. A. (1999). Test of word reading efficiency. San Antonio, TX: PRO-ED.

85. *Torgesen, J. K., Wagner, R. K., Rashotte, C. A., Herron, & Lindamood (2010). Computer-assisted instruction to prevent early reading difficulties in students at risk for dyslexia: Outcomes from two instructional approaches. Annals of Dyslexia, 60, 40–56. doi:10.1007/s11881-009-0032-y

86. Towne, Wise, L. L., & Winters, T. M. (Eds.). (2005). Advancing scientific research in education. Washington, DC: National Academies Press.

87. Tran, Sanchez, Arellano, & Swanson, H. L. (2011). A meta-analysis of the RTI literature for children at risk for reading disabilities. Journal of Learning Disabilities, 44, 283–295. doi:10.1177/0022219410378447

88. U.S. Department of Education. (2013). Testing integrity symposium: Issues and recommendations for best practice. Washington, DC: Institute of Education Sciences and National Center for Education Statistics. Retrieved from http://nces.ed.gov/pubs2013/2013454.pdf

89. *Ukrainetz, T. A., Ross, C. L., & Harm, H. M. (2009). An investigation of treatment scheduling for phonemic awareness with kindergartners who are at risk for reading difficulties. Language, Speech, and Hearing Services in Schools, 40, 86–100. doi:10.1044/0161-1461(2008/07-0077)

90. *Vadasy, P. F., Sanders, E. A., & Peyton, J. A. (2005). Relative effectiveness of reading practice or word-level instruction in supplemental tutoring: How text matters. Journal of Learning Disabilities, 38, 364–380.

91. *Vadasy, P. F., Sanders, E. A., Peyton, J. A., & Jenkins, J. R. (2002). Timing and intensity of tutoring: A closer look at the conditions for effective early literacy tutoring. Learning Disabilities Research & Practice, 17, 227–241.

92. *Vaughn, Cirino, Wanzek, Wexler, Fletcher, J. M., Denton, C. A., . . . Francis (2010). Response to intervention for middle school students with reading difficulties: Effects of a primary and secondary intervention. School Psychology Review, 39, 3–21.

93. Vaughn, Roberts, Klingner, J. K., Swanson, E. A., Boardman, Stillman-Spisak, S. J., . . . Leroux, A. J. (2013). Collaborative strategic reading: Findings from experienced implementers. Journal of Research on Educational Effectiveness, 6, 137–163. doi:10.1080/19345747.2012.741661

94. *Vaughn, Thompson, S. L., Mathes, P. G., Cirino, Carlson, C. D., Pollard-Durodola, S. D., . . . Francis, D. J. (2006). Effectiveness of Spanish intervention for first-grade English language learners at risk for reading difficulties. Journal of Learning Disabilities, 39, 56–73.

95. *Vaughn, Wexler, Roberts, Barth, A. A., Cirino, P. T., Romain, M. A., . . . Denton, C. A. (2011). Effects of individualized and standardized interventions on middle school students with reading disabilities. Exceptional Children, 77, 391–407.

96. *Vernon-Feagans, Gallagher, Ginsberg, M. C., Amendum, Kainz, Rose, & Burchinal (2010). A diagnostic teaching intervention for classroom teachers: Helping struggling readers in early elementary school. Learning Disabilities Research & Practice, 25, 183–193.

97. *Wanzek & Roberts (2012). Reading interventions with varying instructional practices emphases for fourth graders with reading difficulties. Learning Disability Quarterly, 35, 90–101. doi:10.1177/0731948711434047

98. *Wanzek, Vaughn, Roberts, & Fletcher, J. M. (2011). Efficacy of a reading intervention for middle school students with learning disabilities. Exceptional Children, 78, 73–87.

99. Waterman, McDermott, P. A., Fantuzzo, J. W., & Gadsden, V. L. (2011). The matter of assessor variance in early childhood education—Or whose score is it anyway? Early Childhood Research Quarterly, 27, 46–54. doi:10.1016/j.ecresq.2011.06.003

100. Wilder, A. A., & Williams, J. P. (2001). Students with severe learning disabilities can learn higher order comprehension skills. Journal of Educational Psychology, 93, 268–278.

101. Wollack, J. A., & Fremer, J. J. (2013). Handbook of test security. New York, NY: Routledge.

102. Woodcock (1998). Woodcock Reading Mastery Test–Revised/Normative Update. Circle Pines, MN: American Guidance Service.