Abstract
Researchers have expressed concern about implementation fidelity in intervention research but have not extended that concern to assessment fidelity, or the extent to which pre-/posttests are administered and interpreted as intended. When studying reading interventions, data gathering heavily influences the identification of students, the curricular components delivered, and the interpretation of outcomes. However, information on assessment fidelity is rarely reported. This study directly observed the fidelity with which individuals paid to be testers for research purposes administered and interpreted reading assessments for middle school students. Of 589 testing packets, 45 (8% of the total) had to be removed from the data set for significant abnormalities and another 484 (91% of the remaining packets) had correctable errors found only in double scoring. Results indicate reading assessments require extensive training, highly structured protocols, and ongoing calibration to produce reliable and valid results useful in applied research.
In the past decade, greater emphasis has been placed on the rigor with which applied research is conducted in educational settings (National Research Council, 2002). Although not unique to a particular research methodology, the concern with quality indicators has been more evident for intervention research conducted through experimental or quasi-experimental studies (Towne, Wise, & Winters, 2005). This is likely due to the prominent role questions of “what works” play in educational reform efforts and initiatives aimed at improving outcomes for students with learning difficulties (Institute of Education Sciences, 2008; U.S. Department of Education, 2003). Among the study characteristics noted for improving the clarity of the causal inferences that can be made is the degree of implementation fidelity (Gersten et al., 2005).
Century, Freeman, and Rudnick (2008) define fidelity as “The extent to which an enacted program is consistent with the intended program model” (p. 2). In other words, fidelity concerns whether those responsible for implementing the intervention components did what the researchers and other program developers planned and, subsequently, reported. The broad concept is often subdivided into various components associated with the intervention’s design, training, delivery, and receipt (Gearing et al., 2011). From a development standpoint, implementation fidelity helps researchers identify which specific elements of a program need to be refined and, at the same time, helps them reduce the “noise” so that they can more accurately determine what works, to what extent, and under what conditions.
Because intervention studies in school settings often rely on teachers to deliver the treatments, measuring fidelity can be important to understanding any variability in outcomes. For example, findings have revealed that students whose teachers had higher fidelity to literacy interventions experienced more growth in reading than those whose teachers had lower levels of implementation (Benner, Nelson, Stage, & Ralston, 2011; Foorman & Moats, 2004; Levin, Catlin, & Elson, 2010; Stein et al., 2008). Similar conclusions were drawn in studies examining science achievement (Echevarria, Richards-Tutor, Chinn, & Ratleff, 2011; Furtak et al., 2008).
Although social scientists agree that accounting for fidelity is important to the internal validity of outcome studies, it is insufficiently reported (Gearing et al., 2011; Gresham, MacMillan, Beebe-Frankenberger, & Bocian, 2000), has no standard metric of measurement (O’Donnell, 2008), and is not explicitly featured in determinations of treatment effectiveness (Stockard, 2010; Swanson, Wanzek, Haring, Ciullo, & McCulley, 2011). Despite the concerns, reports (such as those by Century et al., 2008; Gearing et al., 2011; and O’Donnell, 2008) that have attempted to better delineate the components of implementation fidelity and how it should be measured have largely ignored one particular facet: the administration and interpretation of pre- and posttests. Grisham-Brown, Hallam, and Pretti-Frontczak (2008) referred to this as assessment fidelity, or the extent to which assessments used in research studies were performed and scored as intended.
Background on Assessment Fidelity
Two characteristics are apparent in the extant literature that might be considered a part of assessment fidelity. One is having strong interrater reliability or agreement among independent raters on how a performance-based assessment would be scored, and the other is utilizing data collectors who are blind to study conditions so as not to bias their scoring in favor of a given treatment (Gearing et al., 2011; Gersten et al., 2005). However, the larger consternation over implementation fidelity would suggest it is unwise to assume pre- and posttests consistently are being administered, scored, and interpreted by qualified examiners who are closely following directions provided in the technical manuals. As a practice guide for an educational service delivery model admonishes, “Fidelity must also address the integrity with which [assessment] procedures are completed and an explicit decision-making model is followed” (Johnson, Mellard, Fuchs, & McKnight, 2006, p. 42).
This concern appears to be well founded in educational settings. A survey of 851 special education teachers, school psychologists, and speech/language teachers revealed that practitioners tended to overrate the technical adequacy of the tests they used and lacked knowledge of the proper procedures surrounding use of the instruments (Davis & Shepard, 1983). In addition, one third to one half of those surveyed could not correctly interpret ability-achievement score discrepancies, which at the time of the study were the primary means by which students were identified with a learning disability (LD). Moreover, teachers’ opinions of which students should be placed in special education are often not in accordance with direct measures of the students’ academic performance (Gerber & Semmel, 1984; Harry, Sturges, & Klingner, 2005; MacMillan, 1998; VanDerHeyden, Witt, & Gilbertson, 2007).
Specifically with respect to students’ reading ability, the skill most commonly associated with identification of LD (Lyon et al., 2001), teacher judgments have been found less accurate and reliable at determining who needs reading intervention than curriculum-based measures of students’ oral reading fluency (ORF; Madelaine & Wheldall, 2005; VanDerHeyden et al., 2007). Yet, teachers have repeatedly been found to believe their own judgments of student performance are more helpful and informative than objective test data (Marsh, Pane, & Hamilton, 2006; Pugach, 1985; Wayman, Cho, & Johnston, 2007). Manipulating the use of student assessments to confirm a hypothesis formed about a student’s abilities has been referred to as the professional judgment model of assessment (Davis & Shepard, 1983). This threat to fidelity is based, in part, on practitioners’ perceptions of the credibility of the assessment/intervention (Reschly & Gresham, 2006). If teachers do not believe a measure is capable of providing accurate information about students’ abilities, it is reasonable to assume teachers will not feel compelled to administer/implement it well. To some extent, teacher beliefs might be based on their depth of knowledge about the instruments being used.
In two studies directly examining assessment fidelity, improvements were realized when extensive training was provided (Grisham-Brown et al., 2008; Stitt, Simonds, & Hunt, 2003). The one study involving in-service teachers (n = 18) included initial training on the performance-based measures, weekly technical assistance visits, and follow-up reliability training focused on scoring accuracy (Grisham-Brown et al., 2008). Nevertheless, there was still a great deal of variability in assessment fidelity across observations, activities, and participants. The lowest average ratings were usually given for adherence to the test procedures. This is consistent with the complexity threat to fidelity identified by social scientists in that the number of components associated with interventions/assessments and the intricacy of each component can make it more difficult to achieve consistency in their implementation/administration (Gearing et al., 2011; Gresham et al., 2000).
Even if initial proficiency is achieved, such as meeting a standard for interrater reliability during the training, there is the potential for deviations or drift from the protocols to occur over time (O’Donnell, 2008; Perepletchikova & Kazdin, 2005). In implementation fidelity, social scientists acknowledge contextual factors may warrant some variation from the theoretical ideal state (Dusenbury, Brannigan, Falco, & Hansen, 2003; Moncher & Prinz, 1991). However, in assessment fidelity, the reliability and validity of the measure are dependent on the consistency with which it can be administered, scored, and interpreted. As Cronbach (1971) advised, validity is not inherent within a test but is evaluated for each testing application.
Purpose
A lack of fidelity in the administration and interpretation of reading assessments at pretest has the potential to impact the proper placement of students in reading intervention (Madelaine & Wheldall, 2005; VanDerHeyden et al., 2007) and the curricular components of those treatments (Catts, Hogan, & Fey, 2003; Hock et al., 2009). In addition, assessment fidelity can affect the evaluation of the intervention’s effectiveness at posttest (Johnson et al., 2006). Given so little is reported about fidelity in general (Gearing et al., 2011; Gresham et al., 2000), this research directly monitored paid testers on the administration and scoring of common reading assessments to better understand the extent to which assessment fidelity might be a concern when interpreting intervention study results. The research question addressed was as follows:
Research Question 1: To what extent do testers hired for a research project administer and score reading assessments in a manner consistent with the technical manuals on which they were trained?
Previous comprehensive reviews examining issues of fidelity in intervention research have found insufficient reporting of that information, so the current study did not take a broad perspective on examining assessment fidelity. Rather, it narrowed the focus to reading interventions for middle school students to gather a representative sample of research in the field. The focus on middle school reading interventions was chosen for two reasons. First, reading difficulties are the primary reason students are identified with LD (Lyon et al., 2001), and reading interventions have been the focus of some of the U.S. Department of Education’s most expensive and ambitious initiatives (McKenna & Walpole, 2010). Second, the number of students identified with LD nearly doubles in early adolescence (Leach, Scarborough, & Rescorla, 2003; Lipka, Lesaux, & Siegel, 2006), making the middle school years a critical period for intervening and preventing more pervasive academic failure (ACT, 2008).
The purpose of this exploratory study was to examine assessment fidelity as it might impact research using large group designs, such as experimental and quasi-experimental studies. Therefore, the research focused on a testing session conducted as it might occur when assessing student abilities pre- and post-intervention, and tester behaviors were directly observed during that session.
Method
Participants
The participants of interest for this study were 29 adults hired to administer an ORF assessment to 589 middle school students in two different cities in the Southwest. As can be seen from the information in Table 1, the testers represented the range of backgrounds potentially employed as assessment administrators in research studies. The majority of the testers were female (86%) and White (55%) as is common in education (Horace Mann Companies, 2005). Reflecting the educators in the locations where the study took place, the remaining testers were Hispanic (38%; all Spanish–English bilingual) or African American (7%). By experience and role, testers were paraprofessionals in the participating school districts (14%), graduate or undergraduate students in the local colleges of education (38%), currently employed in education-related fields (17%), or retired teachers (31%).
Table 1. Tester Characteristics.
To impress upon the testers the importance of their work, all were paid US$120 per day and informed they were participating in research on the use of the assessments they were administering. Large intervention projects using a battery of measures have reported hiring testers rather than relying on teachers at the school sites to administer and score the assessments (e.g., Vaughn et al., 2010). This is consistent with recommendations that testers be blind to study conditions or not have a vested interest in the outcomes of individual students or groups of students (Gearing et al., 2011; Gersten et al., 2005).
Measure
The test administered to the sixth- through eighth-grade students was the passage reading fluency subtest of the Texas Middle School Fluency Assessment (TMSFA; Francis, Barth, Cirino, Reed, & Fletcher, 2010). At the time of the study, it was required under state legislation to be administered to middle school students who failed the annual accountability assessment of reading. The passage reading fluency subtest of the TMSFA is an individually administered ORF instrument consisting of three passages, narrative and informational, specifically assigned by grade level. All passages were equated with the mean intercorrelation ranging from .86 to .98. The criterion validity of the assessment (r = .50) was established with the state test of reading comprehension used at the time of the instrument’s development (Francis et al., 2010).
Using scripted instructions on the cover page of the testing packet, the technical manual specifies that testers are to instruct students to read each passage out loud for 1 min. As students read, the testers are to mark substitutions, mispronunciations, alterations, reversals, hesitations lasting 3 s, and skips as errors on their copy of the passage. Insertions (adding an extra word that was not a part of the sentence) are not to be counted as errors because students already suffer a loss of reading time when saying an extra word. When the 1-min timer sounds, the tester is to circle the last word the student read, cover up the passage, and then deliver a scripted prompt to retell the passage.
To add to the complexity of the testing, an identified threat to fidelity (Gearing et al., 2011; Gresham et al., 2000), different variations of the retell prompt were used. Some examiners were trained to deliver the prompt: “Tell me in your own words what this passage was mostly about.” Others were trained to deliver the prompt: “Tell me everything you remember reading in the passage.” In addition, some testers were trained to deliver the follow-up prompt, “Do you remember anything else?” whenever the student paused in retelling the passage until the student indicated he or she had nothing else to say. Other testers were trained not to deliver any follow-up prompting or encouragement to continue retelling but simply to move on to the next passage whenever the student stopped talking. These scripted instructions were provided to each tester with the testing packets. Although not a focus here, information on how alterations to the prompt affect student performance is reported by Reed and Petscher (2012).
Testers were to transcribe the student’s retell as accurately as possible for later scoring by two specially trained members of the research team. In research studies, designated scorers are commonly used to maintain acceptable interrater reliability (Puranik, Lombardino, & Altmann, 2008). However, the testers themselves were taught to calculate the number of words the student read correctly per minute (WCPM) by tallying the errors they marked and subtracting that sum from the total number of words the student read within the time limit. The transcribed retells, total number of words read, number of errors tallied, and calculated WCPM were recorded for each of the three passages on a student record sheet provided with each testing packet.
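The WCPM arithmetic described above can be illustrated with a minimal sketch. The function names, the error-type labels, and the sample values are hypothetical and are not taken from the TMSFA materials; the sketch assumes only what is stated here, namely that insertions are not counted as errors and that WCPM is the total number of words read in 1 min minus the tallied errors.

```python
# Error types counted against the student, as listed in the procedures above.
# The string labels themselves are illustrative assumptions.
COUNTED_ERRORS = {
    "substitution", "mispronunciation", "alteration",
    "reversal", "hesitation", "skip",
}
# Insertions are marked but not scored as errors.
UNCOUNTED_MARKS = {"insertion"}


def tally_errors(marks):
    """Count the marks that are scored as errors, ignoring insertions."""
    total = 0
    for mark in marks:
        if mark in COUNTED_ERRORS:
            total += 1
        elif mark not in UNCOUNTED_MARKS:
            raise ValueError(f"unrecognized mark: {mark!r}")
    return total


def wcpm(total_words_read, marks):
    """Words read correctly in the 1-min reading: words read minus errors."""
    errors = tally_errors(marks)
    if errors > total_words_read:
        raise ValueError("more errors than words read")
    return total_words_read - errors


# Hypothetical passage: 112 words read, four marks, one of them an insertion.
sample_marks = ["substitution", "insertion", "skip", "hesitation"]
sample_errors = tally_errors(sample_marks)   # 3
sample_wcpm = wcpm(112, sample_marks)        # 109
```

Counting the insertion in this hypothetical case would have lowered the score by one word, which is the kind of correctable mistake that double scoring can detect.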
Training of Testers
The first author (hereafter referred to as the principal investigator [PI]) conducted a half-day training for the testers at each of the two school sites involved in the study. Nine individuals were trained at Site 1 and the remaining 20 were trained at Site 2. The training included information on the theoretical basis of the assessment, the development of the components, and the intended uses of the measure in school settings. In addition, participants were taken through a gradual release model of instruction in how to administer the assessment. In the first phase of the instruction, the PI modeled how to administer the assessment and showed a video of another teacher administering the measure to a student. In the second phase of the instruction, the PI guided the participants in practicing the test administration procedures with the use of digital recordings of a student reading the passages. In the final phase of the instruction, participants tried administering the measure independently using a digital recording in place of a live student. The entire training had previously undergone a vetting process that included expert review and field testing with groups of teachers and professional trainers (ICF International, 2009).
During the half-day training, each tester had to demonstrate WCPM scoring accuracy in the practice administrations. Those whose marking of errors or calculation of the WCPM was off by more than ±1 on a single passage continued their independent practice using a fellow tester as the student reader until achieving the ±1 error standard in marking errors and calculating the WCPM. In addition, the testers practiced the delivery of their assigned retell prompts until demonstrating 100% accuracy.
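As a hedged illustration of the ±1 standard, the sketch below compares a tester's error tally and WCPM on each practice passage with reference values. The existence of a reference score for each practice recording, the function name, and all sample numbers are assumptions made for illustration; the study reports only the ±1 criterion itself.

```python
def meets_scoring_standard(tester_scores, reference_scores, tolerance=1):
    """Return True if, on every passage, both the error tally and the WCPM
    are within `tolerance` of the reference values.

    Each argument is a list of (errors, wcpm) tuples, one per passage.
    """
    if len(tester_scores) != len(reference_scores):
        raise ValueError("score lists must cover the same passages")
    return all(
        abs(t_err - r_err) <= tolerance and abs(t_wcpm - r_wcpm) <= tolerance
        for (t_err, t_wcpm), (r_err, r_wcpm) in zip(tester_scores, reference_scores)
    )


# Hypothetical reference scores for three practice passages.
reference = [(4, 108), (6, 97), (3, 115)]

# Off by 1 on the second passage only: meets the standard.
tester_ready = [(4, 108), (5, 98), (3, 115)]

# Off by 2 on the second passage: continues independent practice.
tester_more_practice = [(4, 108), (8, 95), (3, 115)]

ready = meets_scoring_standard(tester_ready, reference)                 # True
needs_practice = not meets_scoring_standard(tester_more_practice, reference)  # True
```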
The two research assistants (RAs) were specially trained to score the retells according to procedures developed by Reed and Petscher (2012). One RA, a bilingual Hispanic female, was an experienced middle school reading interventionist and a certified reading teacher. The other, a White male, was a certified science teacher who had been working on a statewide content area literacy initiative for the past 2 years. Both had master’s degrees in an education-related field. The RAs worked with the PI to identify, discuss, and agree upon the most important ideas in the passages. Next, the idea units were ordered to align with the word counts so that a scorer could determine how many idea units a student who read to a given point in the passage could possibly retell. The number of idea units actually retold by the student could then be divided by the number it was possible for him or her to retell, yielding a percentage of important ideas retold.
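The retell percentage can be sketched as follows. This is an illustration only: the word positions assigned to each idea unit and the rule that a unit is "possible" once the student has read through its final word are assumptions, not details drawn from the Reed and Petscher (2012) procedures.

```python
from bisect import bisect_right


def possible_idea_units(unit_end_positions, last_word_read):
    """Number of idea units whose text ends at or before the last word read.

    `unit_end_positions` is a sorted list giving, for each idea unit, the
    word count at which that idea is complete in the passage.
    """
    return bisect_right(unit_end_positions, last_word_read)


def percent_ideas_retold(units_retold, unit_end_positions, last_word_read):
    """Idea units retold divided by idea units possible, as a percentage."""
    possible = possible_idea_units(sorted(unit_end_positions), last_word_read)
    if possible == 0:
        return 0.0  # the student did not read far enough to reach any idea unit
    if units_retold > possible:
        raise ValueError("more units credited than were possible to retell")
    return 100.0 * units_retold / possible


# Hypothetical passage with six idea units ending at these word counts.
unit_ends = [15, 34, 52, 80, 101, 130]

# A student who stopped at word 85 could have retold the first four units.
possible = possible_idea_units(unit_ends, 85)      # 4
percent = percent_ideas_retold(3, unit_ends, 85)   # 75.0
```

Dividing by the units possible rather than by all units in the passage keeps slower readers from being penalized for ideas they never reached.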
The RAs independently scored sample responses to the passages and then calculated their interrater agreement. This was done by dividing the number of agreements on the proportion of idea units included in a student’s retell by the total number of retells scored. Initially, the observed agreement was only 66%, so the RAs discussed the reasons for their scores to improve their calibration. They continued practicing their scoring of sample responses until their interrater agreement on a set of 25 retells was consistently above the desired 85% threshold (Bracey, 2000). With ongoing calibration during the scoring, the RAs were able to maintain 90% or better agreement.
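The agreement calculation described above can be sketched as a simple proportion. The scores below are hypothetical, and treating two scores as an agreement only when they match exactly is an assumption; the study does not specify how close two scores had to be to count as agreeing.

```python
def percent_agreement(scores_a, scores_b):
    """Percentage of retells on which two raters gave the same score."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both raters must score the same non-empty set of retells")
    agreements = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return 100.0 * agreements / len(scores_a)


# Hypothetical set of 25 retells, scored as idea units credited.
rater_a = [3, 4, 2, 5, 1] * 5
rater_b = list(rater_a)
rater_b[0] = 4    # three retells on which the raters differ
rater_b[7] = 3
rater_b[19] = 2

agreement = percent_agreement(rater_a, rater_b)   # 88.0
meets_threshold = agreement >= 85.0               # True
```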
Testing Procedures
When administering the ORF assessment, the testers sat across from each student assigned to them and were instructed to follow the scripted procedures preprinted on the cover sheet of the examiner’s test packet. Approximately 4% of the students were absent on the day of testing and were administered the test on the subsequent day. On average, a tester assessed 20 students over a 2.5-day period that immediately followed the half-day of training.
The PI and two RAs remained in the testing rooms the entire time the measure was being administered to monitor the testers’ fidelity to the delivery of directions, starting and stopping the timer, marking errors as students read aloud, delivering the retell prompts, recording students’ retell responses, and transitioning to the second and third passages. There were no more than 10 testers in a room, and their times with individual students began at staggered intervals to facilitate careful observation and avoid students overhearing each other’s responses. Tester–student pairs were distributed around the room with students facing the walls to minimize distractions. Whenever an error in administration procedures was observed, the PI or RAs flagged the testing packet and noted the particular problem on the student roster.
Analysis
With the exception of the retell responses, the ORF assessment was initially scored by the individual who administered the test and was scored a second time by the RAs. During the double scoring, the RAs identified the mistakes testers made in the packets. The RAs also scored the retells when checking each packet and then exchanged all packets so that the retells could be scored a second time. As they found disagreements in their retell scoring, the RAs discussed their reasoning to improve their calibration.
Results
During the administration of the ORF assessment, 45 test packets (8% of the total sample) were flagged for significant abnormalities that could not be corrected. These losses of fidelity included forgetting to administer a passage, forgetting to start the 1-min timer, not stopping the student or circling the last word when the timer sounded, not delivering a retell prompt at all, not adhering to the scripted retell prompt, and forgetting to cover up the passage when the student was retelling it. In each case, the tester was unaware he or she had deviated from the protocol until the packet was flagged by the PI or an RA and the reason explained. However, there were two abnormalities that seemed to be more than simple forgetfulness. Both involved testers who were retired teachers.
One tester was observed asking a student to repeat what she was saying, which caused a loss of time during the 1-min reading and may have discouraged the student from retelling information afterward. For example, after repeatedly being asked to restate part of her response, the student stopped attempting to restate it and instead said, “Nothing.” Hence, the retell that was recorded for scoring purposes was a truncated version of the response she attempted to provide orally. When asked whether there was a problem, the tester revealed to the researcher he had a slight hearing loss that he did not disclose when hired and that did not surface as an issue during the training where he met the initial reliability requirement.
The PI flagged his test packets, reminded the tester of the protocol, and explained to him the reasons why asking a student to repeat information affected the results. The RAs subsequently observed the tester not marking errors as a student was reading orally, presumably because he could not hear the student well enough to detect when an error was made. However, the tester was not forthcoming with this information and tried to insist that he was able to continue working on the project despite his difficulty hearing. After he was moved to a location in the testing room that was set slightly farther apart from the others, he was again observed asking a student to repeat information. All of his testing packets were flagged for separate analysis.
Another tester had been assigned to deliver a retell prompt that did not include follow-up prompting. However, the PI observed her using hand motions to encourage a student to provide more information in his retell until he stated a particular detail. At that point, the tester pumped her fist and said, “Yes!” When the PI flagged the test packet, the tester insisted she had not provided follow-up prompting because she had not said the phrase, “Do you remember anything else?” that other testers were trained to deliver. Although the PI explained to her that the hand motioning and other encouragement was tantamount to follow-up prompting, the tester continued to deny she had deviated from her protocol and asked an RA about it during a break. Any of the tester’s packets that could not be confirmed by the PI or RAs as having been administered with fidelity were separated from the larger data set.
After removing the 45 packets that had been flagged for significant abnormalities, the RAs double scored the remaining 532 testing documents and noted when mistakes were found. Of these, 484 test packets (91%) had correctable mistakes that included counting insertions as errors, miscounting the number of errors, and miscalculating the WCPM. A higher rate of error was found among the tests scored by retired teachers (94%), but this was highly variable by individual (range of 87%–100%) and influenced by the one tester with a hearing difficulty.
While scoring the students’ retells, the RAs kept track of the number of responses for which they had discrepant scores. Discrepancies varied from a low of 5% to a high of 20% of the responses, depending on the passage. For some passages, student responses were not as closely matched to the statements on the scoring guide. The range in discrepancies was often due to raters disagreeing about whether a loose paraphrase of content warranted credit. As these issues were identified in the double scoring, the RAs resolved them in discussion with the PI. The overall average interrater agreement remained 90% or better (with an intraclass correlation of .98) through the RAs’ ongoing calibration.
Discussion
This study examined the assessment fidelity among a group of 29 testers with a range of backgrounds who were hired to administer an ORF assessment to students in Grades 6 to 8. Results suggest testers potentially introduce significant threats to the reliability of assessments and the internal validity of the studies in which they are used. Fully 8% of the test packets had abnormalities that resulted in the data either being missing or uncorrectable—thus precluding their use in data analyses or instructional decision making. However, had the PI and RAs not been monitoring the administration of the test, the uncorrectable errors would have made it into the final data set without the research team being aware some of the information was faulty.
Most of the more egregious problems were procedural, as was found by Grisham-Brown et al. (2008), and were committed unwittingly. The flagging of packets served as a sort of calibration process because the testers were made aware of the error that had been observed. Without the monitoring and feedback, it is likely there would have been a higher prevalence of problems—problems, again, that were uncorrectable and potentially unknown to the research team who would be interpreting students’ results. Of note among the egregious abnormalities were the two testers (7% of the tester group) who were not convinced their alterations of the protocols represented drift that was introducing a problem with the data.
Based on previous work suggesting that more thoroughly trained testers have higher fidelity (Grisham-Brown et al., 2008; Stitt et al., 2003), the training provided here included information on the theoretical basis of the assessment, the development of the components, and the intended uses of the measure in school settings. In addition, the gradual release model of instructing testers on the administration helped them meet the initial reliability standard of making no more than ±1 error in marking errors during the oral reading and making no errors on the delivery of the retell prompt. The content and delivery of the training had been vetted and field-tested with numerous groups prior to its use in this study (ICF International, 2009). Nevertheless, the combined efforts to increase the quality and rigor of the training did not inoculate the testers against an increase in errors across their approximately 20 administrations of the measure over 2.5 days.
The stunning rate of correctable errors, found in 91% of the test packets remaining in the data set, is assumed to be far higher than what would occur with group-administered tests that are machine-scored. Despite their limitations, performance-based tests of ORF are considered one of the better predictors of students’ reading achievement (Reschly, Busch, Betts, Deno, & Long, 2009; Yeo, 2010), and retell measures are one of the most common classroom-based comprehension assessments (Cohen, Krustedt, & May, 2009). Without double scoring, the results of this study suggest research findings and classroom teachers’ judgments might be highly inaccurate. It should also be noted that the errors observed here were all made in administering a single assessment. It is not uncommon in reading intervention research to administer multiple tests at a time, so the rate of problems could be compounded by having so many different procedures to remember and monitor.
Finally, the use of the retell scoring guide required a more robust calibration process than the objective WCPM calculation. Without ongoing discussion of discrepancies, interrater reliability of retell scoring might have plunged to 80%, which is below the desired threshold (Bracey, 2000). By continuously calibrating their application of the instrument that they helped to devise, the highly trained and knowledgeable RAs were able to maintain an average 90% or better agreement. Because raw agreement is a more liberal index than that adjusted for chance, a higher percent is desirable (Tinsley & Weiss, 2000). However, the requisite process involved raises questions about a more distributed and less regulated use of retell scoring mechanisms in outcome studies and when increasing the number of raters involved in the process.
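The difference between raw agreement and a chance-adjusted index can be seen in a small worked sketch. Cohen's kappa is used here as one common chance-corrected statistic; it is not necessarily the index the cited authors had in mind, it was not reported in this study, and the ratings below are hypothetical.

```python
from collections import Counter


def raw_agreement(ratings_a, ratings_b):
    """Proportion of items on which two raters assigned the same category."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    return sum(a == b for a, b in zip(ratings_a, ratings_b)) / len(ratings_a)


def cohens_kappa(ratings_a, ratings_b):
    """Agreement adjusted for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(ratings_a)
    p_observed = raw_agreement(ratings_a, ratings_b)
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    # Agreement expected by chance from each rater's category frequencies.
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_expected == 1:
        return 1.0  # both raters used a single identical category throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical decisions on whether 10 loose paraphrases earn credit.
rater_a = ["credit"] * 8 + ["none"] * 2
rater_b = ["credit"] * 7 + ["none", "credit", "none"]

p_o = raw_agreement(rater_a, rater_b)   # 0.8
kappa = cohens_kappa(rater_a, rater_b)  # approximately 0.375
```

In this hypothetical case, 80% raw agreement corresponds to a much lower chance-adjusted value because both raters award credit most of the time, which illustrates why a higher raw percentage is desirable.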
Summary and Implications
This study sought to contribute knowledge about the extent to which assessment fidelity might be a concern in intervention research. The inattention to assessment fidelity is not a trivial matter. If undetected and/or uncorrected, the high number of threats introduced by the relative level of tester expertise, test complexity, and drift had the potential to significantly alter results at the individual and aggregate levels.
Accepted standards for interrater reliability imply up to 15% nonagreement among testers might be inevitable (Bracey, 2000; Tinsley & Weiss, 2000). Interrater agreement does not seem a reasonable proxy for assessment fidelity in the larger sense, but it is one of only two testing-related fidelity elements currently mentioned in documents outlining quality indicators for intervention research (Gearing et al., 2011; Gersten et al., 2005).
The question then becomes: What are the implications for students when testers improvise with protocols that are designed to be used with fidelity? Because reading assessments are used to move students into and out of instructional interventions (Madelaine & Wheldall, 2005; VanDerHeyden et al., 2007) as well as design the particular components of those treatments (Catts et al., 2003; Hock et al., 2009), inaccurate test data have real consequences. Not only would the true effects of a studied intervention be obscured (Stockard, 2010), but also, the individual students who had the misfortune of experiencing low assessment fidelity could be denied the most advantageous outcomes possible (Benner et al., 2011; Levin et al., 2010).
This study also highlights a set of perennial challenges in obtaining optimal assessment fidelity. Hiring highly skilled testers and maintaining their expertise through ongoing training with corrective feedback and calibration efforts may mitigate problems with accuracy, albeit at a cost (Waltz, Addis, Koerner, & Jacobson, 1993), but teacher beliefs about the efficacy of instruments and their own ability to diagnose reading needs may play a bigger role in this context than might be expected. Other research has found teachers are less likely to believe in the accuracy and usefulness of externally imposed assessment tools than their own judgments (Marsh et al., 2006; Wayman et al., 2007). Thus, a follow-up question becomes, “Is it possible to offer a commensurable literacy assessment system that makes space for teacher expertise and helps generate teacher belief in and proper implementation of externally developed literacy assessments?”
Limitations and Directions for Future Research
Although the findings presented here were consistent with those of other research (Grisham-Brown et al., 2008; Stitt et al., 2003) that points to the need for extensive tester training, it is still unclear how much training is enough and in what proportions that training should combine background or theoretical information on the measures, demonstration of administration and scoring, practice opportunities, and follow-up sessions. In addition, more information is needed on the kinds and number of practice opportunities or follow-up activities that optimize tester performance.
As a more exploratory look at assessment fidelity issues, this study did not attempt to determine each tester’s level of knowledge about the constructs being tested or the measure itself. This information might be useful in qualifying testers if future research establishes a correlation between such indicators of expertise and an individual’s degree of fidelity to the protocols. The study also did not attempt to probe tester beliefs or sensemaking to determine what thinking underlies improvisation or increases fidelity, and whether it is possible to do both simultaneously in an era in which accountability and efficacy are touted with similar levels of volume. Teacher beliefs and sensemaking in the context of literacy assessment remain greatly underexplored. Borrowing sociological case study approaches to understanding assessment from teachers’ perspectives, such as the one used by Washington and Humphries (2011), may help resolve the assessment fidelity conundrum.
It should be noted that this study was designed to parallel assessment as it occurs in studies using group designs and large numbers of participants. Therefore, paid testers, rather than the students’ own classroom teachers, administered the measures. No conclusions can be drawn about teacher fidelity to assessment procedures. It is also premature to generalize findings from this particular pool of testers. Because this was an exploratory study with no extant research to indicate the number or type of possible problems to expect, only three observers were used to monitor up to 10 testers at a time. Care was taken to facilitate observation of the testers, but the problems encountered were far greater than anticipated. The PI and RAs were aware that as an issue arose with one tester, they became sensitized to that individual and that type of error, potentially to the detriment of catching other issues with other testers. Hence, the true percentage of accuracy for each tester cannot be determined with confidence. This raises two important points. First, best practice in determining a student’s reading achievement would involve the use of multiple measures and multiple testers to prevent negative consequences from imperfect assessment tools and administration. Second, future research on assessment fidelity should monitor testers more systematically so that one tester is observed at a time.
Finally, it is not possible to draw conclusions about the environmental factors that might affect assessment fidelity such as tester-to-student ratios, the number of testers per room, or the number of students being tested at one time. When conducting intervention research in school settings, the researchers are often at the mercy of what space and time the administrators and teachers can reasonably afford them. Although it was not preferred to have 10 testers per room, teachers had to move their classes to provide any rooms at all in which to conduct the testing. Distractions were minimized to the extent possible by dispersing tester–student pairs throughout the room, facing students toward walls, and starting at staggered intervals. However, the increased ambient noise in the testing rooms relative to the training rooms clearly impaired the ability of one tester to hear his examinee well enough to administer the measure with accuracy. It is possible other testers and students in this study were affected by the testing environment as well. These factors may not be within the purview of the researchers to control, so future research on assessment fidelity might investigate whether, and which kinds of, environmental factors significantly affect results. In addition, research might determine whether assistive devices, such as whisper phones for students reading aloud, and tester training conditions that match the school setting have the potential to reduce environmental threats to administering tests according to protocol.
Current guidance on ensuring the quality of intervention research emphasizes what should be done during the design, intervention implementation, analysis, and reporting of the study (Gersten et al., 2005; Towne et al., 2005). However, little is offered to guide the assessment phases. Researchers are admonished to select measures of high quality with established reliability and validity, but not necessarily to monitor the quality with which those measures are used. Findings reported here might caution reading intervention researchers against taking for granted that testers are uniformly qualified and consistently administering assessments as intended, even when accounting for the initial training testers receive. This study demonstrates the need for focusing more attention on assessment fidelity in implementing studies and interpreting their results.
The results have introduced a new set of issues into a seemingly straightforward realm of implementation. From a programmatic perspective, there may be an issue of compliance with protocols and the related issue of adequate training on the measures. The authors believe, however, that other areas of inquiry must be explored to better understand literacy assessment quality. For instance, what additional factors need to be addressed and defined before research can accurately define causality? Closely related to this, what are testers’ beliefs about literacy assessment and how do those beliefs affect testers’ willingness and ability to assess with fidelity to established protocols? Is it possible to work within the framework of those beliefs to enhance assessment fidelity? Beyond the research needs identified in this article, the study has identified a rich opportunity for protocol and training development that mediates teacher and researcher discourse and conceptual frames.
Footnotes
Authors’ Note
The content is solely the responsibility of the authors and does not necessarily represent the official views of the Texas Education Agency.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by contract No. 2356-27235 from the Texas Education Agency.
