Abstract
Running Records are thought to be an excellent formative assessment tool because they generate results that educators can use to make their teaching more responsive. Despite the technical nature of scoring Running Records and the kinds of important decisions that are attached to their analysis, few studies have investigated assessor accuracy. We measured precision across 114 teachers who were given a pre-coded Running Record to analyze by comparing their quantification and interpretation of the record against a scoring key. We also used Rasch measurement to examine which items teachers found easy and hard to score accurately. Analyses revealed wide variation in teachers’ accuracy, particularly in interpretation, that, according to item difficulty analysis, was caused by a few specific mistakes in scoring. Implications for improving training so that individuals administer Running Records more reliably are shared.
Scoring running records: Complexities and affordances
Running Records, an assessment tool that relies on systematic observation to produce a coded written record of an individual student’s oral reading, is widely used to monitor beginning readers’ word identification strategies and reading levels (Stahl et al., 2020; Temple et al., 2018). It is said to be an excellent tool for formative assessment in that it provides a precise account of everything a student did while reading independently, information that a teacher can use to inform instruction (Harmey and Kabuto, 2018; Johnston and Afflerbach, 2015).
Reading specialists, teachers, and researchers use the results to identify a text level for instruction, group students for teaching, and measure and compare growth (Bean et al., 2002; Kragler et al., 2015). The results are used across a wide variety of instructional settings including one-to-one and small groups (Barone et al., 2019), and with a variety of students including multilingual students (Briceño and Klein, 2019; Peregoy and Boyle, 2017), and those having difficulty learning to read (McCormick and Zutell, 2015).
Yet, despite the technical nature of scoring a record, few studies have investigated assessor accuracy. This lack of reliability evidence is surprising, given the assessment tool’s wide use and the multistep nature of administering and scoring; there are multiple sources of potential error that, if not addressed, would reduce score reliability. Indeed, Running Records may be an excellent formative assessment tool, but only if reliable results are generated; otherwise, instruction can be misinformed and important decisions made with unintended consequences for early literacy development.
Thus, the purpose of this study is to report teacher precision in quantifying and interpreting the same, already-coded Running Record. We report findings from our study in which over 100 teachers who were recently trained in Running Record administration and scoring were given the same precoded Running Record to analyze. We also report which items teachers found easier and harder to complete accurately. Knowing more about Running Record assessor accuracy and identifying likely sources of difficulty with accurate scoring may contribute to the reliability and validity of results and thus improve the quality of information generated to plan beginning instruction.
Running Records
The Running Records task is one of six assessments contained in An Observation Survey of Early Literacy Achievement (Clay, 2013). The assessment, developed by Clay (1966), contains standard directions for administering and scoring (Clay, 2013, 2019; McCormick and Zutell, 2015) in order to produce reliable and valid results (Clay, 2013: 12), as might be expected of any useful measurement tool.
Along with informal reading inventories (IRIs) and miscue analysis (Goodman, 2015), Running Records can be classified as a type of oral reading inventory, which generally is a group of assessments that rely on observing and using standard codes to record oral reading behaviours. Yet, even though Running Records are similar to miscue analysis and IRIs, they are different in many important ways.
Unlike IRIs, which use preprinted grade-level passages and are designed to be administered two or three times a year to determine instructional reading levels, Running Records are taken of texts selected by the teacher and are designed to be given frequently in order to monitor progress (Clay, 2013: 53; Stahl et al., 2020) and inform instruction (Clay, 2013: 52; Stahl et al., 2020). In that sense, because teachers can choose the texts to use, Running Records are sometimes called “teacher-made” assessments (Temple et al., 2018: 328). IRI passages are ordered in grade-level bands, whereas Running Records can be taken on any text read by the student, making it possible for a teacher to measure progress across finer gradients of time than a grade level, a desirable quality for an early reading assessment.
Running Records are like miscue analysis in that they are both formative assessments used to directly inform instruction, and they both use a shorthand code to produce a written record of oral reading (Wang et al., 2020). They differ, however, in several key ways, not only in the shorthand used to code reading and the procedures for administering and scoring, but also in how oral reading behaviours are interpreted (Harmey and Kabuto, 2018). Miscue analysis generally focuses on older students having difficulty with reading, and meaning construction is emphasized over accuracy, whereas Running Records, designed for beginning readers, focus on accuracy and emphasize using all three sources of information (Wang et al., 2020). The theory is that the reader is checking one or more kinds of information against another (Clay, 2016: 135), and that later, this more primitive cross-checking behaviour will be superseded by self-correcting, in that subsequent attempts will correct errors (Clay, 2016: 136).
Despite the informal-sounding nature of the assessment, Clay argues that Running Records have characteristics in common with sound measurement instruments, in that the assessment is “… a standard task with standard administration and with standard scoring procedures [that] provide sound measurement conditions” (Clay, 2013: 12). These standard procedures for administering and scoring are included in a standalone manual called Running Records for Classroom Teachers (Clay, 2000) and in all editions of An Observation Survey of Early Literacy Achievement published over the last 50 years (Clay, 2019). No matter the context for their use, whether a classroom setting or a Reading Recovery lesson, the directions for administering, scoring, and interpreting Running Records are the same.
Two kinds of analysis, quantifying a record and interpreting errors, are usually applied after making a Running Record. Quantifying is often used for screening purposes or progress monitoring: an assessor takes Running Records on increasingly higher-level texts until obtaining the highest level that the student can read with between 90% and 94% accuracy. That highest text level is said to be the student’s instructional level. Percentages below the instructional range are considered frustration level, while percentages above the instructional range are considered easy.
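The banding rule just described can be sketched in a few lines of Python. This is a minimal illustration rather than part of the assessment protocol: the function name `accuracy_band` is hypothetical, and rounding to a whole percentage before comparing against the 90% to 94% range is an assumption, since the text does not say how values between 94% and 95% are treated.

```python
def accuracy_band(accuracy_percent: float) -> str:
    """Classify an accuracy percentage into the three bands described above."""
    # Assumption: accuracy is compared as a whole percentage.
    rounded = round(accuracy_percent)
    if rounded < 90:
        return "frustration"
    if rounded <= 94:
        return "instructional"
    return "easy"


if __name__ == "__main__":
    for value in (86, 92, 97):
        print(value, accuracy_band(value))
```

Under these assumptions, a text read with 92% accuracy falls in the instructional band, while 86% and 97% fall in the frustration and easy bands, respectively.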
There is an optional, second level of analysis, called interpreting errors, that can be conducted. As Clay (2000, 2013) notes, this second analysis is informed by a theory of reading that would say beginning readers are using and neglecting information from various sources as they try to decode unknown words. This second level of analysis can inform instruction, provided the teacher shares the same view of the reading process (Clay, 2000: 21). Table 1 provides a summary of which behaviours count as errors and whether to interpret them. In the next sections, we describe the two steps, quantifying and interpreting, in more detail.
Table 1. Summary of protocol for quantifying and interpreting a Running Record.
Quantifying the Running Record
Standard codes are used to produce Running Records; they include reading accurately, repeating, substituting, omitting, inserting, baulking, appealing for help, partial attempts, and self-correcting. Of those codes (Clay, 2013: 59–62), five are tallied as errors: omissions, insertions, baulking, appeals for help, and substitutions (pp. 65–67). Substitutions for proper nouns are tallied as errors only once, no matter how many different substitutions are used for the same proper noun (Clay, 2013: 66). Sounding out accurately is neither an error nor a self-correction (Clay, 2013: 65), but making a wrong partial attempt and then fixing it is counted as a self-correction (p. 75), as in the examples which follow:
Text: He went for a walk.
Student A: “He went for a w-, wa-, walk.” (Accurate reading)
Student B: “He went for a run, walk.” (Self-correction)
Student C: “He went for a waddle, walk.” (Self-correction)
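The tallying rules in this section can be illustrated with a short Python sketch. It is a simplified reading of the rules, not Clay's procedure in full: the tuple layout `(text_word, code, is_proper_noun)`, the code labels, and the sample record are all hypothetical, and the sketch assumes each entry describes one word of text.

```python
# Codes that the text lists as tallied errors.
ERROR_CODES = {"omission", "insertion", "baulk", "appeal", "substitution"}


def tally(record):
    """Return (errors, self_corrections) for a coded record.

    Each entry is (text_word, code, is_proper_noun); this layout is an
    illustrative assumption, not a standard Running Record format.
    """
    errors = 0
    self_corrections = 0
    seen_proper_nouns = set()
    for text_word, code, is_proper_noun in record:
        if code == "self-correction":
            self_corrections += 1
        elif code in ERROR_CODES:
            if code == "substitution" and is_proper_noun:
                # Substitutions for the same proper noun count only once.
                if text_word in seen_proper_nouns:
                    continue
                seen_proper_nouns.add(text_word)
            errors += 1
        # "accurate" readings, including accurate sounding out, add nothing.
    return errors, self_corrections


# Hypothetical record, for illustration only.
sample = [
    ("Jim", "substitution", True),
    ("walk", "self-correction", False),
    ("Jim", "substitution", True),   # same proper noun: not tallied again
    ("the", "omission", False),
    ("still", "accurate", False),
]

if __name__ == "__main__":
    print(tally(sample))
```

For the hypothetical record above, the sketch tallies two errors and one self-correction, because the second substitution for "Jim" is not counted again.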
Interpreting a Running Record
Scoring can end with Step One, quantifying, or if the assessor wants to know how the reader is using or neglecting three sources of information, a second level of analysis can be applied. Directions for this second stage of analysis are: “For every error, ask yourself at least three questions: Did the meaning or the messages of the text influence the error? Did the structure (syntax) of the sentence up to the error influence the response? Did visual information from the print, influence any part of the error: letter, cluster or word?” (Clay, 2019: 72). Note that visual information does not refer to the pictures; it refers to the print, including words and subword parts.
If a self-correction is made, the error is analyzed in a two-step process: first considering what was used initially and then what extra source of information was added to influence the self-correction (Clay, 2019: 73). Returning to the earlier examples of Students A, B, and C, here is how the two-step process would be applied for two of the students:
Text: He went for a walk.
Student B: “He went for a run, walk.” (Self-correction)
For Student B, meaning and structure probably influenced the error of run for walk. It could be inferred that in a second step, Student B then probably noticed visual information (the letters in the word walk) and used it to self-correct the error.
Student C: “He went for a waddle, walk.” (Self-correction).
On the other hand, Student C’s error (waddle for walk) was probably influenced initially by all three sources of information: meaning, structure, and visual; waddle makes sense in the story, uses an acceptable oral language structure, and shares some visual information with the word “walk”. The student then added additional visual information, going beyond the first letter, to self-correct the error. The administration directions that accompany the Running Records task in An Observation Survey specify that each and every error should be analyzed rather than looking at errors selectively (Clay, 2013: 71). After each error is analyzed, the teacher looks across the Record to interpret a pattern in sources of information used and neglected (Clay, 2013: 72). Results from this pattern analysis can then be used to guide subsequent teaching (Clay, 2013: 72).
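As a rough illustration of looking across a Record for a pattern, the sketch below counts how often each source of information was used or neglected in the initial attempts. The data structure and the sample entries are hypothetical; they mirror the interpretations given above for the "run" and "waddle" substitutions but are treated here as if they came from a single reader's Record.

```python
from collections import Counter

SOURCES = ("M", "S", "V")  # meaning, structure, visual

# Hypothetical analyzed errors for one reader, for illustration only.
analyzed_errors = [
    {"said": "run", "text": "walk", "used": {"M", "S"}, "added": {"V"}},
    {"said": "waddle", "text": "walk", "used": {"M", "S", "V"}, "added": {"V"}},
]


def source_pattern(errors):
    """Count sources used and neglected across the initial attempts."""
    used = Counter()
    neglected = Counter()
    for error in errors:
        for source in SOURCES:
            if source in error["used"]:
                used[source] += 1
            else:
                neglected[source] += 1
    return used, neglected


if __name__ == "__main__":
    used, neglected = source_pattern(analyzed_errors)
    print("used:", dict(used))
    print("neglected:", dict(neglected))
```

In this small example, meaning and structure are used in both errors, while visual information is neglected once, which is the kind of pattern a teacher might look for across a full Record.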
As displayed in Table 1, not every error is included in the error tally. Omissions, insertions, baulking and appealing for help are errors but are not interpreted because there is no attempt or word in the text to analyze. However, multiple attempts at one time for a word and alternative attempts for the same proper noun are errors but are not included in the error tally. It is rather problematic that the directions for interpreting “each and every” error do not distinguish between tallied errors and all errors. We interpret each and every error to include multiple attempts at the same word (saying “hold” then “stay” for the word “shake”) and alternative attempts for a proper noun (saying “Jack”, “Jay”, “James” at various times throughout the story for the name “Jim”), even though they are not tallied.
We find support for interpreting each attempt at a proper noun or multiple attempts at a single word in a story, in several places. First, there is no direction to do otherwise with these errors. Second, it is in keeping with the theory underlying Running Record analysis; readers who self-correct or make multiple different attempts to solve a word appear to be checking sources of information against one another in a multi-step process (Clay, 2001). Thus, interpreting each error, and not just one of the errors, is theoretically coherent with the literacy processing view underlying Running Records (see Doyle, 2013). Finally, we find examples of this interpretation in common use for Reading Recovery (Fried, 2013: 7).
It seems reasonable to conclude that careful analysis of sources of information used and neglected in a student’s errors is important, given that it might help to determine whether a reader is indeed following a pattern of neglecting a particular source of information which will inform teaching. As Clay states, reliability of scoring is important “… because we do not want to alter our teaching or decide on a child’s placement on the basis of flawed judgements” (Clay, 2013: 13). It also seems reasonable to conclude that some level of agreement is to be expected across multiple raters of the same already-coded Running Record if standard interpreting is possible.
Reliability evidence for quantifying and interpreting Running Records
Clay reported two reliability studies in her dissertation for quantifying Running Records. In a test-retest study, she coded and scored 46 Running Records using audiotapes of readings that she made two years previously during her data collection. The level of agreement she achieved for coding errors at two points in time was very high (0.98), but much lower for self-corrections (0.68). She attributed differences in her scoring of self-corrections to the two different settings, one an in-person observation and the other from an audiotape (Clay, 1966: 344).
Clay also conducted an interrater reliability test using her dissertation data. After one hour of training, five graduate students and a stenographer took 12 Running Records while listening to Clay’s dissertation audiotapes. Clay reported no significant differences between the raters’ coding and scoring of errors and self-correction and her own from two years prior (Clay, 1966: 345).
Finally, a second and more recent interrater reliability study reported high agreement (R² = .92) between two experienced assessors when they coded and scored Running Records to determine the highest instructional reading levels of 24 students. The two assessors also rank ordered the text levels by percentage of accuracy and achieved high agreement (R² = .96) on their rankings (Denton et al., 2006).
These prior reliability studies (Clay, 1966; Denton et al., 2006) are informative but also insufficient. They examined Running Record scores taken by trained raters, and all focused on the reliability of scores after raters coded and quantified the records. They also focused entirely on quantifying, not interpreting, a Running Record; in fact, to our knowledge, no study has been undertaken to examine the reliability of interpreting sources of information used and neglected. All three studies used a small sample of raters. Two studies are rather dated and were conducted when Clay first developed Running Records; since that time, she has revised the coding and refined the scoring procedures. In addition, both Clay studies measured reliability two years after the Records were first scored. This gap is not ideal for a reliability study because it introduces additional variance beyond differences in raters, just as Clay reported finding with her scoring of audiotapes two years after they were made. Finally, the Denton et al. reliability study does not examine the accuracy of the two assessors, only the extent of their agreement on calculating percentage of accuracy. There is a chance that both assessors made errors in counting, so that even though they agreed with each other, they may both have been wrong. This paucity of research about the reliability of Running Records results is somewhat surprising, especially given the technical nature of scoring assessments and the need for accuracy and precision.
Thus, this current study expands on prior Running Record reliability work by evaluating the extent to which a group of recently-trained teachers consistently quantified and interpreted the same pre-coded Running Record. These questions guided our enquiry: (1) How accurate are recently-trained teachers when quantifying an already-coded Running Record in terms of counting number of errors and calculating accuracy percentage, self-correction ratio, and error ratio? (2) How accurate are recently-trained teachers when interpreting which sources of information (meaning, structure, visual) are used and neglected? (3) Which items in research questions 1 and 2 (counting number of errors, calculating accuracy percentage, self-correction ratio, error ratio, and interpreting which sources of information [meaning, structure, visual] are used and neglected) were easy or hard for recently-trained teachers to complete correctly?
Our findings for research question 3 will allow us to go beyond addressing how reliable the teachers’ scoring was to obtain a better understanding of what items were difficult to score correctly. Together, our findings can add to the small, now dated, body of evidence about the reliability of Running Record results, and contribute new understandings about the design of Running Record assessment training that can enhance the reliability and validity of scores that are obtained.
Method
Accuracy of scoring and interpreting an already-coded Running Record was assessed based on 114 teachers’ counting of errors and self-corrections, calculation of percentage of accuracy, self-correction ratio, error ratio, and their decisions about which sources of information were used and neglected for each substitution. Teachers’ responses to each item were compared to a scoring key (shared in the Measure section). Each item received one of two possible scores: 0 if the response was incorrect and 1 if it was correct.
Data sources
We examined extant Running Record data from a semester-long professional development initiative working with 114 teachers enrolled in graduate coursework (not Reading Recovery). Teachers learned how to use several literacy-related assessments in the professional development course; Running Records were included among them.
The plan for Running Record training included three 3-hour class lectures, with teachers also giving Running Records to a student at least three times a week. As such, the time allotted to learn how to administer and score Running Records was well-aligned with Clay’s recommendation to spend about “three workshop sessions” (Clay, 2013: 54) as well as Ross’s (2004) six hours of training. The training included lectures that covered these topics: recognizing the codes used when taking a Running Record; identifying which codes count as errors; producing a Running Record while listening to contrived transcriptions read aloud; practising calculating accuracy, the error ratio, and the self-correction ratio; and interpreting Running Records to decide whether errors used or neglected meaning, structure or visual information. We used guided participation to teach each of these topics, first showing how to calculate and interpret a Running Record, and then having participants score and interpret Running Records on their own with immediate feedback. Actual Running Records were used for the training but were selected purposively for the examples of errors they contained.
Near the end of the 15-week semester, approximately 10 weeks after the training, teachers were given the child’s already-coded Running Record used in this study, purposively selected for the range of errors it contained, to independently quantify and interpret during class time in a test-like setting. We call this kind of activity a simulation, in that the teachers were asked to score and interpret the coded Running Record independently, much as they would in practice. This simulation was part of the professional development curriculum for teachers to gain feedback on the accuracy of their scoring decisions, and it came at the end of the semester. By this time, teachers would have been using Running Records three to four times a week in the approximately 10-week span between the workshops and the simulation.
Using an already-coded Running Record allowed us to focus on the items of interest: accuracy with counting errors and interpreting the Running Record. No oral reading of the passage was added to the already-coded Running Record because we did not want to introduce an additional modality to the setting. Moreover, because Running Records are coded in a systematic way, it should be the case that another assessor can score and interpret someone else’s Running Record without hearing the child read it.
The first author’s quantification and interpretation of the Running Record served as the set of correct responses to test the accuracy of the teachers’ scoring and interpretation. The first author has experience and expertise with Running Records, having trained Reading Recovery teacher leaders and trainers in a university setting for 15 years.
Measure
The items on the Running Record that we used to measure teachers’ quantifying accuracy are presented in Table 2. The pre-coded Running Record included 16 items that assessed teachers’ accuracy in quantifying the Running Record. We refer to 4 of these 16 quantifying items as “summary items” because they involve summarizing the Record with final counts and calculations. The summary items include: (1) tally of total errors, and the calculation of (2) percentage accuracy, (3) error ratio, and (4) self-correction ratio. The other 12 items in the quantifying stage required teachers to correctly tally errors and self-corrections on the already-coded Running Record.
Table 2. Quantifying the Running Record: error and self-correction scoring key.
We accepted multiple accuracies as correct. Differences were due to rounding; all possible accuracies fell within the instructional level band.
Four of these 12 items were errors, four were self-corrections, and four were oral reading behaviours that are coded but not counted as errors or self-corrections. They include: (1) subsequent errors at a proper noun or multiple attempts for one word and (2) solving behaviour where the student did not make a substitution and eventually reached the solution independently (such as reading “st-, still” for the word still). Thus, for this study, we coded 1,824 quantifying items as “correct” or “incorrect”.
We used a second set of 17 items to assess teacher accuracy in interpreting the sources of information used and neglected. Five items required teachers to interpret single incorrect attempts, some of which are not included in the error tally but are analyzed, nevertheless. Another six items required a two-step analysis, including two attempts for the same word, and four self-corrections. As explained earlier, each attempt should be analyzed in a two-step process to show what source of information the student probably used at first and then what the student probably added to make a second attempt. The 17 items and their acceptable interpretations are displayed in Table 3; four items had more than one possible interpretation. In total, we coded 1,921 interpreting items from 113 teachers’ Running Records as “correct” or “incorrect” (one teacher’s responses for interpreting sources of information on the Running Record were not legible). In all, there were 3,745 items used in the analysis.
Table 3. Interpreting the Running Record: sources of information scoring key.
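The item counts reported in this section follow directly from the number of teachers and items in each analysis. The short check below reproduces that arithmetic; the variable names are ours and are used only for illustration.

```python
# Quantifying: 114 teachers, 16 items each.
quantifying_teachers = 114
quantifying_items = 16
quantifying_total = quantifying_teachers * quantifying_items

# Interpreting: 113 legible records; 5 single attempts plus
# 6 two-step items (an initial and a subsequent attempt each).
interpreting_teachers = 113
interpreting_items = 5 + 6 * 2
interpreting_total = interpreting_teachers * interpreting_items

grand_total = quantifying_total + interpreting_total

if __name__ == "__main__":
    print(quantifying_total)   # 1824
    print(interpreting_items)  # 17
    print(interpreting_total)  # 1921
    print(grand_total)         # 3745
```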
Analysis
We used descriptive statistics, including raw frequencies and percentages, to tabulate correct and incorrect responses for each teacher quantifying and interpreting the Running Record (research questions 1 and 2). We made a deliberate decision when entering data to treat unanswered questions as incorrect responses, much as an unanswered item on a test would be scored.
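The dichotomous scoring described here, including the treatment of blanks as incorrect and the acceptance of more than one correct answer for some items, could be sketched as follows. The function names, the response format (letters such as "MS"), and the item labelled "X" are hypothetical; only the interpretation of Item 15 as meaning and structure comes from the article.

```python
def score_item(response, acceptable):
    """Return 1 if the response matches an acceptable answer, else 0."""
    if response is None or response.strip() == "":
        return 0  # unanswered items are treated as incorrect
    return 1 if response.strip().upper() in acceptable else 0


def score_teacher(responses, key):
    """Score every item in the key; items missing from responses score 0."""
    return {item: score_item(responses.get(item), answers)
            for item, answers in key.items()}


# Illustrative key: item "X" and its acceptable answers are invented.
key = {"15": {"MS"}, "X": {"MS", "MSV"}}
teacher = {"15": "ms", "X": ""}
scores = score_teacher(teacher, key)

if __name__ == "__main__":
    print(scores)
    print(sum(scores.values()), "of", len(scores), "correct")
```

Storing each item's acceptable answers as a set keeps the comparison simple when an item has more than one acceptable interpretation.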
To measure which items were easier or more difficult for teachers to complete accurately (research question 3), we used Rasch measurement (Rasch, 1960) to transform teachers’ dichotomous scores (correct/incorrect) for each item into linear units (logits), thus developing a scale of scoring accuracy. Item difficulties and teacher scores were calibrated simultaneously and mapped onto the same scale. The sample of 114 teachers was adequate to produce stable item calibrations derived from a Rasch analysis (Linacre, 1994). Scores were rescaled so that item difficulties and teacher scores would range from 0 to 100, where 0 represents the easiest items for the teachers to score accurately while 100 represents the most difficult items. Separate models were generated for the quantification and interpretation analyses. Winsteps software version 3.68.2 (Linacre, 2009) was used for the analysis.
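For readers less familiar with the model, the dichotomous Rasch model expresses the probability of a correct response as a logistic function of the difference between a person measure and an item difficulty, both in logits. The sketch below shows that function together with one possible linear rescaling to a 0 to 100 range. The rescaling shown (a min-max mapping) is an assumption made for illustration; the article does not report the exact transformation applied in Winsteps, and the function names are hypothetical.

```python
import math


def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct response for person measure theta and
    item difficulty b (both in logits) under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def rescale(logits, low=0.0, high=100.0):
    """Linearly map logit measures onto [low, high].

    Assumption: a simple min-max mapping, used here only to illustrate
    that the rescaling is linear and preserves order and spacing.
    """
    smallest, largest = min(logits), max(logits)
    span = largest - smallest
    return [low + (x - smallest) * (high - low) / span for x in logits]


if __name__ == "__main__":
    # A teacher whose measure equals the item difficulty has a 50% chance.
    print(rasch_probability(0.0, 0.0))
    # A more accurate teacher has a higher chance on the same item.
    print(rasch_probability(1.0, 0.0) > rasch_probability(-1.0, 0.0))
    print(rescale([-2.0, 0.0, 2.0]))
```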
The Rasch model assumes there is one dimension to be measured, and for the model to fit properly, the items have to fit together in an expected manner. If the items measure one dimension, teachers with overall greater accuracy should have a greater probability of getting each item correct than teachers with less accuracy, regardless of item difficulty. Items may misfit if this expected pattern does not occur, which may indicate that an item is not measuring the same construct as the items that do follow the expected pattern.
To investigate misfits, we examined each item’s mean square outfit error value. We flagged items with mean square outfit error values that were outside the expected range of 0.5 to 1.5 (Linacre, 2019). Teachers’ scores were then compared across the two analysis types: quantifying and interpreting. After outputting teachers’ scores on each Rasch measure, a paired samples t-test was used to determine whether teachers’ scores differed between the two tasks.
To investigate items that we flagged for misfit, we analyzed each item’s person residuals from the Rasch analyses. These values indicate how unexpected individual responses were. For example, if a respondent with less overall accuracy correctly answered a difficult question, the residual for this observation would be large. By investigating Rasch person residuals for misfitting items, we gained insights into the patterns underlying the most unexpected responses.
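The residual and outfit logic described above can be illustrated with a small sketch. It uses the conventional definitions (a standardized residual is the observed score minus the model probability, divided by the model standard deviation, and outfit is the mean of the squared standardized residuals); these formulas are supplied here as background and are not quoted from the article. The function names and the toy inputs are hypothetical.

```python
import math


def standardized_residual(observed: int, probability: float) -> float:
    """Standardized residual for one dichotomous response."""
    variance = probability * (1.0 - probability)
    return (observed - probability) / math.sqrt(variance)


def outfit_mean_square(responses, probabilities) -> float:
    """Mean of squared standardized residuals across persons for one item."""
    squared = [standardized_residual(x, p) ** 2
               for x, p in zip(responses, probabilities)]
    return sum(squared) / len(squared)


def is_misfit(outfit: float, low: float = 0.5, high: float = 1.5) -> bool:
    """Flag outfit values outside the 0.5 to 1.5 range used in the article."""
    return not (low <= outfit <= high)


if __name__ == "__main__":
    # A teacher with only a 10% expected chance answers correctly:
    # the response is unexpected, so the residual is large.
    print(standardized_residual(1, 0.1))
    print(outfit_mean_square([1, 0], [0.5, 0.5]))
    print(is_misfit(1.0), is_misfit(2.48))
```

Because residuals are squared, a handful of very unexpected responses can push an item's outfit well above 1.5, which is why inspecting individual residuals helps explain a flagged item.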
Results
Research question 1: Quantifying the Running Record
Slightly less than half of the teachers (45.6%, n=52) correctly counted the number of errors (four) on the Running Record; an additional 24% (n=27) gave adjacent responses (counting three or five errors). Fifty per cent (n=57) of teachers correctly calculated the percentage of accuracy; an additional 10.6% were not accurate but in the right range, meaning that together, 60.6% of teachers would have correctly concluded that the student read the text with instructional level accuracy. Slightly more than one third of teachers calculated accuracies that were in the wrong band; their results were in either the hard range (23%) or the easy range (16%) instead of the instructional range. See Figure 1 for a summary of the accuracy bands that teachers calculated.

Figure 1. Teachers’ responses to accuracy rate.
Just 4% (n=5) of teachers calculated the error ratio correctly. Another 19% (n=22) of teachers did not calculate the error ratio at all and left that question blank. More than two-thirds of the teachers (n=81, 71%) correctly calculated the self-correction ratio. This higher rate of correctly calculating the self-correction ratio, compared to correctly counting errors, is possible because three different counts of errors (three, four, or five) would have led to the same self-correction ratio due to rounding.
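The rounding effect can be checked with a short calculation. The sketch assumes the conventional self-correction ratio of 1 : (errors + self-corrections) / self-corrections, rounded to the nearest whole number; that formula is not spelled out in this article and is stated here as an assumption, as is the function name. The four self-corrections and the correct count of four errors come from the scoring key described earlier.

```python
def self_correction_ratio(errors: int, self_corrections: int) -> int:
    """Return n for a ratio of 1:n, assuming the conventional formula
    (errors + self-corrections) / self-corrections, rounded."""
    return round((errors + self_corrections) / self_corrections)


if __name__ == "__main__":
    self_corrections = 4  # from the scoring key
    for errors in (3, 4, 5):
        n = self_correction_ratio(errors, self_corrections)
        print(f"{errors} errors -> 1:{n}")
```

Under this assumption, counts of three, four, or five errors give unrounded values of 1.75, 2.0, and 2.25, all of which round to a ratio of 1:2.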
Research question 2: Interpreting the Running Record
We examined interpretations in two steps: first looking at how each initial attempt was interpreted, and then at what information was added to each subsequent attempt. Table 4 provides the distribution of interpretation responses for each initial item attempt, and Table 5 displays interpretations of subsequent attempts.
Table 4. Accuracy in interpreting the Running Record: initial attempts (n=113*).
*One teacher’s analysis was illegible.
Table 5. Accuracy in interpreting the Running Record: second attempts (n=113*).
*One teacher’s analysis was illegible.
Interpreting initial substitutions
Accuracy for each of the 11 items ranged from 30% to 85%, with half of those greater than 70% accuracy. Four items involved alternative substitutions for the same proper noun, Jim, in this order: Jime, Jack, Jay, James. Accuracy in interpreting the first proper noun substitution was high at 75.2%, but then fell to 33.6%, 30.1%, and 30.1%, respectively, for the next three. Concurrently, the percentage of “not analyzed” rose from 10.6% to 43.4%, 51.3%, and 45.1%, respectively, across those four items. Otherwise, the percentages for not analyzing remained low for the rest of the errors, ranging from 1.8% to 16.8%.
Interpreting subsequent substitutions
Table 5 presents the distribution of interpretations of subsequent attempts. As summarized earlier in Table 1, in the case of subsequent attempts, a two-step analysis is used to identify which source of information the child most probably added (Clay, 2019: 61, 72; Fried, 2013: 7). When the subsequent attempt was a self-correction (Items 6b, 7b, 9b, 13b), accuracy in interpreting what source was added was as high as 81.5%. For Items 8b and 12b, however, when the subsequent attempt was another error, accuracy fell to 20.1% and 6.1%, respectively, while at the same time the percentage of “not analyzed” rose to 54.0% and 80.5%. Moreover, when the subsequent attempt was interpreted, many teachers made the mistake of selecting multiple sources of information instead of one.
Research question 3: Which items were easy or hard to complete accurately?
Before investigating the relative difficulty of the items, we examined item fit to the Rasch model. Item outfit statistics mostly fell within the acceptable range, except for four items in the sources of information analysis (6a, 11, 7a, 16) and five items in the errors/self-correction analysis (2, 5, 11, 12, 13). The error ratio showed the highest outfit by far (mean square outfit error=8.6), which indicates that this item degrades the measurement system. Other outfit statistics for misfitting items fell either between .39 and .43 or between 1.56 and 2.48.
Item difficulty: Quantifying
The four summary items, which required teachers to quantify the Running Record, were among the most difficult items to complete correctly (see Table 6), and of all the summary items, error ratio was by far the most difficult (Rasch measure=94.6). Counting errors and calculating the percentage of accuracy were also difficult (measures=57.4 and 55.3, respectively). The easiest summary item was calculating the self-correction ratio (measure=47.9).
Table 6. Quantifying the Running Record: accuracy and order of item difficulty.
*Items are ordered from easiest to hardest.
Because Rasch analysis calibrates item difficulty and person ability on the same scale, it is possible to compare item difficulty against average teacher accuracy. Average item difficulty for summary items (M=63.8, SD=18.1) exceeded average teacher accuracy (M=57.7, SD=17.2).
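The reported mean and standard deviation for the summary items can be reproduced from the four Rasch measures given above (94.6, 57.4, 55.3, and 47.9). The check below uses Python's standard `statistics` module. Note that the reported SD of 18.1 matches the population formula rather than the sample formula; this is an observation from the arithmetic, not something stated in the article.

```python
import statistics

# Rasch measures for the four summary items, as reported in the text:
# error ratio, error count, accuracy percentage, self-correction ratio.
summary_measures = [94.6, 57.4, 55.3, 47.9]

mean_difficulty = statistics.mean(summary_measures)
population_sd = statistics.pstdev(summary_measures)
sample_sd = statistics.stdev(summary_measures)

if __name__ == "__main__":
    print(f"mean = {mean_difficulty:.1f}")           # mean = 63.8
    print(f"population SD = {population_sd:.1f}")    # population SD = 18.1
    print(f"sample SD = {sample_sd:.1f}")            # sample SD = 20.9
```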
For the remaining 12 quantifying items, item difficulty measures ranged from 8.4 to 50.1 (SD=13.7), with higher measures indicating that the item was more difficult to analyze correctly. There were several meaningful patterns observed in the item difficulty measures yielded by Rasch analysis; see Figure 2 for a visualization of quantifying item difficulties. Item difficulties tended to cluster by types of errors. For example, Items 10, 12, and 16 were all difficult (50.1, 48.7, and 46.2, respectively). These three items were attempts at proper nouns that should not have been counted as errors. Similarly, self-correction items were very easy, with difficulty measures ranging from 16.0 to 20.5; in fact, three self-correction items (6, 7, and 9) shared an item difficulty of 16.0. The only item answered correctly by all teachers, Item 15 (lick/keep), was counted by everyone as an error.

Figure 2. Map of Rasch item difficulties for quantifying the running record. M: mean; S: standard deviation; T: two standard deviations. “#” = 3 teachers, “.” = 1–2 teachers.
Item difficulty: Interpreting
Item difficulty measures ranged from 27.0 to 85.0 (SD=15.1). Item difficulties for analysis of sources of information were also clustered around specific oral reading behaviours; see Figure 3 for a visualization of interpreting item difficulties. The hardest items to interpret, 12b and 8b, were both incorrect second attempts (Rasch measures=85.0 and 68.1, respectively). These were followed by three proper noun attempts: 12a, 16, and 10 (measures=61.3, 61.3, and 59.1, respectively). See Table 7 for interpreting item difficulties.

Figure 3. Map of Rasch item difficulties for interpreting the running record. M: mean; S: standard deviation; T: two standard deviations. Each “#” represents 2 teachers and each “.” represents 1 teacher.
Table 7. Interpreting the running record: teacher accuracy and order of item difficulty.
*Items are ordered from easiest to hardest.
Item 15, which was the easiest item to analyze for errors and self-corrections, was also the easiest item for sources of information. The substitution of the word “lick” for “keep” was not only a straightforward error, but it was also clear to most teachers that the attempt was influenced by meaning and structure but not the letters (visual information). Up to the point of the error, “Duke could lick…” makes sense and sounds right, but it is unlikely that the graphemes on the page (k/e/e/p) influenced the student to say “lick”. While Item 15 showed a floor effect for errors and self-correction analysis (i.e. no teacher responded incorrectly), most but not all teachers (85%) analyzed the item correctly for sources of information.
A paired t-test was conducted to compare teachers’ performance on analyzing first attempts versus second attempts. There was no significant difference between teachers’ performance on first attempts and on second attempts (t(112)=-.395, p=.693). However, visual inspection showed that the two items involving subsequent attempts that were still incorrect (8b, 12b) were also the two most difficult items for teachers to interpret.
Item difficulty: Quantifying versus interpreting
A paired t-test was performed to compare teachers' performance on analyzing errors and self-corrections with their performance on identifying sources of information. There was a significant difference between teachers' scores on the two tasks (t(112)=-4.26, p<.001). On average, teachers scored 7.98 points higher on quantifying the Running Record than they did on the optional second level of analysis, interpreting it.
Discussion
The purpose of our study was to investigate the accuracy of teachers’ (n=114) scoring of an already-coded Running Record. Previous Running Record reliability studies have focused on consistency of scoring but neglected to consider accuracy as a source of measurement error. Running Record analyses are highly technical, and the stakes may be high. Given the wide use of Running Records and the important decisions that are often made on the basis of the results, it is important to know not only how reliable results can be after a reasonable training period, but also which items are difficult to score with accuracy; such information can inform training and improve the reliability and validity of results.
Sources of difficulty
When we examined accuracy, we found a low level of precision with counting errors, calculating percentage of accuracy, and calculating error ratio. We also found low levels of accuracy when teachers interpreted sources of information used. We discuss these findings next and share why the somewhat low levels of accuracy are localized, understandable and easily addressed in training.
Counting errors with proper nouns
According to the scoring protocol, multiple different errors for the same proper noun are tallied as just one error (Clay, 2019: 67); however, many teachers in our study counted every substitution for Jim as an error, resulting in a count of three errors, instead of just one. The incorrect counting of that one type of error resulted in a lower percentage of accuracy (and a different accuracy band) and affected calculation of the self-correction ratio. This confusion with proper nouns might seem trivial and unlikely to arise often; yet many little books used with beginning readers include the names of characters that reappear throughout a story (as was the case in the story we used). On a positive note, inaccuracy in counting errors was not random but mostly confined to one type of error dealing with proper nouns and this issue can easily be addressed in training.
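The proper noun convention described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the scoring protocol itself: the `Miscue` record, the function names, the assignment of particular substitutions to the three attempts at Jim, and the running-word total of 100 are all hypothetical and chosen only to show how the rule changes the tally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Miscue:
    text_word: str             # word printed in the text
    attempt: str               # what the student said
    proper_noun: bool = False  # True when the text word is a name


def count_errors(miscues):
    """Tally errors, counting repeated misreadings of one proper noun once."""
    seen_proper_nouns = set()
    total = 0
    for miscue in miscues:
        if miscue.proper_noun:
            if miscue.text_word in seen_proper_nouns:
                continue  # later errors on the same name are not tallied again
            seen_proper_nouns.add(miscue.text_word)
        total += 1
    return total


def percent_accuracy(running_words, errors):
    """Share of running words read correctly, as a percentage."""
    return 100 * (running_words - errors) / running_words


# Hypothetical excerpt: three different substitutions for Jim plus one other error.
sample = [
    Miscue("Jim", "Jime", proper_noun=True),
    Miscue("Jim", "Jack", proper_noun=True),
    Miscue("Jim", "Jay", proper_noun=True),
    Miscue("keep", "lick"),
]

protocol_count = count_errors(sample)  # proper noun counted once
naive_count = len(sample)              # every substitution counted

RUNNING_WORDS = 100  # hypothetical total, for illustration only
print(protocol_count, percent_accuracy(RUNNING_WORDS, protocol_count))
print(naive_count, percent_accuracy(RUNNING_WORDS, naive_count))
```

In this hypothetical excerpt, counting every substitution for the name doubles the error tally, and the percentage of accuracy falls accordingly.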
Interpreting errors with proper nouns
We also found that teachers interpreted only the first substitution (Jime) for the name Jim. We assume that some teachers knew that only the first substitution for a proper noun counted as an error, and then over-generalized the rule and interpreted only the first substitution Jime. In our simulation, teachers who did not interpret each error of the proper noun Jim may have overlooked helpful information about the sources of information the student was noticing and using with each attempt, and what he was still neglecting. The student seemed to know the word must be a name to make sense, and that it would start with the phoneme /j/, but he did not know how to use the rest of the letters in the word. By interpreting each successive attempt to solve the proper noun Jim, we might infer that the student is using more and more of the letters in the word while trying to keep the meaning intact.
We note that outside a simulation such as this, a teacher would have more than one Running Record to inform instruction and any conclusions we draw here from one Running Record are tentative. Nevertheless, agreement on interpreting sources of information is important and reasonable to expect. Despite the relative difficulty of the item, inaccuracies in interpreting were not random. Instead, they were centred on a specific misunderstanding that can easily be addressed in professional development.
Analyzing the use of letters (visual information)
The teachers in our study had slightly better accuracy interpreting sources of information when only meaning and structure could be said to influence the error (lick/keep, was/liked, his/Jim’s, hold/shake); they were correct 74% of the time, which is quite impressive. When visual information was implicated, however (dog/big, Jime/Jim, for/far, can/could, Jack/Jim, Jay/Jim, James/Jim), their accuracy dropped to 53%.
Accuracy analyzing visual information when it was added to self-correct an error was better. On average, teachers correctly analyzed 68.6% of the time that the student probably added more visual information to self-correct can/could, was/liked, dog/big, and for/far. When the second attempt did not result in a self-correction, however (stay/shake and James/Jim), many teachers (54% and 80%, respectively) did not analyze the second substitution at all and therefore missed two occasions when the student was attempting to add more visual information to the first attempt.
This overall lower accuracy when analyzing visual information is somewhat concerning, given the critical role that learning how to use letters and words plays in learning to read (Adams, 2013). It is reasonable to work towards a high level of accuracy in analyzing the use of visual information if Running Record results are to reliably inform instruction.
Analyzing the source of information added on a subsequent attempt
A slightly more concerning source of inaccuracy was that many teachers identified multiple sources of information on the second step instead of identifying a single source of information. The direction for two-step analysis is to consider what extra information is added that was not present in the first attempt. Using a two-step analysis in this way, inferring what was used initially and what was added, is meant to reflect the steps in the student’s processing of information and thus to better inform instruction. Moreover, this two-step process is contained in the protocol; deviations from the protocol are sources of error which degrade agreement across assessors and probably yield different directions for instruction.
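The logic of the two-step analysis can be expressed as a simple set difference: record the sources of information consistent with the first attempt, then record only what the second attempt adds. The Python sketch below is an illustration under stated assumptions; the function name and the set-based representation are hypothetical and are not part of the protocol. The example uses the student's attempts at the word shake ("hold", then "stay"), with the first attempt treated as influenced by meaning and structure and the second as additionally incorporating visual information, as described in this article.

```python
MEANING, STRUCTURE, VISUAL = "M", "S", "V"


def two_step_analysis(first_attempt_sources, second_attempt_sources):
    """Return (sources used initially, sources newly added on the second attempt)."""
    first = frozenset(first_attempt_sources)
    second = frozenset(second_attempt_sources)
    added = second - first  # only information not already present in step one
    return first, added


# "hold" for shake: meaning and structure; "stay" for shake: letters now noticed too.
first_step, second_step = two_step_analysis(
    {MEANING, STRUCTURE},
    {MEANING, STRUCTURE, VISUAL},
)

print(sorted(first_step))   # ['M', 'S']
print(sorted(second_step))  # ['V']
```

Recording all three sources again at the second step, rather than only the added one, is the kind of deviation described above: it obscures what changed between the two attempts.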
Improving the quality of training to yield more reliable results
Scoring and interpreting a Running Record is complex work; its fine-grained analysis, compared to an IRI for example, provides feedback to inform teaching on a daily or weekly basis during a period of fast growth as young children are learning to read. The teachers in our study were newly-trained. Their accuracy was quite good and where there were problems, they were very specific, could easily be clarified, and would probably result in large improvements in future accuracy. We reflect in this section on what we learned from our study about our Running Record training and share our plans to improve for the future.
Providing rationales alongside rules
The high number of second attempts left unanalyzed was confined to second attempts that were still errors; teachers had high accuracy when the second attempt was a self-correction. Clearly, teachers could carry out a two-step analysis; they simply did not know they should do so when the second attempt was another error instead of a self-correction.
We plan to provide a clear rationale for using a two-step approach to interpreting subsequent errors. We will add the explanation that multiple attempts that do not result in a self-correction are examples of what Clay refers to as cross-checking, “a primitive form of self-correcting” (Clay, 2013: 136), in that the reader is (unsuccessfully) checking one source of information against another. The student on our Running Record tried “hold” and then “stay” for the word shake, suggesting that the student noticed the letters in the word and tried to incorporate them in a second attempt to better fit all sources of information. The two-step analysis maps neatly onto the successive processing that we infer the student is engaged with while trying to solve the word “shake”.
Knowing that self-corrections are simply multiple attempts that result in the correct word and knowing that the two-step analysis maps onto the processing that we assume is taking place, teachers will have a theoretical rationale for interpreting each attempt to solve a word. Given the likelihood that children will make more than one attempt at an unknown word while problem-solving, and given the added information gained from interpreting each attempt, we think this emphasis in training will be very worthwhile.
Removing the error ratio calculation
In contrast to the items that were omitted, the hardest item in quantifying the record (calculating the error ratio) was attempted by most teachers: only 22 teachers omitted this item, while 87 of those who attempted it calculated it incorrectly. Thus, in this case, we think the item was hard because the teachers were uncertain about how to calculate the error ratio, not because they thought they did not have to do it.
Going forward, we think it best to omit the step to calculate an error ratio when learning how to quantify a Running Record. The information it provides is redundant with the information provided by the percentage of accuracy and holds no practical meaning. Moreover, the formula to calculate error ratio is rather complicated as presented in the manual, and given that the training already includes other (more meaningful) calculations, nothing of practical importance will be lost. Indeed, with one fewer easily confused formula to learn, omitting the error-ratio formula from training might even result in fewer errors when calculating percentage of accuracy.
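The redundancy between the two statistics can be made concrete with a short sketch. It assumes the conventional definitions, which this article does not reproduce: the error ratio as running words divided by errors (reported as 1:N), and the percentage of accuracy as the share of running words read correctly. Under those assumptions, each statistic can be derived from the other. The function names and example counts are hypothetical.

```python
from fractions import Fraction


def error_ratio(running_words, errors):
    """N in the ratio 1:N (assumed definition: running words / errors)."""
    return Fraction(running_words, errors)


def percent_accuracy(running_words, errors):
    """Percentage of running words read correctly (assumed definition)."""
    return Fraction(100 * (running_words - errors), running_words)


def accuracy_from_error_ratio(n):
    """Recover percentage of accuracy from N alone: 100 * (1 - 1/N)."""
    return 100 * (1 - Fraction(1) / Fraction(n))


# Hypothetical counts: 150 running words, 15 errors.
n = error_ratio(150, 15)
print(f"error ratio 1:{n}")                    # error ratio 1:10
print(float(percent_accuracy(150, 15)))        # 90.0
print(float(accuracy_from_error_ratio(n)))     # 90.0
```

Exact fractions are used so that the two routes to the percentage can be compared without rounding. The ratio is undefined when there are no errors, a special case the percentage does not share.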
Using more simulations
As we reflect on the design of our training, we see the enormous value contained in providing teachers with a simulation exercise. The simulation exposed for us gaps in the training we were providing to the teachers. In the future, we think it will be a good idea to use multiple simulation exercises and to do so early on, rather than waiting ten weeks after initial training. Additional Running Records can purposively be identified or created to focus on particular scoring or interpreting conventions that we know to be hard to do correctly.
We would also want to craft simulation exercises that provide explicit practice in interpreting visual information and to be specific about what parts of visual information (initial, medial or final letters, for example) are being used and neglected. We think that conducting a finer grained analysis of visual information in this way will not only sharpen skill in interpreting visual information, but also provide more specific guidance as to what to teach. With everyone working independently to quantify and interpret an already-coded Running Record, teachers receive authentic feedback about what they already know how to do and what they need to refine.
In sum, while it may seem that the assessment is far too complicated to score correctly, we note that the teachers had a very good degree of accuracy in quantifying and interpreting, and that there were just a few difficult items that can be easily resolved. Moreover, the second level of analysis, interpreting, which was rather more complex, is an optional one. Classroom teachers may choose to carry out only the first level of analysis to gain information about a student’s progress in reading.
Limitations
As with any study, there are limitations that should be considered to help guide future research and interpret the present findings. It is important to note that we used only one Running Record. Had we used more than one, our findings about what teachers found difficult to complete correctly might be different. Yet a challenge for researchers is that even one Running Record yields many data points. Our data set, with 37 items to score and 114 teachers, yielded roughly 4,000 responses to analyze. Nevertheless, it would be useful if a future study used more than one Running Record; adding another would no doubt yield more information and add to our understanding of teacher precision and thus the reliability of results.
It is also important to think about the generalizability of the findings. The teachers in our study had just completed their training in how to score Running Records. It is likely that more experienced administrators would be more skilled with counting errors and interpreting sources of information used. Alternatively, a study of more experienced assessors might reveal “drift” from standard scoring and interpreting of Running Records; if so, that finding would also be informative.
In sum, we think a future study using more experienced assessors and involving more than one Running Record might yield additional useful information about the reliability of scores derived from analyzing a Running Record. Such a future study could also include teacher interviews to learn how the results would inform what to emphasize next in instruction, based on their interpretation of the Running Record.
Conclusion
Assessors who use standardized assessments always require particular training and a certain amount of precision if reliable results are to be obtained, and the same is true for Running Records, an oral reading assessment said to contain all the qualities of a sound standardized assessment (Clay, 2013: 12). Our study did not examine agreement in coding oral reading behaviour; earlier studies have examined that question. We did find that about one-third of the teachers incorrectly calculated the percentage of accuracy to such a degree that their results were in the wrong accuracy band. The difficulty experienced with calculating percentage of accuracy correctly is relevant, given that Running Records are often used to inform important decisions. That said, the lack of agreement about the percentage of accuracy was largely due to not knowing how to count proper nouns, a misunderstanding revealed here that can be easily addressed in training.
The assessors in our study found it difficult to correctly interpret sources of information used and neglected, particularly when a student made more than one attempt at a word. Perhaps more importantly, they were less accurate when analyzing the student’s use of letters (visual information). This lack of agreement is concerning given that many would agree with Stanovich (1980), that beginning readers, especially those who are struggling, need to learn how to use print to decode words efficiently and well.
While we conclude that, as an assessment tool, Running Records can provide a reliable written record of a student’s oral reading that teachers can use to inform instruction, our study also reveals common sources of error in scoring. Knowing that assessors might find these items difficult to complete accurately informs training not only of teachers but of those who train teachers to administer the assessment. Future research might also address whether a simpler set of judgements, such as removing the calculation of the error ratio, or adding specificity to which errors are interpreted, might yield more reliable results.
Finally, if it is the case that teacher knowledge about sources of information used or neglected matters to student progress (Rodgers et al., 2016; Fitzharris et al., 2008), then it is essential that we are careful when quantifying and interpreting a Running Record. Precision and accuracy in scoring, just as with any assessment, will contribute to more reliable results that can better inform beginning reading instruction.
