Abstract
Data from 403 third graders were analyzed to determine relative and combined efficacy of group-administered Curriculum-Based Measures (CBMs) and Teacher Rankings of student reading and math performance taken early in the school year to predict end-of-year achievement scores. Teacher Rankings added to the power of CBMs to predict reading (R2 change = .18) and math (R2 change = .22). Combined CBMs and Teacher Rankings predicted at-risk status in reading (82%) and math (86%), based on logistic regression, and yielded strong area under the curve (AUC) statistics, defining risk status .88 (reading) and .82 (math). Surprisingly, Teacher Rankings yielded higher correlations with end-of-year scores than CBMs. Findings support using rankings as a simple, efficient strategy to add to the predictive power of CBMs readily available within a response to intervention (RTI) context and depict a methodology school personnel can use to determine the relative/combined predictive power of CBMs and rankings. Of note, predictions based on Teacher Rankings vary across end-of-year performance levels.
Keywords
Since the passage of No Child Left Behind (NCLB; 2002) legislation and the Individuals with Disabilities Education Improvement Act (IDEIA; 2004), educators increasingly rely on assessment data to evaluate/monitor student achievement (Hosp & Ardoin, 2008; Turse & Albrecht, 2015). Both high-stakes assessments and curriculum-based measures (CBMs) occur within implementation of the Response to Intervention (RTI) model, and student achievement is routinely analyzed each academic year per standardized end-of-year state achievement testing. Typically, CBMs serve as universal screeners and progress monitoring tools to gauge the suitability of instructional strategies (Ball & Christ, 2012; Glover & Albers, 2007). In addition, CBMs are sometimes employed to predict academic success reflected in high-stakes end-of-year test scores (Deno, 2003; Merino & Beckman, 2010). Teachers’ judgments have also proven to be useful in evaluating students’ performance as several researchers have found moderate to strong correlations between various teacher rating methods and end-of-year assessment scores (Coladarci, 1986; Demaray & Elliott, 1998). In addition, there is some support for the relative effectiveness of Teacher Rankings (ranked estimates of highest and lowest performers) in predicting students’ achievement test outcomes (Ebbesen, 1968; Luce & Hoge, 1978). Although the findings in support of teacher judgments and rankings are encouraging, the results are not uniformly positive for use of judgments (Südkamp, Kaiser, & Möller, 2012) or rankings (Eckert, Dunn, Codding, Begeny, & Kleinmann, 2006), and more research is needed. This study was designed to provide additional information regarding the value of rankings, and specifically to determine the relative and combined efficacy of CBMs and rankings, two readily available operationalizations of academic performance.
Teacher Rankings
As mentioned above, CBMs are not the only tools that educators use in identifying at-risk learners and/or predicting their performance on statewide tests—teacher-based rating scales and ranking may be used as well. For example, fourth-grade teachers predicted which of their students would score in the top and bottom quartiles of the Iowa Tests of Basic Skills (ITBS; Riverside Publishing, 1985) and the Degrees of Reading Power (DRP; College Board, 1983) test with a high degree of accuracy, that is, within 60% to 70% accuracy (Gaines & Davis, 1990). And teachers’ rankings for overall reading ability were correlated significantly with students’ scores on the Wheldall Assessment of Reading Passages (r = .73; WARP; Wheldall, 2013) according to Madelaine and Wheldall (2005). Teachers and other school-based professionals might also use the rating/ranking data they produce to inform classroom decisions. For example, 75.7% of school psychologists who were asked to consider the assessment methods they had used in their last 10 cases reported using rating scales, checklists, or questionnaires with teachers and parents (Shapiro & Heick, 2004). Use of this format ranked higher than other forms of assessment. Teacher rating scales are versatile and may be used to rate students’ social skills, problem behaviors, and/or academic competence, often leading to predictions of risk status and may provide information useful for planning interventions. For example, consider the Social, Academic, and Emotional Behavior Risk Screener–Teacher Rating Scale (SAEBRS-TRS), a universal screening instrument that operationalizes risk for social, emotional, and behavioral problems for students in Grades K through 12 (Kilgus, von der Embse, Allen, Taylor, & Eklund, 2018). 
Classroom teachers are instructed to complete the scale for each student electronically in approximately 1 to 3 min using a 4-point Likert-type scale to rate each student’s positive and negative traits observed during the prior month.
While the use of teacher rating forms is common, the teacher ranking method has some unique benefits. In particular, the Teacher Ranking format is efficient, typically requiring approximately 10 to 15 min per class. In addition, rankings have the potential to inform prediction of high-stakes test scores, and may be more accurate than ratings because they require a forced choice. Although both measures may serve as quick and accurate predictors, K. D. Hopkins, George, and Williams (1985) speculated that ratings might mask small but discernible differences (because more than one student can receive the same score) that are revealed when each student must receive a distinct rank. Similarly, in a synthesis of 16 teacher judgment–related studies, Hoge and Coladarci (1989) surmised that Teacher Rankings required teachers to apply a greater measure of thought and specificity to their judgments than did teacher ratings. For example, comparing and contrasting each student’s performance relative to his or her peers when predicting achievement scores can require more acute discrimination than when merely assigning a point value based on a rating scale. Their study also revealed a stronger correlation between standardized achievement test scores and Teacher Rankings (r = .76) in comparison with the more commonly used teacher rating method (r = .61).
Hopkins and colleagues also examined the relative value of teacher ratings and teacher rankings as a means of determining the concurrent validity of the Comprehensive Test of Basic Skills (CTBS; CTB/McGraw-Hill, 1981). The Teacher Ranking process required that teachers use a class roster to rank their students’ reading capacities in order of “best to poorest.” Teacher ratings were presented in a Likert-type-scale format, operationalizing the academic achievement of each student in subject areas measured on the CTBS. Although both measures were significantly related to standardized reading (CTBS) test scores, Teacher Rankings reflected “significantly higher” concurrent validity coefficients than did teacher ratings.
Begeny, Krouse, Brown, and Mann (2011) further considered the value of teacher ranking and rating measures in their study evaluating the accuracy of teachers’ judgments. Teachers were asked to evaluate their students’ oral reading fluency (ORF) abilities on the nine-item Teacher Rating Scale of Reading Performance (TRSRP; Begeny, Eckert, Montarello, & Storie, 2008). They were also asked to use a classwide percentile-ranking chart to rank order each of their students’ oral reading performance on the Dynamic Indicators of Basic Early Literacy Skills (DIBELS) Oral Reading Fluency (DORF) measure in relation to how they compared with other participating students in the same class. Teacher judgments were then correlated with the students’ DORF scores, and, while moderate, the relation between students’ actual percentile ranks and estimated teacher rankings (.56, p < .01) was stronger than the relation between students’ words correct per minute on grade-level DORF material and ratings of their reading skills using the teacher rating scale (.43, p < .01).
CBMs
CBMs are sometimes described as brief, psychometrically sound measures that provide quick glimpses into instructional effectiveness and the need for additional remedial learning assistance (e.g., Espin, Wallace, Lembke, Campbell, & Long, 2010). In addition, CBMs may be used to predict success on end-of-year standardized assessments (Deno, 2003), particularly for reading.
Reading
From the beginning, proponents of CBMs have prioritized strategies to enhance their psychometric properties, including their ability to predict end-of-year standardized test scores (Goetze & Burkett, 2010; Kettler, Glover, Albers, & Feeney-Kettler, 2014). For example, Hunley, Davies, and Miller (2013) found a large correlation between outcomes on a CBM for ORF and scores on the Ohio Grade 7 Achievement Tests (.76, p < .001). Similarly, Nese, Park, Alonzo, and Tindal (2011) administered easyCBM (Alonzo, Tindal, Ulmer, & Glasgow, 2006), oral passage reading fluency (ORF), vocabulary (VOC), and multiple choice reading comprehension (MCRC) probes to 1,800 students in the fourth and fifth grades, and examined the relation between these outcomes and the state’s standardized reading measure, the Oregon Assessment of Knowledge and Skills (OAKS; Oregon Department of Education, 2010), a comprehensive measure of general reading skills. As was the case from the Nese et al. study, these CBM measures provided significant predictions when end-of-year achievement test scores were the criterion.
Results from two meta-analyses further inform the nature of the relation between CBM scores and high-stakes test scores. Reschly, Busch, Betts, Deno, and Long (2009) examined the relations between CBM oral reading measures and other standardized tools measuring reading assessment among first- through sixth-grade students. Data from 41 unique studies confirmed several findings: Reading CBMs significantly predict reading scores on statewide tests, individually administered reading CBMs are more significant predictors of statewide test scores than those administered in groups, reading CBMs are strong predictors of standardized reading score outcomes for third graders (with no statistically significant differences in predictive utility between first, second, fourth, and fifth graders), and reading CBMs are a significant predictor of standardized test score outcomes when both are measured within the same academic year. Similarly, from the second meta-analysis of 27 studies, Yeo (2010) reported a large population correlation of .69 between reading-based CBMs and end-of-year reading score outcomes.
Perhaps the most widely used CBMs are DIBELS (Good & Kaminski, 2013) and AIMSweb (2011). Although both measures are characterized by good psychometric properties, they are not without criticism and may not provide the most efficient assessment strategies. To address the need for a more efficient group-administered instrument of reading fluency and comprehension, the Monitoring Instructional Responsiveness: Reading (MIR: R) was developed (Bell, Hilton-Prillhart, McCallum, & Hopkins, 2010; see below for more information about development of MIR: R). K. C. Miller, Bell, and McCallum (2015) examined the relation between third-graders’ MIR: R scores and end-of-year standardized test scores on the reading composite of the Tennessee Comprehensive Assessment Program (TCAP; Tennessee Department of Education, n.d.), a state-required test administered annually to students in third through eighth grades. The MIR: R probes include narrative and expository texts that do not contain capitalization or punctuation and require students to place a slash mark at the end of each sentence to demonstrate their understanding of where one idea begins and ends. This unique format permits administrators to obtain measures of both reading fluency and reading comprehension in a time efficient manner (3 min) using only one instrument, unlike other commonly used measures that may require the administration of multiple subtests to acquire scores in both domains, such as DIBELS Next (Good & Kaminski, 2013) and AIMsweb© (Shinn & Shinn, 2002). Participants’ MIR: R Comprehension Rate scores (Number of ideas correctly identified divided by attempted identification of ideas × 100) were moderately correlated with their TCAP reading performance (.58, p < .01), with sensitivity and specificity percentages of 85 and 52, respectively.
Math
As with informal reading assessment, a number of math-based CBM strategies and a related literature exist. As in reading, researchers have focused on the relation between math-targeted CBMs and end-of-year standardized achievement test scores, but with less consistent results.
Shapiro, Keller, Lutz, Santoro, and Hintze (2006) reported strong correlations (.62 to .69, p < .001) between reading and math CBM data and the Pennsylvania System of School Assessment (PSSA; Pennsylvania Department of Education, 2003) scores for nearly 3,000 elementary-age students, but concluded that the relations between CBMs measuring math computation and math concepts and the PSSA were small to moderate (.07 to .48, p < .001) during the fall assessment period, and medium to large (.50 to .64, p < .001) during the winter and spring screening periods. According to Jiban and Deno (2007), scores on CBMs measuring basic and cloze math facts were moderately correlated with scores on the Minnesota Comprehensive Assessment in Mathematics (MCA-Math; Minnesota Department of Education, 2007) among suburban third and fifth graders, with CBM-based cloze math facts more strongly predictive of MCA-Math scores than the measure of basic math at the fifth-grade level.
Others have reported similar results, with most relations in the medium to large range (e.g., Christ, Scullin, Tolbize, & Jiban, 2008; Helwig, Anderson, & Tindal, 2002), and with a couple of salient conclusions—students from general education classrooms generally obtained higher scores on both the CBM and end-of-year measures than students from special education classroom settings, but still exhibited moderately strong correlations; and solely assessing math computation limits utility to predict overall mathematics achievement.
McCallum et al. (2013) studied the efficacy of a CBM-based screening model intended to identify students who might qualify as twice exceptional (i.e., those who met the criteria for both a learning disability and academic giftedness) using Monitoring Instructional Responsiveness: Math (MIR: M; McCallum, Hopkins, Bell, & Hilton-Prillhart, 2010). MIR: M is a 3-min CBM that mirrors the group assessment format of MIR: R and, similar to MIR: R, it was designed to be efficient. Within a 3-min assessment format, it operationalizes both math calculation and math reasoning skills in elementary-age students. One goal of the study focused on determining the power of the MIR: M to predict students’ membership into two groups: students whose Total Math TCAP scores were in the 84th percentile and above and those whose scores were at or below the 16th percentile. MIR: M predicted those third graders classified in the top performing group with 60% accuracy and those in the bottom performing group with 72% accuracy (see below for more details regarding development of and the psychometric properties of MIR: M).
Rationale for the Study
Educators are motivated to seek methods of determining which of their students are at risk for potential failure on high stakes achievement tests. Early and efficient identification of these students should enable teachers to prioritize interventions for those who need it most. Given that CBMs are typically available to teachers already within RTI (or multitiered systems of support), models based on these measures can provide efficient prediction based on the literature explicating the relation between CBMs and end-of-year scores. Thus, the criterion of efficiency is met. However, could prediction be improved by adding to the equation another efficient data gathering strategy? Based on the literature, teachers’ rankings of students’ academic success within their classrooms have the potential to predict later success on standardized measures. Therefore, perhaps both CBMs and Teacher Rankings can offer useful information regarding at-risk status. Given that both have the potential to inform risk status, educators may be curious about their relative predictive power. For example, do CBMs provide more predictive power than rankings, or vice versa? In combination, could the information provided be even more predictive than using either alone? This study was designed to answer those and related questions (i.e., to determine the relative and combined power of Teacher Rankings and RTI-based CBM measures for predicting math and reading performance on an end-of-year high-stakes measure [the TCAP]). 
If the goal is to identify the best predictive model for determining those individuals who are really at risk (sometimes referred to as sensitivity) and those individuals who truly are not at risk (sometimes referred to as specificity), determining the relative and combined efficiency could be informative to educators who have to prioritize resources to meet students’ needs (i.e., implementing interventions that have the capacity to “break the prediction” by improving the overall academic success of at-risk students). The following research questions address the specific relevant goals of this study:
Method
Participants and Setting
Participants included all 403 third-grade students from 28 classrooms in eight elementary schools in one rural school district within the Southeastern United States and their 28 teachers. The average class included 14 students. Fifty-nine percent of the students were categorized as economically disadvantaged, 49.9% were male, and 50.1% were female. Also, 92.1% of the students were White and 9.2% were classified as “Other.” Institutional review board (IRB) permission was obtained before implementation of the study.
Instruments
MIR: R
MIR: R (Bell, Hilton-Prillhart, McCallum, & Hopkins, 2010) and MIR: M (McCallum et al., 2010) probes are CBMs administered to groups of students in 3 min. The MIR: R and MIR: M each have four universal screeners and 18 alternate-form progress monitoring probes for students in Grades K-5. For these analyses, only composite scores were used from one probe, the first universal screener, administered early in the third-grade year. Each MIR: R screener and probe (for Grades 2-5) consists of four alternating narrative- and expository-text passages, each containing 10 sentences that include words from high frequency word lists and content and VOC from state science and social studies standards for third-grade curriculum, and written at an end-of-grade Spache readability level (i.e., grade-level 3.9 for third-grade-level probes). Passages contain no end punctuation marks (e.g., periods, exclamation points, question marks) and no capital letters to signify the beginning of sentences. Students are to indicate where one idea ends and another begins, then make a slash mark in this place. The MIR: R total score (Comprehension Rate) was used as the Reading CBM measure in this study. Comprehension Rate is an amalgam of number of words read silently and a student’s ability to correctly identify a certain number of ideas within a specified time period (i.e., within 3 min); it is a measure of reading rate mediated by comprehension.
Large correlation coefficients help establish the reliability of the MIR: R. For adjacent probes (administered about 2 weeks apart), the average reliability was .80, p < .001 (Hilton-Prillhart, 2011). In addition, concurrent validity estimates between MIR: R and AIMsweb© Maze ranged from .43 to .55; the coefficient between MIR: R and the STAR Reading Assessment (Renaissance Learning, 2015b) was .67 (Hilton-Prillhart, 2011). Hilton-Prillhart compared the predictive utility of MIR: R and AIMsweb© Maze scores to estimate end-of-year STAR scores and, using a stepwise multiple regression, found that MIR: R scores predicted 37% of the variance in the STAR scores and were the most powerful predictor; AIMsweb© scores failed to produce additional predictive variance. Other reliability and validity data for MIR: R Grades K through 5 may be accessed from McCallum et al. (2013) and K. C. Miller et al. (2015).
MIR: M
The MIR: M includes items that assess in an alternating fashion both math calculation (operations, greater than and less than) and reasoning (number series, shape patterns); content for third-grade screeners and probes is based on end-of-year expectations for math proficiency; the items require no reading to complete. The score used for these analyses is the total of both Math Calculation and Math Reasoning items correct. M. B. Hopkins (2010) reported that correlations between various alternate forms of MIR: M probes ranged from .59 to .80 at third grade, but these reliability estimates are artificially low because of the increased time between administrations of many of the pairwise coefficients (i.e., values obtained between those taken early in the year with probes taken later in the year). Hopkins reported a test–retest reliability of .76 for third grade, within a 2-week interval. In addition, external validity data were gathered for the MIR: M. Correlation coefficients between the MIR: M and Monitoring Basic Skills Progress: Computation (MBSP; Fuchs, Hamlett, & Fuchs, 1999) show adequate concurrent validity estimates ranging from .58 to .75 for third grade. Stepwise multiple regression results show that the MIR: M provided stronger relative predictive power than did the MBSP when predicting end-of-year STAR Math (Renaissance Learning, 2015a) scores, explaining 33% of the STAR Math variance. McCallum et al. (2013) provides additional psychometric data on the MIR: M.
TCAP Achievement Test
The TCAP Achievement Test is administered annually in late spring to students in Grades 3 to 8, as required by the state of Tennessee Department of Education. The TCAP is a timed, criterion-referenced, standardized test and employs a multiple-choice format across five different content areas: Reading, Language Arts, Mathematics, Science, and Social Studies. Internal consistency reliability coefficients are reported as typically ranging from .95 to .96 (N. Miller, DeLapp, & Driscoll, 2007). In addition, concurrent validity coefficients resulting from the analysis of TCAP and AIMsweb© Oral Reading and Maze subtests reported by Yeo, Fearrington, and Christ (2012) ranged from .51 to .75. TCAP scores used in this study included the composite scaled scores. TCAP Reading comprises the following subscales: Language, VOC, Writing and Research, Communication and Media, Logic, Information Text, and Literature. TCAP Math comprises the following subscales: Mathematical Processes, Number and Operations, Algebra, Geometry and Measurement, and Data Analysis, Statistics, and Probability.
Teacher Rankings
Classroom teachers completed a simple ranking of students’ performance needs in reading and math, according to the following instructions:
Please refer to your current class roster. Consider each student’s performance and achievement in reading and math since the beginning of the school year. Think about how each student performs in daily work, on assignments, activities, projects and tests. Rank order your students in order of need, from lowest to highest for both reading and math. Suppose you have 20 students in your class. You will rank the student with the lowest need No. 1 and the student with the highest need No. 20. Rankings in reading and math should be independent. A student can have a high need in one but not the other. Base your rating on your own experiences with each student, regardless of whether he or she receives extra tiers of instruction and/or special education. We realize rating some students will be hard; just give us your best judgment. Thank you!
Procedures
MIR: R and MIR: M universal screeners were administered in counterbalanced order across classrooms in group format by their teachers who had been trained to follow a script. The CBMs were administered directly to the students. Their teachers completed the rankings according to the directions above. Data were gathered in the second year of implementation of RTI and use of the MIR in this district; consequently, teachers were well versed in administration procedures. The first MIR universal screeners were administered at 6 weeks into the school year; teachers completed the ranking forms the same week. Furthermore, the teachers did not see the raw scores, they did not score the screener, and they were not apprised of the results until at least 2 weeks after administration. All ranking data were transferred to a data file and rankings were yoked to the students’ CBM scores using a numerical code.
Objective scoring guides for both MIR: R and MIR: M were used to ensure objectivity. Teacher Ranking data were entered into a database by a district technical assistant. Study authors, advanced school psychology students, and special education graduate students checked the hard-copy protocols for accuracy of scoring and data entry, which yielded an error rate below 1%. TCAP achievement tests were administered by classroom teachers and scored according to State Department of Education criteria.
Data Analyses
Once the data for the study were organized, data cleaning procedures were used as described by Morrow and Skolits (2015). First, the data were examined for the percentage of missing values for each variable. Because the missing data appeared to be random, and to maintain consistency for analysis and reporting purposes, researchers decided to use only the student records that had complete data for all independent variables (continuous variables: MIR: R, Teacher Ranks: R, MIR: M, Teacher Ranks: M; indicator variables: class/teacher, school) and dependent variables (TCAP-Reading and TCAP-Math). The data set was then divided into two files (Reading and Math) with related variables for two separate analyses (Reading and Math).
Prior to conducting descriptive and inferential statistical analyses, the entire data set was examined for outliers using standardized scores. Based on Tabachnick and Fidell’s (2013) criterion of standardized values falling outside the range of −3.29 to +3.29, interval outliers were identified for all predictor and dependent variables under study. Three outliers in reading and four in math were excluded by deleting the entire observation from the data set prior to the analysis, which resulted in 403 observations with data for all variables considered in this study.
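The screening rule described above can be illustrated with a minimal sketch. The function name `flag_outliers`, the use of the sample standard deviation, and the example scores are assumptions made for illustration only; the source reports the ±3.29 criterion but not the software or the exact computation used.

```python
from statistics import mean, stdev


def flag_outliers(values, cutoff=3.29):
    """Return the indices of values whose standardized score lies outside +/-cutoff.

    Assumption: z-scores use the sample standard deviation (n - 1 denominator).
    """
    m = mean(values)
    s = stdev(values)
    return [i for i, v in enumerate(values) if abs((v - m) / s) > cutoff]


# Hypothetical scores: twenty ordinary values plus one extreme value.
scores = list(range(90, 110)) + [400]
print(flag_outliers(scores))  # index of the extreme value
```

In a screening of this kind, each flagged index would identify an entire student record to be dropped before the regression analyses.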
Scatter plots were constructed, and correlation coefficients were computed to examine the presence and magnitude of the linear relationships among study variables (independent and dependent). To address research questions that are primarily focused on relative and combined predictive power (Questions 1 and 2), we considered zero-order correlation coefficients and regression-based models (in particular, sequential regression).
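For readers who want to see how zero-order correlations relate to a sequential (hierarchical) regression with two predictors, the following sketch may help. It uses the standard identity for the squared multiple correlation with two predictors. The function names and the example correlations are hypothetical and are not values from this study; in particular, the correlation between the two predictors is an assumed number.

```python
def r2_two_predictors(r_y1, r_y2, r_12):
    """Squared multiple correlation for a criterion regressed on two predictors.

    r_y1, r_y2: zero-order correlations of each predictor with the criterion.
    r_12: correlation between the two predictors.
    """
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


def r2_change(r_y1, r_y2, r_12):
    """Variance explained by predictor 2 above and beyond predictor 1."""
    return r2_two_predictors(r_y1, r_y2, r_12) - r_y1 ** 2


# Hypothetical values: a CBM entered first (r = .50), rankings entered
# second (r = -.60), and an assumed predictor intercorrelation of -.40.
print(round(r2_two_predictors(0.50, -0.60, -0.40), 3))
print(round(r2_change(0.50, -0.60, -0.40), 3))
```

When the two predictors are uncorrelated, the R2 change equals the squared zero-order correlation of the second predictor; overlap between the predictors reduces the unique contribution of the second.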
Given the nested structure of the data, we initially considered a multilevel analysis even though only 28 teachers participated, which is fewer than is typically recommended for multilevel modeling (see Hox, 2010). We began additional data analysis by calculating the ICC for a two-level model for both reading (.08) and math (.07), leading to an analysis of the Design Effect (reading, 2.04 and math, 1.91). ICCs were less than .10 (i.e., less than 10% of variance in TCAP scores are attributed to the difference in teachers) and the Design Effect was approximately equal to or less than the recommended value of 2.0. That is, neither ICC value reached the 10% benchmark that has been suggested for conducting multilevel analysis (Charlton, 2013; Grace-Martin, 2013; Kianoush & Masoomehni, 2015) and the Design Effect values are only borderline acceptable. Consequently, we conducted classical single-level model analyses and tested the violation of the autocorrelation assumption using the Durbin and Watson statistic. For both reading and math, Durbin and Watson values were within the range of 1.5 to 2.5, indicating indirectly that the data do not violate the independence of error assumption (see Garson, 2012; Gujarati, 2009; Haery, Bahrami, & Haery, 2013; Tabachnick & Fidell, 2013). Given that no significant violation was observed, the single-level model with the nested data was applied. This analysis is consistent with our primary research goal of quantifying the predictive relationship between measures (MIR and Teacher Ranks) on TCAP scores at the student level.
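The Design Effect values reported above are consistent with the widely used approximation 1 + (m − 1) × ICC, where m is the average cluster size. The source does not state which formula was used, so the sketch below is an assumption; it reproduces the reported values when m is set to the average class size of 14.

```python
def design_effect(icc, avg_cluster_size):
    """Approximate design effect for clustered data: 1 + (m - 1) * ICC."""
    return 1 + (avg_cluster_size - 1) * icc


# ICC values and average class size as reported in the text.
reading_deff = design_effect(0.08, 14)
math_deff = design_effect(0.07, 14)
print(round(reading_deff, 2), round(math_deff, 2))
```

Under this approximation, a value near 2.0 means that clustering roughly doubles the sampling variance relative to a simple random sample of the same size.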
To evaluate Questions 3 and 4, sequential multiple regression analysis was performed with MIR scores entered into the model as the first variable, and then Teacher Ranks variables were added above and beyond MIR scores. Residual analysis was conducted to check for any violation of assumptions and to make necessary modifications to the model. To create the most practical predictive model, we dichotomized TCAP scores (dependent variable) within categories/levels “at-risk” or “not at-risk” based on the different TCAP cut score percentiles. Sequential logistic regression analysis was later performed to estimate the model parameters and obtain predicted probabilities (at risk) for each student. To obtain sensitivity and specificity values, ROC curve analysis was preferred over logistic regression analysis as the latter analysis provides such values only for specified classification cutoff probability (commonly 0.5), whereas the former analysis provides values for all possible classification cut off points. Furthermore, predicted probabilities from logistic regression were used as the independent variable for the ROC curve analysis with MIR scores only and with the MIR scores plus Teacher Ranks. When both MIR score and Teacher Ranks were used as independent variables (screeners), predicted probabilities acted as proxy for the combined effect (one variable), and this variable facilitated calculating a conventional ROC curve (rather than conducting a more complex ROC analysis with multiple independent variables).
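The ROC step described above can be sketched in a few lines. The code treats the predicted probability from a logistic regression as a single screening score, computes sensitivity and specificity at every observed cutoff, and obtains the AUC as the proportion of at-risk/not-at-risk pairs that are ordered correctly. The function names and the eight example cases are hypothetical; the source does not report the software used for this analysis.

```python
def roc_points(y_true, probs):
    """Return (cutoff, sensitivity, specificity) for each observed cutoff.

    y_true: 1 = at risk, 0 = not at risk.
    probs: predicted probability of being at risk; a case is classified
    as at risk when its probability is at or above the cutoff.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = []
    for cut in sorted(set(probs), reverse=True):
        tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p >= cut)
        tn = sum(1 for y, p in zip(y_true, probs) if y == 0 and p < cut)
        points.append((cut, tp / n_pos, tn / n_neg))
    return points


def auc(y_true, probs):
    """AUC as the share of (at-risk, not-at-risk) pairs ranked correctly; ties count half."""
    pos = [p for y, p in zip(y_true, probs) if y == 1]
    neg = [p for y, p in zip(y_true, probs) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical risk status and predicted probabilities for eight students.
status = [1, 1, 1, 0, 1, 0, 0, 0]
predicted = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
for cut, sens, spec in roc_points(status, predicted):
    print(cut, sens, spec)
print(auc(status, predicted))
```

Because the predicted probability already combines whatever predictors entered the logistic model, the same two functions apply whether the model contains one screener or both.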
Results
Descriptive statistics for the MIR: R and MIR: M universal screener scores, Teacher Rankings, and TCAP scores are presented first. Relations between MIR scores and teacher rankings are depicted via zero-order correlation coefficients. Results from analyses used to investigate the relative predictive power of the MIR: R and Teacher Rankings follow. Molar analyses addressed the relative predictive power of MIR: R and Rankings across all teachers; molecular analyses addressed the relative predictive power of the two independent variables by teacher via a hierarchical regression analysis. Finally, the prediction of at-risk status was determined using both predictor scores, with TCAP as the dependent variable, by examining ROC, which used predicted probabilities obtained from logistic regression.
Descriptive Statistics of MIR: R and MIR: M, Teacher Rankings, and TCAP Scores
Descriptive statistics for the 403 participants were calculated for each of the MIR scores (first universal screener), Teacher Rankings, and the TCAP composite scale scores. MIR: R scores ranged from 8 to 297, with a mean of 101.18 (SD = 65.98). The number of students in the 28 classes ranged from 13 to 21; the average class size was 14. TCAP Reading Composite scale scores ranged from 670 to 879, with a mean of 758.27 (SD = 27.88). MIR: M scores ranged from 6 to 69, with a mean of 25.55 (SD = 8.84). Finally, TCAP Math Composite scale scores ranged from 641 to 900, with a mean of 755.45 (SD = 32.89).
Research Questions 1 and 2: Prediction of TCAP Scores From MIR and Teacher Rankings
The predictive relations between the MIR: R scores and the TCAP Reading Composite, and between the Teacher Rankings (Reading) and the TCAP Reading Composite, can be expressed as Pearson product–moment correlation coefficients; the coefficients are .51 (p < .01) and −.64 (p < .01), respectively. A higher Teacher Ranking score indicates higher risk status, hence the negative correlation between Teacher Rankings and achievement. To determine the variance shared between the TCAP and the predictor variables, coefficients of determination (r2) were calculated, yielding .26 for MIR: R and TCAP Reading Composite scores, and .41 for Teacher Rankings (Reading) and TCAP Reading Composite scores. Thus, the MIR: R accounted for 26% and Teacher Rankings accounted for 41% of the variance in the criterion (TCAP Reading Composite).
The statistics reported in the above paragraph indicate predictive power at the molar level (i.e., for the entire sample [r = .51 and −.64]). To determine the extent to which prediction varies as a function of performance on TCAP scores, additional analyses were conducted. Correlation coefficients adjusted for restriction in range were calculated for students within categories operationalized by TCAP percentiles (i.e., those with percentile ranks between 1 and 15, those between 16 and 30, those between 31 and 45, those between 46 and 60, and so on). As is apparent from Table 1, predictive capacity varies across these groups for both Teacher Rankings and MIR: R. Teacher Ranking prediction is best for those who earned percentile ranks between 1 and 15. In general, the lowest coefficients were obtained for the two highest groups, those with TCAP percentiles greater than 75. In general, teachers seem to have more difficulty ranking readers who perform best, but the second category also created ranking difficulty. The TCAP predictions from the MIR: R were modest across all categories, ranging from .16 to .34.
Table 1. Relations Between Teacher Rankings and MIR: Reading and Math Scores and TCAP Scores as a Function of TCAP Performance.
Note. Correlation coefficients are adjusted for restriction in range. MIR = Monitoring for Instructional Responsiveness; TCAP = Tennessee Comprehensive Assessment Program.
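The text does not state which correction for restriction in range was applied. One widely used option is Thorndike's Case 2 formula, sketched below as an assumption rather than as the authors' procedure. The unrestricted SD of 27.88 is the full-sample TCAP Reading Composite SD reported above; the within-band SD of 8.0 and the within-band correlation of .20 are hypothetical values chosen only to show the arithmetic.

```python
import math

def correct_for_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Thorndike Case 2 correction for direct restriction in range.

    r_restricted    : correlation observed in the restricted subgroup
    sd_unrestricted : SD of the restricted variable in the full sample
    sd_restricted   : SD of the same variable within the subgroup
    """
    u = sd_unrestricted / sd_restricted
    r2 = r_restricted ** 2
    return (r_restricted * u) / math.sqrt(1.0 - r2 + r2 * u ** 2)

# Full-sample TCAP Reading SD from the text; the other two values are hypothetical
corrected = correct_for_range_restriction(0.20, 27.88, 8.0)
print(round(corrected, 3))
```

When the subgroup SD equals the full-sample SD the function returns the observed correlation unchanged, and the adjustment grows as the subgroup becomes more homogeneous.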
The associations between the MIR: M score and the TCAP Math Composite, and between Teacher Rankings (Math) and the TCAP Math Composite, were also calculated and expressed as correlation coefficients: .39 (p < .01) and −.57 (p < .01), respectively. To determine the variance shared between the TCAP and the predictor variables, coefficients of determination (r2) were calculated, yielding .15 for MIR: M and TCAP Math Composite scores and .32 for Teacher Rankings (Math) and TCAP Math Composite scores. The MIR: M score accounted for 15% and Teacher Rankings accounted for 32% of the variance in the TCAP Math Composite scores.
As is apparent from Table 1, predictive capacity varies across these groups for math just as it did for reading. Teacher Rankings are best for those who performed worst on the TCAP and worst for those in the intermediate performance categories. All the adjusted coefficients between MIR: M and TCAP scores were low to modest. In general, teacher rankings predicted TCAP scores much better than MIR: M for those whose TCAP scores were lowest, but predictions from Teacher Rankings and MIR: M were more similar for those in the remaining categories.
Predictive Power of Teacher Rankings Beyond MIR: R Scores
Initially, the predictive relation between the MIR: R scores, Teacher Rankings, and student performance on the TCAP Reading Composite was determined via a sequential multiple regression focusing on individual student outcomes across all teachers. CBM data (i.e., MIR: R scores) were entered into the regression equation first because teachers within schools that adopt the RTI model typically already have CBM scores on hand. MIR: R scores predicted 30% of the variance in TCAP scores, R = .55, Adj. R2 = .30, F(1, 400) = 172.63, p < .001. Adding Teacher Rankings beyond the MIR: R scores significantly improved the prediction of TCAP Reading Composite scores (R2 change = .18, F change = 136.79, p < .001), explaining an additional 18% of the variance. The combination of the two predictors explained 48% of the variance in the TCAP Reading Composite score, R = .69, Adj. R2 = .48, F(2, 399) = 184.81, p < .001.
Similarly, the predictive relation between the MIR: M scores, Teacher Rankings, and student performance on the TCAP Math Composite was determined via a second sequential multiple regression using the same analytical techniques described for the reading data. The MIR: M score was entered into the regression equation first and predicted 16% of the variance in TCAP scores, R = .40, Adj. R2 = .16, F(1, 399) = 77.13, p < .001. Adding Teacher Rankings beyond the MIR: M scores significantly improved the prediction of TCAP Math Composite scores (R2 change = .22, F change = 142.16, p < .001), explaining an additional 22% of the variance. Combining the two predictors explained 38% of the variance in TCAP Math Composite scores, R = .62, Adj. R2 = .38, F(2, 398) = 129.29, p < .001.
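The F change statistics reported for the two sequential regressions can be approximated from the rounded R2 values using the standard formula for the increment in R2: F change = (R2 change / number of added predictors) / ((1 − R2 full) / residual df). The sketch below is only a consistency check under the assumption that the rounded, adjusted values in the text stand in for the unrounded R2 values; small discrepancies from the reported 136.79 and 142.16 are therefore expected.

```python
def f_change(r2_full, r2_reduced, n_added, df_residual):
    """F statistic for the increase in R-squared when predictors are added.

    r2_full     : R-squared of the model with all predictors
    r2_reduced  : R-squared of the model before the new predictors
    n_added     : number of predictors added in this step
    df_residual : residual degrees of freedom of the full model
    """
    return ((r2_full - r2_reduced) / n_added) / ((1.0 - r2_full) / df_residual)

# Rounded values taken from the text (reading: F(2, 399); math: F(2, 398))
f_reading = f_change(0.48, 0.30, 1, 399)
f_math = f_change(0.38, 0.16, 1, 398)
print(round(f_reading, 2), round(f_math, 2))
```

Both results land within about 1% of the reported statistics, which is what rounding the inputs to two decimals would be expected to produce.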
Research Questions 3 and 4: Accuracy of Diagnosing At-Risk Status
A bivariate logistic regression model was computed to determine the probability of risk status using MIR: R scores and Teacher Rankings as predictors, and the two categories of risk status were determined (i.e., at risk and not at risk) based on TCAP scores as the criterion. Students who scored lower than some percentile (e.g., 10%) on the TCAP Reading Composite were classified as at risk. Cutoff scores ranging from the lowest 10% to the lowest 30% were used for exploratory purposes, given that there is no universally accepted “standard” for determining at-risk status (i.e., cut scores will vary across districts given availability of resources and philosophy); results are presented in Table 2.
Table 2. Contribution From MIR: R and Teacher Rankings to Predict At-Risk Status: Logistic Modeling Summary.
Note. MIR: R = Monitoring Instructional Responsiveness: Reading.
Hit Rate is the percentage correctly classified as at risk or not at risk.
Sensitivity represents the percentage of actually at-risk students classified as at risk based on the models.
**p < .01. ***p < .001.
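The hit rate and sensitivity defined in the table notes can be computed from predicted probabilities once a classification cutoff is fixed. The following is a minimal sketch with hypothetical probabilities and outcomes (not data from the study), using the conventional 0.5 cutoff; the function name and the returned dictionary keys are illustrative choices.

```python
def classification_summary(probabilities, at_risk, cutoff=0.5):
    """Hit rate, sensitivity, and specificity at a single probability cutoff.

    probabilities : predicted probability of at-risk status per student
    at_risk       : 1 if the student was actually at risk, else 0
    """
    tp = fp = tn = fn = 0
    for p, y in zip(probabilities, at_risk):
        flagged = p >= cutoff
        if flagged and y == 1:
            tp += 1          # at-risk student correctly flagged
        elif flagged and y == 0:
            fp += 1          # not-at-risk student flagged in error
        elif not flagged and y == 0:
            tn += 1          # not-at-risk student correctly cleared
        else:
            fn += 1          # at-risk student missed
    return {
        "hit_rate": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical predicted probabilities and actual outcomes
probs = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.3, 0.05]
actual = [1, 1, 1, 0, 0, 0, 0, 0]
summary = classification_summary(probs, actual)
print(summary)
```

Because at-risk students are the minority, a model can post a high hit rate while missing many of them, which is why sensitivity is reported alongside the hit rate.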
For all cut off values, based on the logistic regression models, MIR: R scores significantly predicted at-risk status. Adding Teacher Rankings beyond the MIR: R score as a predictor significantly improved the prediction of at-risk status. The highest sensitivity value was observed using the 25% cut off score, when both MIR: R scores and Teacher Rankings were included. Specifically, using 25% as the cut off score, MIR: R scores significantly predicted at-risk status,
ROC curve data provided a measure of goodness of fit and were used to evaluate the fit of the logistic regression models based on the simultaneous measurement of sensitivity (true positive rate) and specificity (true negative rate) across multiple cutoff points (Fawcett, 2006; Linden, 2006). To further explore the diagnostic accuracy of MIR: R alone and MIR: R with Teacher Rankings, ROC curves were generated using predicted probabilities obtained from the logistic regression analysis (Takahashi, Uchiyama, Yanagisawa, & Kamae, 2006) for various TCAP cut scores. Predicted probabilities combined the effects of MIR: R and Teacher Rankings into one measure, a requirement for conventional ROC curve analysis, as opposed to the more complex hierarchical ROC curve analysis, which is conducted with multiple independent variables. For all TCAP cutoff values, adding Teacher Ranking data to the MIR data increased the likelihood of successfully screening those students who are truly at risk, and the highest specificity value (for a fixed sensitivity of 0.9 and prevalence of 0.1) was obtained when using the 25% cutoff value. Complete results of the ROC analysis are presented in Table 3. Using the 25th percentile definition of at-risk status, the area under the curve (AUC) using MIR: R alone is .789, which is indicative of a fair diagnostic test, according to Hanley and McNeil’s (1983) benchmark values (1 = perfect, 0.9-0.99 = excellent, 0.8-0.89 = good, 0.7-0.79 = fair, 0.51-0.69 = poor, 0.5 and below = worthless). Adding Teacher Rankings along with MIR: R to the ROC curve analysis significantly increased the AUC value to .881 (good; p < .01) and yielded a specificity of .706 (correct identification of non-at-risk status) at a sensitivity of .90 (correct identification of at-risk status).
Table 3. Diagnostic Efficiency (Sensitivity 0.9 and Prevalence 0.1) of MIR: Reading and Teacher Rankings.
Note. MIR = Monitoring for Instructional Responsiveness. AUC = area under the curve; CI = confidence interval; PPP = positive predictive power; NPP = negative predictive power.
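Positive and negative predictive power follow from sensitivity, specificity, and prevalence through Bayes' rule. The sketch below applies that standard relationship to the in-text reading values (sensitivity .90, specificity .706, prevalence .10). It is an illustration of the arithmetic only; the resulting figures are computed here and are not copied from the table.

```python
def predictive_power(sensitivity, specificity, prevalence):
    """Return (PPP, NPP) for a screener at a given prevalence.

    PPP: proportion of flagged students who are truly at risk.
    NPP: proportion of cleared students who are truly not at risk.
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppp = true_pos / (true_pos + false_pos)
    npp = true_neg / (true_neg + false_neg)
    return ppp, npp

# Sensitivity, specificity, and prevalence as stated in the text for reading
ppp, npp = predictive_power(0.90, 0.706, 0.10)
print(round(ppp, 3), round(npp, 3))
```

At a low prevalence, even a screener with good sensitivity flags many students who are not at risk, so PPP is much lower than NPP; this is a property of the base rate rather than of the particular measures.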
Similarly, a bivariate logistic regression was used to determine the classification accuracy of at-risk status for math using MIR: M scores and Teacher Rankings as predictors; results are presented in Table 4. For all cutoff values, adding Teacher Rankings beyond the MIR: M score as a predictor significantly improved the prediction of at-risk status. Apparently, using 15% as a cutoff score provided the most powerful prediction, and adding Teacher Rankings beyond the MIR: M score as a predictor significantly improved the prediction of at-risk status, with a change of
Table 4. Contribution From MIR: M and Teacher Rankings to Predict At-Risk Status: Logistic Modeling Summary.
Note. MIR: M = Monitoring Instructional Responsiveness: Math.
Hit rate is the percentage correctly classified as at risk or not at risk.
Sensitivity represents the percentage of actually at-risk students classified as at risk based on the models.
**p < .01. ***p < .001.
ROC curve analyses based on predicted probabilities obtained from logistic regressions were used to further evaluate and compare the performance of MIR: M alone and MIR: M with Teacher Rankings for diagnosing at-risk status. For all cutoff values, adding Teacher Ranking data to the MIR data increased the likelihood of successfully screening those students who were truly at risk, and the highest specificity value was obtained when using the 15% cutoff value (results are presented in Table 5). Using the 15th percentile definition of at-risk status, the area under the curve using MIR: M alone is .683 (considered poor as a diagnostic test; Hanley & McNeil, 1983). Adding Teacher Rankings along with MIR: M to the ROC curve analysis significantly increased the AUC value to .829 (good; p < .01) and yielded a specificity of .696 (correct identification of non-at-risk status) at a sensitivity of .90 (correct identification of at-risk status).
Table 5. Diagnostic Efficiency (Sensitivity 0.9 and Prevalence 0.1) of MIR: Math and Teacher Rankings.
Note. MIR = Monitoring for Instructional Responsiveness; AUC = area under the curve; CI = Confidence Interval; PPP = positive predictive power; NPP = negative predictive power.
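Reporting specificity at a fixed sensitivity of 0.9 amounts to walking down the ROC curve until at least 90% of truly at-risk students are flagged and then reading off the specificity at that cutoff. A minimal sketch of that search follows; the predicted probabilities are hypothetical, and the function name is an illustrative choice rather than part of any statistics package.

```python
def specificity_at_sensitivity(scores, labels, target_sensitivity=0.9):
    """Find the highest cutoff whose sensitivity meets the target.

    scores : predicted probability of at-risk status (higher = riskier)
    labels : 1 = actually at risk, 0 = not at risk
    Returns (cutoff, sensitivity, specificity).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Lowering the cutoff can only raise sensitivity, so the first cutoff
    # that meets the target also gives the best available specificity.
    for cutoff in sorted(set(scores), reverse=True):
        sensitivity = sum(s >= cutoff for s in pos) / len(pos)
        if sensitivity >= target_sensitivity:
            specificity = sum(s < cutoff for s in neg) / len(neg)
            return cutoff, sensitivity, specificity
    raise ValueError("target sensitivity cannot be reached")

# Hypothetical predicted probabilities: 10 at-risk and 10 not-at-risk students
at_risk_scores = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.2]
not_at_risk_scores = [0.62, 0.5, 0.45, 0.4, 0.35, 0.3, 0.25, 0.15, 0.1, 0.05]
scores = at_risk_scores + not_at_risk_scores
labels = [1] * 10 + [0] * 10

cutoff, sens, spec = specificity_at_sensitivity(scores, labels)
print(cutoff, sens, spec)
```

Fixing sensitivity first reflects a screening priority: missing an at-risk student is treated as costlier than flagging a student who turns out not to need support.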
Discussion
The purpose of this study was to determine the degree to which Teacher Ranking data and CBM: R and CBM: M data can be used independently and in combination to predict at-risk status for students who participate in end-of-year, statewide testing. Considering the high stakes associated with standardized testing, information providing early insights into students’ academic strengths and challenges is both useful and far reaching, as outcomes affect not only students, but their teachers as well. For example, early identification of at-risk students may dictate timely implementation of effective academic interventions. Overall, results of this study provide evidence in support of employing both CBM and Teacher Ranking data in predicting end-of-year standardized test scores.
As the results from analyses linked to Research Questions 1 and 2 show, the predictive power of the MIR: R and Teacher Ranking data reflects moderate independent relations with TCAP Reading Composite scores; the combined power of both predictors significantly increases their predictive efficiency. For example, MIR: M and TCAP Math Composite correlations are relatively low, and Teacher Ranking (Math) and TCAP Math Composite scores are only moderately related, but their combined statistical power yields a salient increase in predictive efficacy. Results support stronger predictive relations among the reading measures than the math measures, similar to results reported by Christ et al. (2008); that is, math CBMs may be less capable as predictors than reading CBMs. These findings could be related to possible masking effects caused by high language loadings on end-of-year standardized test items. As Bell, Taylor, McCallum, Coles, and Hays (2015) indicated, TCAP Math scores are influenced by reading because many items present math problems within the context of narrative-based scenarios. The language contamination may diminish the relation between MIR: M and TCAP-M. In the next section, we relate findings to some previous research.
Relative Predictive Utility of Teacher Rankings and CBMs
Some research supports the value of teachers’ perceptions in predicting students’ academic success, and results from this study are consistent with those findings. Specifically, Teacher Rankings were moderately related to both TCAP Reading Composite and TCAP Math Composite scores and, in fact, were more powerful predictors of both reading and math achievement than CBMs. By comparison, Demaray and Elliott (1998) examined the relation between teacher judgments and students’ performances on an objective achievement test, and noted that these operationalizations were highly correlated, r = .70 (p < .001). Similar studies examining the relation between Teacher Rankings and teacher judgments have shown strong and positive correlations between Teacher Rankings and standardized test scores, and yielded data demonstrating the strengths and utility associated with the use of Teacher Rankings as predictors of high-stakes test outcomes (Hoge & Coladarci, 1989; Madelaine & Wheldall, 2005).
However, not all the literature is as positive. For example, Eckert et al. (2006) reported mixed findings when examining the efficacy of teacher judgments versus CBM data. Specifically, researchers investigated teachers’ accuracy in evaluating student achievement in the areas of math (addition and subtraction) and reading (first through fourth grade reading passages) and also required them to make judgments about first- through fifth-grade students’ frustrational, instructional, and mastery levels. When researchers compared teacher judgments and CBM data, they determined that teachers were generally accurate in estimating their students’ basic math skills and reading achievement limitations, but were less accurate when making judgments about the frustrational, instructional, and mastery levels of their students. One caveat is important to mention. Even though Teacher Rankings can provide strong predictions, prediction accuracy is related to individual teacher proficiency. While the literature supports the conclusion that most teachers seem capable of providing accurate predictions, some are not. Future researchers should address the characteristics that affect accuracy.
Combined Predictive Utility of Teacher Rankings and MIR: R/MIR: M
The continued use of the RTI model and of high-stakes end-of-year statewide testing reinforces the need for early identification of at-risk learners and the implementation of empirically based interventions. Valid and predictive measures such as CBMs similar to the MIR: R/MIR: M, combined with Teacher Rankings, are plausible approaches to obtaining relevant information regarding students’ academic mastery levels. Current results support the conclusion that Teacher Rankings significantly add to the prediction of reading group achievement results beyond reading CBM scores; similar results were found using math group achievement outcomes. In combination, Teacher Rankings and reading and math CBM scores predict high-stakes, end-of-year testing scores with moderate to strong accuracy and have the potential to increase educators’ abilities to readily identify students who need curricular supports.
Finally, it is somewhat surprising that Teacher Rankings yielded higher sensitivity in predicting group achievement scores than CBMs. The importance and utility of teacher interactions with students in predicting academic success should not be underestimated, nor should the ability of teachers to rank their students, particularly given that these data can be obtained quickly and easily.
Predicting At-Risk Status: Practical Applications
The methodology employed to address Research Questions 3 and 4 is one school personnel could adopt to efficiently identify students who are at risk for poor performance on end-of-year achievement tests. Within the RTI context, CBM data are already collected early in the academic year (i.e., the first whole-school universal screener), and Teacher Rankings could be obtained concurrently. Taken together, these sources of data provide good predictive power and add precision to the equation for teachers who need help identifying students most at risk as defined by end-of-year cut scores set by the school.
For participants of our study, the highest percentage of identified truly at-risk and not at-risk students occurred when the group achievement criterion was set at the lowest performing 25% for reading, but for math, the highest percentage of correctly identified students was obtained by setting the criterion at the bottom 15% on the group test. The fact that cutoffs differ as a function of academic area (reading vs. math) may be related to the general finding that reading CBMs predict end-of-year performance better than do math CBMs (Christ et al., 2008) or the specific characteristics of the particular measures and/or sample. Of course, particulars may differ as a function of idiosyncratic differences that exist from one system to another, such as the specific type of CBM employed, geography, socioeconomic status, gender, ethnicity, race, and so on. For example, system personnel may find that the most efficient criterion “cut score” for determining at-risk status is different from the values reported in this study (25% for reading and 15% for math).
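School personnel who want to explore several cut scores, as was done here for the lowest 10% through the lowest 30%, first need to dichotomize end-of-year scores at each candidate percentile. A minimal sketch of that step follows, using 20 hypothetical scale scores; the function name and the rounding rule for group size are assumptions made for illustration, not the procedure used in the study.

```python
def at_risk_flags(criterion_scores, percent):
    """Flag the lowest `percent` of criterion scores as at risk (1) vs. not (0).

    Students tied with the threshold score are all flagged, so the flagged
    group can be slightly larger than `percent` when ties occur.
    """
    k = max(1, round(len(criterion_scores) * percent / 100))
    threshold = sorted(criterion_scores)[k - 1]
    return [1 if s <= threshold else 0 for s in criterion_scores]

# Hypothetical end-of-year scale scores for 20 students
scores = list(range(700, 800, 5))

# Number of students flagged at each candidate cut score
flag_counts = {pct: sum(at_risk_flags(scores, pct)) for pct in (10, 15, 20, 25, 30)}
print(flag_counts)
```

Each set of flags can then serve as the outcome in a separate logistic regression and ROC analysis, so that hit rate, sensitivity, and AUC can be compared across cut scores within a given system.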
Finally, and related to the points above regarding the predictive value of Teacher Rankings, cautious interpretation is necessary. In general, the literature is supportive of using teacher information, including rankings and ratings, to help make screening and placement decisions and to predict end-of-year high-stakes performance. This research adds to that literature. However, the accuracy of prediction varies across end-of-year performance levels and apparently teachers are not equally capable of providing accurate prediction. Consequently, system administrators who use rankings should provide orientation to all teachers as to the goals and procedures associated with any ranking process to help ensure fidelity.
Limitations and Future Research
A limitation of this study is the lack of fidelity information on scoring and administration of the CBMs and Teacher Rankings. Though teachers underwent supervised training detailing proper administration of the MIR: R and MIR: M and they used scripted instructions, observation of each administration was not a possibility. However, during training, their respective schools’ literacy coaches used fidelity checklists; fidelity during training was observed to be 100% by the authors of the MIR instruments. During implementation, literacy coaches were tasked with observing teachers and retraining any teachers whose fidelity compliance fell below 80%. In addition, ethnic and regional diversity within this sample is limited, and some specific demographic data were unavailable for participants, though the systemwide K-12 demography for all third-grade classrooms in the district allows a reasonably accurate characterization. For example, individual students were not categorized as being in general classrooms (vs. special education), though special education students (except those with profound disabilities) were included in the sample.
In addition, there are a number of strategies available for obtaining Teacher Rankings. We provided teachers with a class roster and asked them to rank their students’ needs based upon their achievement in the areas of reading and math. Other options for operationalizing rankings might include modifying the instructions to teachers, obtaining rankings from multiple teachers, or focusing on strengths rather than needs. In addition, teachers could be asked to rank students not just in more global reading and math performance, but also in more specific areas of performance (e.g., word calling, comprehension, or math calculation vs. math reasoning). In the future, researchers may consider alternative (and more targeted) methods of gathering information regarding Teacher Rankings. In particular, a more focused strategy may help teachers rank students within homogeneous classrooms (i.e., classrooms containing students who exhibit very little difference in academic performance). Of course, these recommendations cannot eliminate limitations associated with evaluating the accuracy/utility of rankings obtained across multiple classrooms when classroom-level academic skills vary significantly. In the future, researchers may address this by modifying the methodology (e.g., blocking classes by performance level, then comparing ranking accuracy to some predetermined standard).
Furthermore, researchers may be motivated to compare the predictive relations found in this study with those between Teacher Rankings and other, more commonly used CBMs or progress monitoring tools, such as the DIBELS (Good & Kaminski, 2002), AIMSweb© (www.aimsweb.com), and STAR Reading (Renaissance Learning, 2015b). This research could be extended by investigating relations between teacher characteristics (e.g., scores on teacher evaluation systems, teacher attitudes toward and/or knowledge of assessment, ratings of teachers based on student value-added data) and their accuracy in rating student performance.
Finally, generalizations are limited to similar size classes and schools/systems with similar characteristics (e.g., demographics, geography). Participants in this study were from a predominantly White, rural school system in the south.
Significance of the Study and Summary
These findings contribute to the literature in that results confirm the predictive power of reading and math group-administered CBMs when end-of-year group achievement test performance is the criterion. Results provide evidence in support of using a simple, quick strategy (Teacher Rankings) to increase accuracy in predicting at-risk status in reading and math. Within the context of this study, Teacher Rankings predicted end-of-year scores better than CBMs, and rankings can be easily incorporated into an RTI screening process. Of note, the combined predictive power of CBMs and rankings provided significantly more power than either alone. Perhaps most importantly, school personnel in any school system can adopt the methodology applied in this study to determine the extent to which CBMs and/or rankings predict high-stakes tests within the context of their system, and ultimately make decisions about which high-stakes cut scores are most appropriate for their unique contexts. However, prediction does vary as a function of end-of-year performance levels and quite likely as a function of the predictive capacity of teachers. Finally, school personnel should keep in mind that the analyses employed in this study are sensitive to sample size and that results should be interpreted with caution.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
