Abstract
We used meta-analysis to examine the criterion validity of four scoring procedures used in curriculum-based measurement of written language. A total of 22 articles representing 21 studies (N = 21) met the inclusion criteria. Results indicated that two scoring procedures, correct word sequences and correct minus incorrect sequences, have acceptable criterion validity with commercially developed and state- or locally developed criterion assessments. Results indicated trends for scoring procedures at each grade level. Implications for researchers and practitioners are discussed.
Since the passage of the Education for All Handicapped Children Act (EAHCA) in 1975, Individual Education Plans (IEPs) and annual goals for students with disabilities have been a hallmark of special education. Schools are required to report progress toward these goals at least annually. The 2004 reauthorization of EAHCA, renamed the Individuals With Disabilities Education Improvement Act, required that a plan for monitoring progress toward annual goals be included in a student’s IEP and that reports of progress be given to parents at least as often as peers not receiving special education services receive reports of their academic progress (i.e., report cards; Individuals With Disabilities Education Act, 2004).
Educators monitor many aspects of students’ educational activities. They may be concerned about progress toward academic outcomes or social-behavioral goals (Archer & Hughes, 2010; Epstein, Atkins, Cullinan, Kutash, & Weaver, 2008). Within the academic domain, some teachers employ curriculum-based assessments (e.g., Venn, 2013), but curriculum-based measurement is probably the most widely employed method for monitoring progress and making educational decisions (Lembke, Hampton, & Hendricker, 2013). Curriculum-based measures have been employed to assess diverse aspects of reading, arithmetic, spelling, and writing (Hosp, Hosp, & Howell, 2012).
As with any measurement system, curriculum-based measures should meet standards for technical adequacy. They should be reliable and valid. In this study, we examine aspects of the technical adequacy of progress monitoring measures for written language. After a brief description of the various measures used to monitor students’ writing performance, we discuss the research about psychometric qualities of measures of writing.
Written Language
Assessing written language is complicated by the multi-faceted nature of the skill. Although there is some disagreement as to the exact components in writing, there is general consensus about five basic elements: handwriting, spelling, mechanics, usage, and ideation (Taylor, 2006). Often experts emphasize some of these elements of writing more than other elements. For example, Graham, Harris, and Hebert (2011) argued that presentation effects (e.g., spelling, handwriting, and grammar) should not overly factor into judgments of writing for students with disabilities. Espin (2014), however, argued that writing instruction and assessment should focus on effectively communicating a message to readers, which includes skills such as grammar, spelling, and punctuation.
Cutler and Graham (2008) conducted a national survey of primary grade teachers and found that approximately 70% of teachers reported monitoring student writing at least weekly. Although it is encouraging that a large majority of teachers self-reported assessing student writing weekly, it is concerning that almost a third of teachers are not assessing student writing on a weekly basis.
The lack of frequency in assessing writing is understandable given the time required for assessing writing. Teachers can spend a significant amount of time assessing writing because many components of writing have no “right answer.” Ideation, for example, is largely subjective and requires the teacher to assess the ideas developed by the student in relation to the purpose for writing. These judgments take significant time for the teacher to make for one student, and the time demands are compounded when monitoring an entire class. These time demands increase as the students age and their writing becomes more complex.
Curriculum-Based Measurement
Curriculum-based measurement (CBM) can decrease the time necessary to monitor writing progress. Deno (1985; Deno, Mirkin, & Marston, 1980) led the initial development of CBM with the goal of establishing a set of measures that were technically adequate (e.g., reliable and valid), efficient to administer and score, inexpensive, sensitive to growth, and easily understood. His efforts established procedures for administering and scoring CBMs in reading, mathematics, written language, and spelling (Deno, 1985).
CBM for written language (CBM
Words Written (WW)
WW was one of the CBM-W scoring indices developed by Deno et al. (1980). In WW, the scorer simply counts the number of words in the student’s writing sample. Words do not have to be spelled correctly or used in correct grammatical function. Any two conjoined letters are counted as a word. Single letter words (e.g., “I” and “a”) are also counted.
Words Spelled Correctly (WSC)
WSC was also originally developed by Deno et al. (1980). This scoring procedure involves counting WSC in the writing sample regardless of the appropriateness of usage (much like a computer spell-checker would assess correctly spelled words).
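The two counting rules above can be illustrated with a minimal Python sketch. It is not the scoring procedure used in any of the reviewed studies: the tokenizing pattern, the function names, and the small LEXICON set (standing in for a real spelling dictionary) are assumptions made for illustration.

```python
import re

# Hypothetical mini-lexicon standing in for a real spelling dictionary.
LEXICON = {"the", "dog", "ran", "to", "a", "i", "park", "and", "saw", "cat"}


def tokenize(sample):
    """Split a writing sample into letter strings (apostrophes allowed)."""
    return re.findall(r"[A-Za-z']+", sample)


def words_written(sample):
    """WW: count strings of two or more letters, plus the words 'I' and 'a'."""
    count = 0
    for token in tokenize(sample):
        letters = token.replace("'", "")
        if len(letters) >= 2 or letters.lower() in {"i", "a"}:
            count += 1
    return count


def words_spelled_correctly(sample, lexicon=LEXICON):
    """WSC: count tokens found in the lexicon, ignoring usage and context."""
    return sum(1 for token in tokenize(sample) if token.lower() in lexicon)


sample = "The dog rann to the prak and I saw a cat"
print(words_written(sample), words_spelled_correctly(sample))
```

In this sketch the misspelled strings still count toward WW but not toward WSC, which mirrors the distinction drawn between the two indices.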
Correct Word Sequences (CWSs)
CWS was developed by Videen, Deno, and Marston (1982). This scoring procedure involves counting sequences of correct words. A CWS is defined as two adjacent words that are both spelled correctly and used appropriately in a sentence. The first scored sequence in a sample is from “blank-to-first-word.” The last scored sequence is from “last-word-to-blank.” In this scoring procedure, there will always be one more possible sequence than words in the writing sample.
Correct Minus Incorrect Word Sequences (CIWSs)
CIWS was developed by Espin et al. (2000). The rules for this scoring procedure follow the rules for CWS. In addition, the number of incorrect word sequences is counted and subtracted from the total CWS. The difference between CWS and incorrect word sequences is the student’s score for the writing sample.
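The sequence-based indices can be sketched in a few lines of Python once a human scorer has judged each word. The function name and input format below are hypothetical, and the sketch assumes that the boundary "blank" is always acceptable, so the first and last sequences depend only on the first and last words; operational scoring rules may also weigh capitalization and end punctuation at those boundaries.

```python
def score_sequences(acceptable):
    """Return (CWS, incorrect sequences, CIWS) for one writing sample.

    `acceptable` holds one boolean per word: True when a scorer judged the
    word to be correctly spelled and appropriately used in the sentence.
    Assumption: the leading and trailing blanks are treated as acceptable.
    """
    if not acceptable:
        return 0, 0, 0
    padded = [True] + list(acceptable) + [True]
    pairs = list(zip(padded, padded[1:]))  # n words yield n + 1 sequences
    cws = sum(1 for left, right in pairs if left and right)
    incorrect = len(pairs) - cws
    return cws, incorrect, cws - incorrect


# Five words, the third judged unacceptable (hypothetical judgments).
print(score_sequences([True, True, False, True, True]))
```

A single unacceptable word in the middle of a sample breaks two sequences, so CIWS penalizes errors more heavily than CWS does.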
Technical Adequacy
As with any worthwhile measure, if CBM-W is to be used as a progress monitoring tool, it must meet certain technical requirements of reliability and validity. Previous research in CBM-W has focused specifically on criterion validity and alternate form reliability. Criterion validity assesses the degree to which a measure correlates with another measure (Taylor, 2006). Although reliability of a test is certainly important and necessary, a measure that produces consistent but invalid scores is meaningless.
Establishing criterion validity of a measure in writing is difficult due to disagreement in the field about how much weight should be placed on various elements of writing. Researchers have typically reported lower criterion validity for writing measures than reading measures. For example, researchers have reported criterion validity coefficients of .85 or higher for the Woodcock Reading Mastery Test–Revised with the Woodcock–Johnson Psychoeducational Battery as a criterion measure (Taylor, 2006). In contrast, researchers have reported criterion validity coefficients of .30 to .50 for the Test of Written Language–3 with the Comprehensive Test of Nonverbal Intelligence as a criterion measure (Taylor, 2006). For CBM measures in reading, researchers have reported criterion validity coefficients that reach .70 to .80 and even approach .90 at times (Wayman, Wallace, Wiley, Tichá, & Espin, 2007). Given the criterion validity of standardized writing measures in comparison with standardized reading measures and the difficulty defining good writing, CBM measures in written language would be expected to have lower criterion validity than CBM measures in reading.
Another way of determining the validity of a measure is to consider the consequences of decision making with and without the measure (Messick, 1989). Researchers and practitioners do not have many options for progress monitoring tools in writing that are sensitive to growth, technically adequate, and efficient to administer and score. A lack of measures does not justify using measures with a dearth of evidence supporting their technical properties. However, researchers and practitioners might accept measures with lower criterion validity in writing than they would in reading due to the numbers of available measures in each of these areas.
Fuchs (2004) described three stages of research necessary for determining the usefulness of CBM. Stage 1 requires investigating the technical features of a static score. Stage 2 requires examination of technical features of CBM slopes (i.e., rates of growth over time). Stage 3 requires examination of CBM’s usefulness in instruction. Although Stages 2 and 3 are important considerations, Stage 1 is foundational for CBM-W to be used as a progress monitoring tool.
McMaster and Espin (2007) conducted a comprehensive literature review of CBM-W studies investigating the reliability and validity of these measures (i.e., Stage 1). Their review included studies across all K–12 grade levels. Their review was helpful for beginning to answer questions related to Stage 1; however, the methods of their review led to highly variable results for validity and reliability. For example, criterion validity was reported as low as −.24 and as high as .99 across the studies. Because theirs was a literature review, they did not calculate aggregated statistics across studies. Instead, they presented and summarized the findings of each study. The most common scoring procedures across studies were WW, WSC, CWS, and CIWS. They found criterion validity differed for varying grade levels. Furthermore, they found that studies conducted after the initial Institute for Research on Learning Disabilities (IRLD) studies, which developed CBM, had much lower criterion validity than the IRLD studies did and that studies that reported results across grades had higher criterion validity than studies that reported results within grades.
McMaster, Ritchey, and Lembke (2011) summarized the existing CBM research for early writers in each of Fuchs’s (2004) three stages. Of the 11 studies they described, all had answered Stage 1 questions, five answered Stage 2 questions, and none answered Stage 3 questions. Similar to McMaster and Espin (2007), McMaster, Ritchey, and Lembke (2011) provided a narrative description of studies and identified strengths and weaknesses without calculating statistics across studies. The most common scoring procedures in these studies were WW, WSC, and CWS.
Rationale and Research Questions
The purpose of this meta-analysis was to extend the work of McMaster and Espin (2007) and McMaster, Ritchey, and Lembke (2011). Specifically, we wanted to increase the specificity of criterion validity determinations. Calculating a criterion coefficient in a meta-analytic way has the advantage of providing a more precise result by giving a criterion validity coefficient with a confidence interval, rather than a range of the minimum and maximum coefficients found in all studies. Mean coefficients across several studies are a more stable indicator of criterion validity than results from individual studies. Therefore, this review sought to answer four basic questions regarding the use of the four most common CBM-W scoring procedures found in the two previous reviews.
Method
Literature Review
We adopted search procedures from McMaster and Espin (2007) as a model and searched the ERIC, PsycINFO, and Science Citation Index Expanded databases using the search terms curriculum based measurement, curriculum based measure, general outcome measure, and progress monitoring. In addition, we searched the ERIC, Academic Search Complete, and PsycINFO databases using the search terms written language and curriculum based measurement. In total, our searches returned 11,802 results. Using RefWorks, we removed exact duplicates and had 9,771 results remaining. Before searching the results, two authors reviewed 200 titles and abstracts to ensure agreement with the inclusion criteria. We had 100% agreement after this initial review of results. Next, we divided the search results between two authors and used 2,195 results for intercoder agreement. We calculated kappa at .67 (with three disagreements) based on the review of these 2,195 search results. After reviewing titles and abstracts, 22 articles met the inclusion criteria for our study.
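A kappa of .67 alongside only three disagreements in 2,195 records may look surprising, so a brief sketch of Cohen's kappa for two coders and a binary include/exclude decision is given here. The cell counts are hypothetical: the agreement table is not reported in the text, and the values below were chosen only because they match the reported total and number of disagreements.

```python
def cohens_kappa(both_yes, only_a, only_b, both_no):
    """Cohen's kappa for two coders making a binary include/exclude decision."""
    total = both_yes + only_a + only_b + both_no
    observed = (both_yes + both_no) / total
    a_yes = (both_yes + only_a) / total  # coder A's inclusion rate
    b_yes = (both_yes + only_b) / total  # coder B's inclusion rate
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return (observed - expected) / (1 - expected)


# Hypothetical cell counts: 2,195 records, three disagreements.
kappa = cohens_kappa(both_yes=3, only_a=2, only_b=1, both_no=2189)
raw_agreement = (3 + 2189) / 2195
print(f"raw agreement = {raw_agreement:.4f}, kappa = {kappa:.2f}")
```

Because almost every record is excluded by both coders, chance agreement is already very high, and a handful of disagreements on the rare "include" decisions pulls kappa well below the raw agreement rate.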
Inclusion Criteria
Focus of study and design
First, to be included, studies had to report quantitative measurement of criterion validity for at least one of the four forms of CBM-W under investigation (WW, WSC, CWS, CIWS) in a peer-reviewed journal. We included only peer-reviewed journal articles due to concerns regarding the rigor of non-peer reviewed publications. We also reasoned that strong reports and dissertations would subsequently be published in peer-reviewed sources. However, including only peer-reviewed studies opens the possibility of publication bias affecting results. Therefore, we conducted a test for publication bias. Criterion validity could be determined by commercially developed achievement measures (e.g., the Test of Written Language–3, Woodcock–Johnson III) or a state- or district-developed achievement measure. We did not require normative data be available for all state- or locally developed assessments. Studies that used locally developed measures had to use a test or rubric developed by the district and not by individual teachers. We excluded studies that used only holistic ratings, teacher ratings, developmental sentence scoring, grade point average (GPA), or classroom grades. To be included, studies had to provide information necessary to calculate a mean correlation (correlation coefficient and sample size).
Participants in the study
We included only K–12 studies. We did not require student disability status to be reported. Most of the studies we found included students with and without disabilities and did not report results for these groups separately.
Final corpus
Campbell (2010) was excluded because it only included students learning English as a second language, and it examined the utility of passage copying tasks for students in ninth to 12th grade. Passage copying is not a typical writing task for the majority of high school writers. Because the data for D. C. Parker, McMaster, and Burns (2011) were drawn from McMaster, Du, et al. (2011), we considered these two articles as one study. After making decisions about all articles in our search records, we compared our results with McMaster and Espin’s (2007) to ensure we did not neglect any studies that met our criteria.
After excluding and combining studies, 21 studies remained to form the corpus of this analysis. Often, studies reported correlations for several grades individually. In addition, many studies reported correlations with several criterion measures for each grade level and each scoring procedure. In total, we examined 739 correlations for this analysis.
Coding
We assigned codes for scoring procedure, grade level, and criterion assessment type. We found 44 different types of CBM-W scoring procedures in these studies. By a wide margin, the three most common scoring procedures were WW, WSC, and CWS. Of the 22 articles, 95% (n = 21) reported criterion validity of WW, 73% (n = 16) reported criterion validity of WSC, 100% (n = 22) reported criterion validity of CWS, and 45% (n = 10) reported criterion validity of CIWS.
We initially coded grade level as it was reported in each study. Grade levels ranged from kindergarten through 12th grade. However, there were too few studies at some grade levels to make analysis at the individual grade level sensible, so we grouped studies into four categories: K–2, third to fifth, sixth to eighth, and ninth to 12th. Two studies did not fit into these grade categories. Gansle, VanDerHeyden, Noell, Resetar, and Williams (2006) included students in second to fifth grades and did not report correlations for the individual grade levels; we coded this study as a third to fifth study. Cheng and Rose (2009) reported results for students in seventh to 12th grades collectively; we coded it as a ninth- to 12th-grade study.
We coded two types of assessments. We distinguished between commercially developed, norm-referenced assessments and state- or district-developed assessments. Studies could report total or subscale-subtest results for criterion measures. If we had questions about whether a commercially developed assessment was norm referenced, we searched the test manual or the Internet for normative information for the given assessment. We excluded the Test of Emerging Academic English used in Campbell, Espin, and McMaster (2013) because we were interested in criterion validity using criterion measures that assess a wide range of students, not English Language Learners only. Two raters reviewed all articles, and interrater reliability (agreements/total of agreements and disagreements) was 97.3%. We reconciled the disagreements as a group and used the reconciled codes in the subsequent analyses.
Selecting Correlations for Analysis
Due to the large number of correlations, we selected the highest reported correlation for each scoring procedure for each sample. Selecting the highest correlation for each sample gave each scoring measure the highest possible score to be included in our calculations. Some studies reported the criterion validity of CBM-W with a large number of tests and subtests, some of which had little to do with writing ability. For example, Espin, Scierka, Skare, and Halverson (1999) reported criterion validity of CBM-W with subtests of the California Achievement Test. The subtests included reading, math, language arts total, language arts expression, and language arts mechanics. We were only interested in CBM-W measures as a predictor of performance on criterion measures that assessed writing or language arts ability. By picking the highest correlations, we avoided diluting the results with irrelevant criterion measures, such as a math subtest. Selecting the highest correlation also gave the scoring procedures the best possible criterion validity under the best administration procedures (i.e., prompt type and duration) for each study. Many studies found that criterion validity increased as the duration of writing was extended. The correlation selection procedures avoided the shorter writing durations tempering the criterion validity of the scoring procedure for these studies. We treated grade levels that were reported individually as separate samples and selected the highest correlation for each scoring procedure in each grade level. For example, if a study reported second and third grade separately (e.g., Ritchey & Coker, 2013), we selected the highest correlation for each scoring procedure (in this case WW, WSC, and CWS) in each grade level. Ritchey and Coker used two different prompts (story starter and picture story), two different writing durations (3 and 5 min), and one criterion measure (Woodcock–Johnson III Writing Samples subtest). 
Ritchey and Coker reported a total of 18 correlations. However, by selecting the highest correlation for each scoring procedure at each grade level, we included six for our analysis: the highest correlation for WW, WSC, and CWS in second and third grade. In total, we selected 102 correlations across all studies for our analysis.
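The selection rule described above (one correlation, the highest, per scoring procedure per grade-level sample) can be sketched as follows. The records are hypothetical values invented for illustration and are not the coefficients reported by Ritchey and Coker (2013); the function name is likewise an assumption.

```python
# Hypothetical records: (grade, scoring procedure, prompt, minutes, r).
reported = [
    (2, "WW", "story starter", 3, 0.31),
    (2, "WW", "story starter", 5, 0.38),
    (2, "WW", "picture story", 3, 0.29),
    (2, "CWS", "story starter", 5, 0.52),
    (2, "CWS", "picture story", 5, 0.47),
    (3, "WW", "story starter", 5, 0.35),
    (3, "CWS", "picture story", 3, 0.44),
    (3, "CWS", "picture story", 5, 0.56),
]


def select_highest(records):
    """Keep the highest correlation for each (grade, scoring procedure) pair."""
    best = {}
    for grade, procedure, prompt, minutes, r in records:
        key = (grade, procedure)
        if key not in best or r > best[key][2]:
            best[key] = (prompt, minutes, r)
    return best


selected = select_highest(reported)
for (grade, procedure), (prompt, minutes, r) in sorted(selected.items()):
    print(grade, procedure, prompt, minutes, r)
```

Each grade-by-procedure cell contributes exactly one coefficient, regardless of how many prompts or durations a study tested.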
Calculating Mean Correlations
To determine whether there was enough variability in our data to permit an exploration of our research questions beyond reporting an overall mean weighted correlation value, we calculated Q and I² values. Cochran’s Q assesses homogeneity of data points that were used to calculate a weighted mean (e.g., effect size or mean correlation; Cochran, 1954). A significant Q statistic indicates heterogeneity in the data used to calculate the weighted mean (Cochran, 1954). Higgins and Thompson (2002) recommended the I² statistic, which measures the percentage of variability across the data that is attributable to heterogeneity rather than chance; they argue that Cochran’s Q is underpowered to detect heterogeneity when a meta-analysis includes a small number of studies. All mean weighted correlations and Q statistics were calculated using Comprehensive Meta-Analysis (Version 2) software.
Using all selected correlations, we calculated a Q statistic of 396.229 (p < .001, I² = 76.024). The I² value calculated here indicates that 76% of the variance is likely not due to chance or error. We conducted a second Q statistic calculation using the highest reported correlation for each study regardless of scoring procedure. This analysis produced a Q statistic of 89.297 (p < .001, I² = 68.644). The I² value indicates that approximately 69% of the variance is likely not due to chance or error. The Q statistics and I² statistics for these overall correlations indicate a high level of heterogeneity within the data. These results indicated that there was enough heterogeneity within our data set to explore our research questions.
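The relation between Q and I² can be made concrete with a short sketch based on the Higgins and Thompson (2002) definition, I² = 100 × (Q − df) / Q, floored at zero. The degrees of freedom are not reported in the text; the values used below (95 and 28) are assumptions, chosen only because they reproduce the reported I² values from the reported Q statistics.

```python
def i_squared(q, df):
    """I² as a percentage: share of variability beyond what chance would give."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100


# Reported Q values paired with assumed (not reported) degrees of freedom.
for label, q, df in [("all selected correlations", 396.229, 95),
                     ("highest correlation per sample", 89.297, 28)]:
    print(f"{label}: Q = {q}, assumed df = {df}, I2 = {i_squared(q, df):.3f}")
```

When Q is no larger than its degrees of freedom, the floor returns zero, which corresponds to no detectable heterogeneity.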
We checked for publication bias in our results by calculating Orwin’s failsafe N (Borenstein, 2005). When calculating Orwin’s failsafe N, we used the mean criterion validity coefficient found when combining all scoring procedures. The overall mean weighted correlation was .55. We assumed that the mean effect of all hidden studies was .00 and calculated the number of hidden studies necessary to lower the mean weighted correlation to .30. Orwin’s failsafe N indicated that 29 studies with a mean effect of .00 would have to be unpublished (or published via non-peer-reviewed sources) to lower the mean weighted correlation to .30.
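Orwin's failsafe N has a simple closed form: N = k × (mean effect − criterion effect) / (criterion effect − mean effect of hidden studies). The sketch below applies it to the values given above. Two points are assumptions rather than details from the text: the number of included samples k (29 here is hypothetical), and whether the calculation is carried out on raw correlations or on Fisher-transformed values, so both versions are shown.

```python
import math


def orwin_failsafe_n(k, mean_effect, criterion, hidden_effect=0.0):
    """Number of hidden studies needed to pull the mean down to `criterion`."""
    if not mean_effect > criterion > hidden_effect:
        raise ValueError("expected mean_effect > criterion > hidden_effect")
    return k * (mean_effect - criterion) / (criterion - hidden_effect)


k = 29  # hypothetical number of included samples, not reported in the text

# Version 1: computed directly on the correlation metric.
n_r = orwin_failsafe_n(k, 0.55, 0.30)

# Version 2: computed on Fisher's z (a transformed zero correlation is zero).
n_z = orwin_failsafe_n(k, math.atanh(0.55), math.atanh(0.30))

print(f"raw r: {n_r:.1f} hidden studies; Fisher z: {n_z:.1f} hidden studies")
```

The two metrics give noticeably different answers for the same inputs, so the choice of metric should be stated whenever a failsafe N is reported.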
Typically, the next step in meta-analytic research would be to conduct inferential statistics to explore sources of heterogeneity through our research questions. However, we did not calculate inferential statistics due to small sample sizes. Instead, we calculated weighted correlations and confidence intervals, similar to Slavin’s (1986) recommendations for a Best Evidence Synthesis.
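A weighted mean correlation with a confidence interval can be computed in several ways. The sketch below uses a common fixed-effect approach: transform each r to Fisher's z, weight by n − 3, average, and back-transform the mean and its 95% limits. The text does not state which model the software applied, so the fixed-effect choice is an assumption, and the (r, n) pairs are hypothetical.

```python
import math


def mean_correlation(samples, z_crit=1.96):
    """Fixed-effect weighted mean r and 95% CI via Fisher's z.

    `samples` is a sequence of (r, n) pairs with n > 3.
    Returns (mean_r, lower, upper).
    """
    samples = list(samples)
    weights = [n - 3 for _, n in samples]  # inverse variance of Fisher's z
    zs = [math.atanh(r) for r, _ in samples]
    total = sum(weights)
    z_bar = sum(w * z for w, z in zip(weights, zs)) / total
    se = 1 / math.sqrt(total)
    return (math.tanh(z_bar),
            math.tanh(z_bar - z_crit * se),
            math.tanh(z_bar + z_crit * se))


# Hypothetical (r, n) pairs for one scoring procedure.
mean_r, lower, upper = mean_correlation([(0.40, 50), (0.60, 120), (0.55, 80)])
print(f"mean r = {mean_r:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

Larger samples pull the mean toward their own coefficients and narrow the interval, which is why a pooled estimate is more stable than the minimum-to-maximum range of individual study results.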
Results
A total of 22 articles (21 studies due to combining D. C. Parker et al., 2011 and McMaster, Du, et al., 2011) were included in this meta-analysis. Studies were published between 1991 and 2015. The data in the studies represented the written language performance of 3,361 students. Fourteen studies were published after McMaster and Espin (2007). See Table 1 for a summary of each study.
Table 1. Peer-Reviewed Studies Examining Criterion Validity of Curriculum-Based Measurement in Written Language.
Note. WW = words written; WSC = words spelled correctly; CWS = correct word sequence; TOWL = Test of Written Language; CAT = California Achievement Test; CIWS = correct minus incorrect word sequence; ITBS = Iowa Test of Basic Skills; LEAP = Louisiana Educational Assessment Program; WJ-R = Woodcock–Johnson–Revised; SAT-9 = Stanford Achievement Test–9th Edition; WKCE = Wisconsin Knowledge and Concepts Exam; MBST/MCA = Minnesota Basic Standards Test/Minnesota Comprehensive Assessment; MCA = Minnesota Comprehensive Assessment; TOWL-3 = Test of Written Language–3; TEWL-2 = Test of Early Written Language–2; WJ-III = Woodcock–Johnson III; AIMS = Arizona Instrument to Measure Standards writing assessment; ISTEP Plus EOC = Indiana Statewide Testing for Educational Progress Plus End of Course Assessment; IEP = Individual Education Plan.
(a) Only scoring procedures related to this study are reported here (WW, WSC, CWS, CIWS). Some of these studies included other scoring procedures not reported here.
(b) These studies used specific samples. Parker, Tindal, and Hasbrouck (1991) used a sample of students receiving IEP services. Cheng and Rose (2009) used a sample of students receiving services for hearing impairments. Campbell et al. (2013) used a sample of students receiving services as English Language Learners.
(c) CBM-W probes were administered in 10th grade and criterion measures were administered in 11th grade.
To calculate the overall mean criterion validity of all CBM-W scoring procedures when combined, we selected only the single highest criterion validity coefficient from each sample. When selecting only the highest reported correlation for each study, we calculated an overall correlation of .55 (CI = [.51, .60]). This correlation represents the highest possible mean weighted correlation for all four CBM-W scoring measures combined.
The overall correlations for all four scoring measures are presented in Table 2. WW had an overall correlation of .37. WSC had an overall correlation of .44. CWS had an overall correlation of .51. CIWS had an overall correlation of .60.
Table 2. Mean Correlations for Scoring Procedures.
Note. WW = words written; WSC = words spelled correctly; CWS = correct word sequence; CIWS = correct minus incorrect word sequence.
Mean correlations were calculated for each scoring procedure in four different grade categories. Results are presented in Table 3. In each grade category, CIWS had the highest criterion validity.
Table 3. Mean Correlations for CBM Scoring Procedures Across Four Grade Categories.
Note. WW = words written; WSC = words spelled correctly; CWS = correct word sequence; CIWS = correct minus incorrect word sequence.
Mean correlations were calculated for each type of criterion assessment by combining all scoring procedures. Results are presented in Table 4. The overall correlation for state-developed assessments was .61. The overall correlation for commercially developed assessments was .54.
Table 4. Correlations of Combined Scoring Procedures for State- and Commercially Developed Tests.
Discussion
Our meta-analysis addressed questions about what Fuchs (2004) termed Stage 1 research in CBM. Overall, we found a mean correlation of r = .55 when combining all CBM-W scoring indices. This correlation is the highest possible criterion validity coefficient when aggregating the highest correlation coefficients from each study. This correlation includes all types of scoring procedures for all grade levels. Therefore, it is difficult to make conclusive statements about CBM-W based on this one statistic. In general, these findings fall in the middle of the ranges reported by McMaster and Espin (2007); however, this result allows researchers to provide a more specific and stable answer to the question of how valid CBM-W scores are. Based on the study samples included in this meta-analysis, CBM scoring procedures appear to have moderate criterion validity in relation to commercially and state- or locally developed criterion measures.
When examining each scoring procedure individually, CIWS (r = .60) had the highest criterion validity followed by CWS (r = .51), WSC (r = .44), and WW (r = .37). Although we did not conduct inferential statistics due to the small sample size, there was overlap in the confidence intervals of CIWS and CWS, CWS and WSC, and WSC and WW. CWS and CIWS appear to substantially outperform WW or WSC.
Some may find these results surprising. WW had high criterion validity at its initial development (Deno et al., 1980), but our results are consistent with the trends McMaster and Espin (2007) described: After initial IRLD studies, CBM-W measures tended to have lower criterion validity. CIWS was developed much later than the other measures and was primarily developed for use with secondary students (Espin et al., 2000). Although it has few studies examining criterion validity with younger students, it does appear to be a promising measure in all grade levels.
When examining CBM-W measures at each grade level, we identified trends for each scoring procedure. Specifically, WW had the lowest and CIWS had the highest criterion validity at each grade level. The confidence interval for WW had little overlap with the confidence interval for CIWS, indicating that CIWS likely has stronger criterion validity than WW at every grade level. In most grade levels (3–5 being the only exception), criterion validity increased as the complexity of the scoring procedure increased. For all other grade levels, CIWS was the highest followed by CWS, WSC, and WW.
Finally, no major differences were detected between using commercially developed (r = .54) and state-developed assessments (r = .61) as the criterion. CBM-W appears to predict performance similarly on both types of criterion measures included in this analysis.
Based on the results of Orwin’s failsafe N, we believe these results are robust to publication bias. The failsafe N test indicated that 29 studies with a mean effect of .00 would have to be hidden from this search to lower the overall mean correlation from .55 to .30. We believe it is unlikely that 29 non-peer-reviewed manuscripts exist that meet the rest of our criteria, and if 29 studies do exist, we do not believe the overall mean effect would be .00.
When answering questions about Stage 1 of CBM research, considering the context of writing assessments is helpful. Although lower than what has been reported for CBM in reading (Wayman et al., 2007), these mean correlations are similar to correlations of other standardized writing measures. In an area such as written language where few progress monitoring measures exist, these correlations could be considered strong enough to influence instructional decision making.
Future Research
The results from this study have at least four implications for future research. First, meta-analytic research should continue answering questions about Stage 1 research in CBM. One possibility is to examine the criterion validity of specific administration procedures. For example, McMaster and Espin (2007) found that longer durations of writing typically lead to stronger criterion validity. Also, Espin et al. (2000) hypothesized that expository prompts may have higher criterion validity with secondary students because much of the academic writing they do in school is expository. Second, researchers should continue examining the criterion validity of CIWS. Our extensive search revealed relatively few studies examining the criterion validity of this scoring procedure. Because it had the highest criterion validity in all grade levels, it merits additional study.
Third, researchers should continue investigating the technical aspects of static scores as criterion measures change. The Test of Written Language–3 was commonly used as a criterion measure in these studies. However, the Test of Written Language–4 is now available. In addition, many states are changing state assessments to align with the Common Core State Standards. As new writing assessments are developed commercially and at the state level, researchers should continue to investigate the criterion validity of CBM-W measures with these new measures.
Finally, researchers should consider entering Stage 2 of CBM-W research. Some have begun this work already. For example, McMaster, Du, et al. (2011) examined the slopes of first-grade CBM data and determined that eight or nine data points are needed for reliable and stable slopes. Similar work needs to be done across grade levels and demographic groups.
Implications
Based on the results of this analysis, it appears that practitioners should discontinue the use of WW as a measure of overall writing progress. Of all the CBM-W scoring measures, it provided the weakest criterion validity. Furthermore, practitioners should feel comfortable with the CWS and CIWS scoring procedures as moderate predictors of overall writing ability.
Limitations
In addition to the shortage of studies in certain grade levels, results of this analysis should be viewed in light of at least three other limitations. First, this review examined criterion validity of CBM-W measures only when using commercially developed or state-developed writing assessments. Using holistic ratings, teacher ratings, and GPA may produce different validity coefficients. However, based on findings from McMaster and Espin (2007), any differences are likely to be minimal.
Second, this analysis only included the highest coefficient for each scoring procedure in each sample. This selection procedure means that these results are the highest possible mean weighted criterion validity coefficients that can be calculated from these studies. Including more coefficients from each study could lower the criterion validity results reported here.
Finally, this analysis included only studies from peer-reviewed sources. Orwin’s failsafe N bolsters confidence in the results, but it cannot eliminate the possibility of publication bias.
Conclusion
The present meta-analysis was the first known study to calculate the mean weighted criterion validity of CBM-W measures. The findings suggest that practitioners should consider incorporating CBM-W into IEP development and progress monitoring. Researchers should consider analyzing technical characteristics of CBM-W slope data.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
