Teacher Trainees’ Administration and Scoring Errors on the Kaufman Test of Educational Achievement

Abstract

Achievement tests are used to make high-stakes (e.g., special education placement) decisions, and previous research on norm-referenced assessment suggests that errors are ubiquitous. In our study of 42 teacher trainees, utilizing five of the six core subtests of the Kaufman Test of Educational Achievement, Third Edition (KTEA-3), we found that while most trainees make errors, they do not make a large number per person, with the exception of a few error-prone trainees. In addition, Wilcoxon signed-rank tests indicated that reading comprehension was the subtest most prone to administration (T = 120; p < .001) and clerical (T = 45; p < .01) errors. However, pairwise comparisons between subtests indicated nonsignificant differences in error rates. Based on these findings, we recommend that training programs focus extra attention on reading comprehension and on remediating students who make a disproportionate number of errors. Implications for future research are also noted.

Keywords

achievement tests, KTEA, errors, norm-referenced assessment, training

Norm-referenced tests of academic achievement, such as the Kaufman Test of Educational Achievement, Third Edition (KTEA-3; Kaufman & Kaufman, 2014), Wechsler Individual Achievement Test, Third Edition (WIAT-III; Wechsler, 2009), and the Woodcock–Johnson IV Tests of Achievement (WJ-IV ACH; Schrank, Mather, & McGrew, 2014), are important educational tools. These measures allow educators to evaluate academic achievement and areas of strengths and needed remediation in mathematics, reading, writing, and oral language (Breaux & Lichtenberger, 2016). Norm-referenced achievement measures are used in Individualized Education Program (IEP) creation, to help evaluate a program’s effectiveness (e.g., examine a new reading program at a school), and as pre–post measures of intervention effectiveness (Breaux & Lichtenberger, 2016). In addition, norm-referenced measures are useful for creating interventions to remediate academic difficulties (Breaux & Lichtenberger, 2016; Waugh & Gronlund, 2013), which is critical as the primary goal of assessment should be to improve student learning (National Association of School Psychologists [NASP], 2009; Waugh & Gronlund, 2013).

Perhaps equally important to multidisciplinary evaluation team members, norm-referenced measures of academic achievement allow educators to differentiate students from their peers, make classification decisions (Waugh & Gronlund, 2013), and are widely used for identifying if a child has a specific learning disability (SLD; Breaux & Lichtenberger, 2016). SLD represents the largest educational disability category as approximately 47% of students served by special education are identified with an SLD (National Center for Education Statistics, 2018). Norm-referenced measures of academic achievement are frequently relied on by educational teams as a primary source to make eligibility determinations.

Because norm-referenced measures of academic achievement are critical for making high-stakes decisions (e.g., special education eligibility; NASP, 2009), examiners’ fidelity for implementing these measures should be understood. However, a recent search using Google Scholar revealed only one article (i.e., Harrison, Goegan, & Macoun, 2018) that examined administrators’ performance on norm-referenced measures of academic achievement. This is surprising as there have been numerous studies (e.g., Alfonso, Johnson, Patinella, & Rader, 1998; LoBello & Holley, 1999; Loe, Kadlubek, & Marks, 2007; Mrazik, Janzen, Dombrowski, Barford, & Krawchuk, 2012; Oak, Viezel, Dumont, & Willis, 2019; Sherrets, Gard, & Langner, 1979; Slate, Jones, & Murray, 1991) that examine school psychologists’ proficiency with measures of cognitive ability.

The consensus from these studies indicates that significant administration and scoring errors are typical on measures of cognitive ability. Indeed, studies have found that 50% to 80% of graduate students’ Wechsler IQ protocols contain inaccurate overall cognitive scores, and that these errors generally range from two to four points (Loe et al., 2007). Furthermore, Slate et al. (1991) found that roughly one quarter of graduate student protocols contained scoring errors that were greater than 4 points; thus, these errors fall outside of the standard error of measurement. Common errors include incorrect age calculation, incorrect start point, incorrect use of reverse and discontinuation rules, addition errors, and assigning points incorrectly (Kaufman, Raiford, & Coalson, 2015; Ramos, Alfonso, & Schermerhorn, 2009).

A recent meta-analysis by Styck and Walsh (2016) suggests that 41.2% of cognitive assessment protocols contained at least one error when failure to record responses was not coded as an error. This study also suggested that a graduate student is more likely to commit one or more errors than a practitioner. Specifically, 70.3% of graduate students and 33.6% of practitioners had one or more errors when failure to record responses was not coded (Styck & Walsh, 2016). In addition, Styck and Walsh found an average of 5.5 errors when recording omissions were not considered errors. Of note, the results suggested that graduate students were more likely to commit at least one error, but the practitioners who made errors averaged more than twice as many errors per protocol as the graduate students (Styck & Walsh, 2016).

As administration errors are frequently found on cognitive measures, it follows that the same could be true of norm-referenced assessments of academic achievement. This is due to the similarities between test administration and scoring procedures. Both cognitive and norm-referenced academic achievement measures require the administrator to provide a distraction-free environment, be skilled in establishing proper rapport, be able to detect signs of disinterest and fatigue, and frequently switch between tasks. Importantly, both types of assessment require strict adherence to standardized procedures that include correctly following basal and ceiling rules that might invalidate test scores when violated (Breaux & Lichtenberger, 2016; Flanagan & Alfonso, 2017).

It would seem pertinent to examine the administration and scoring errors of special education teachers instead of school psychologists for multiple reasons. First, assessment is often a key role of special educators (Council for Exceptional Children, 2012) as a recent survey indicated that approximately 30% of the tests of academic achievement used to make special education eligibility determinations are conducted by teachers (Lockwood et al., 2019). Second, special education teachers frequently administer these measures, but receive less education, training, and practice in norm-referenced assessment than school psychologists receive. School psychologists have multiple opportunities to administer standardized norm-referenced tests when learning to use these measures with the assumption that practice will improve their assessment fidelity (Loe et al., 2007). For instance, school psychology graduate students conduct seven or more administrations during their first assessment course on average (Alfonso, LaRocca, Oakland, & Spanakos, 2000; Lockwood & Farmer, in press; Oakland & Zimmerman, 1986). School psychology practitioners also average 66.6 psychoeducational evaluations per year (Castillo, Curtis, & Gelley, 2012) and spend roughly 25% of their time administering tests for SLD eligibility determination (Benson et al., 2019). In addition, the majority of school psychology graduate students have taken a course on psychometrics prior to taking their assessment course (Alfonso et al., 2000). We were unable to locate any published materials on the typical education and training of special educators in norm-referenced assessment, and we are unaware of any training programs that require special education teachers to take a course dedicated to psychometrics. Furthermore, we were unable to find information on the average yearly number of norm-referenced assessment administrations by special educators. Nonetheless, we can estimate the frequency using existing data with some assumptions; as special education teachers in the United States have an average caseload of 17.3 students (Levenson, 2012), and if evaluations generally occur on a triennial basis, it is likely that the average special educator would conduct just under six evaluations per year (i.e., 17.3/3 = 5.77).
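The caseload estimate above can be written out as a short calculation. This is only a sketch of the arithmetic; the variable names are ours, and the triennial cycle is the same assumption stated in the text rather than an observed value.

```python
# Back-of-the-envelope estimate of yearly evaluations per special educator.
# Assumption: every student on the caseload is evaluated once every 3 years.
average_caseload = 17.3        # students per teacher (Levenson, 2012)
reevaluation_cycle_years = 3   # triennial evaluation cycle (assumed)

evaluations_per_year = average_caseload / reevaluation_cycle_years
print(f"{evaluations_per_year:.2f}")  # 5.77
```

The estimate scales linearly with caseload, so a teacher with twice the average caseload would be expected to conduct roughly twice as many evaluations under the same assumption.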

Harrison and colleagues (2018) conducted the only study examining special education teachers’ norm-referenced, standardized academic achievement test error rates. In their examination of 31 KTEA-II, 29 WIAT-III, and 54 WJ-III protocols completed by special education students enrolled in a Canadian assessment course, they found that 98% contained at least one error. These were defined as starting point errors, basal rule errors, discontinuation rule errors, and marking errors (e.g., computation errors or marking a correct item as incorrect; Harrison et al., 2018). These investigators found, on average, 1.39 starting point errors, 1.29 basal rule errors, 1.97 discontinuation rule errors, and 3.39 marking errors on the KTEA-II. However, this study utilized a checklist intervention that could have decreased the number of starting and discontinuation errors (Harrison et al., 2018). In addition, these authors utilized an outdated version of the KTEA (i.e., KTEA-II) and did not provide descriptive statistics for each subtest (e.g., number of starting errors on Math Concepts and Applications; Harrison et al., 2018).

KTEA

The KTEA-3 (Kaufman & Kaufman, 2014) is a more recently published and common measure of academic achievement (Breaux & Lichtenberger, 2016). There appears to be a scarcity of research regarding the frequency with which teachers utilize the KTEA-3; however, the KTEA-3 is indicated as the most commonly used norm-referenced measure of academic achievement by school psychologists (Benson et al., 2019). The KTEA-3 can be administered to individuals from 4 to 25 years of age. It measures reading comprehension, reading fluency, basic reading, math problem solving, math calculation, written expression, listening comprehension, and oral expression. These are areas of SLD classification as outlined by the Individuals With Disabilities Education Act (IDEA, 2004). The KTEA-3 provides age (n = 2,050) and grade norms (n = 2,600) from a sample that was stratified to match the U.S. population based on the 2012 Census Bureau’s Survey (Breaux & Lichtenberger, 2016).

The KTEA-3 was found to have mean grade-based composite reliabilities that ranged from .80 to .90. The mean grade-based subtest reliability coefficients also fell in this range, except for the oral fluency composite (.72), writing fluency (.76), associational fluency (.62), object naming facility (.69), and letter naming facility (.67; Breaux & Lichtenberger, 2016). Validity evidence was also provided using confirmatory factor analysis, which indicated an overall good model fit and high factor loadings for subtests. In addition, the reading, math, and written language composites of the KTEA-3 were correlated between .80 and .91 with corresponding composites on the WIAT-III and WJ-III ACH, which suggested concurrent validity evidence (Breaux & Lichtenberger, 2016). In sum, the KTEA-3 is ubiquitously used for special education assessment and has adequate psychometric properties.

Current Study

The purpose of this study was to examine teacher trainee fidelity when administering the KTEA-3. We examined administration and clerical errors made by teacher trainees to mirror the research on school psychology trainees’ fidelity in the administration of cognitive measures. Furthermore, we wanted to explore if teacher trainees are adequately prepared to use these assessments because of their potentially limited training. We selected the KTEA-3 because it is a recently developed and widely used measure with adequate psychometrics.

This study expanded on Harrison et al.’s (2018) study by (a) providing descriptive statistics by error type for each subtest and providing information about where the specific errors occur (e.g., start vs. stop point); (b) examining administration data without the confound of the checklist, which is not provided by the publisher and, therefore, unlikely to be used by most administrators; and (c) utilizing the current edition of the KTEA (i.e., the KTEA-3), as this measure has been available for approximately 5 years and has been updated since the second edition (Kaufman & Kaufman, 2014). Specifically, this study was conducted to answer the following questions:

Do teacher trainees make errors when administering and scoring the KTEA-3?

If trainees do err, what are the types and frequencies of these errors?

Are particular subtests of the KTEA-3 more prone to administration and/or raw score errors?

Method

Participants

Thirty-five undergraduates and seven graduate students (n = 42) participated in the study. The students were enrolled in a required three credit hour course titled Assessment of Exceptional Learners in the 2018 spring and summer semesters at a state university in the Western U.S. The undergraduate students were enrolled in the spring semester session. They were at the end of their junior or senior year in a dual certification, institutionally approved special and elementary education program. The graduate students were enrolled in an institutionally approved program of study leading to certification in mild/moderate disabilities education. They took the class in the summer session. For all of the students, this was their only special education assessment class. All of the students reported no previous training in administering individual norm-referenced tests of achievement.

Setting

The data collected were obtained as part of the required assessment class for all students seeking state certification in the mild/moderate disabilities area. The undergraduate course took place over the 16-week spring semester, meeting for 2.5 hr/week. The undergraduate course was offered in face-to-face format only. The graduate course took place over a 6-week summer session. It was comprised of web-based instruction and activities, as well as three mandatory 6-hr face-to-face meetings.

In the undergraduate class, the KTEA-3 was taught over a 4-week period. The course required students to read the KTEA-3 manual (Kaufman & Kaufman, 2014) and a text on assessment of exceptional learners (i.e., Overton, 2012). Students were tested on the topics of reliability, validity, test construction, administration, scoring, and interpretation. The students completed one guided practice scoring activity with data provided to them. Following the guided practice, the students independently completed one “mock” practice administration with a fellow classmate playing the role of a child examinee.

During the practice administration, students administered five of the six subtests that comprise the overall (i.e., Academic Skills Battery; Kaufman & Kaufman, 2014) composite, which were Letter and Word Recognition, Reading Comprehension, Math Concepts and Applications, Math Computation, and Spelling. Students were not instructed on administering the Written Expression subtest, which is also used to calculate the overall composite. Students computed the raw score and transformed these raw scores to standard scores, percentiles, stanines, and age and grade equivalents using hand-scoring procedures provided in the KTEA-3 scoring tables for each of the five administered subtests. In addition, students determined the confidence intervals for standard scores. Students were provided written feedback, and class time was allotted to correct any errors made following each practice activity. Corrections were resubmitted to assess for mastery in all components. Mastery was reached when the students had zero remaining errors on the mock administration.

Outcome Measures

Two types of errors were coded from the KTEA-3 Subtest and Composite Score Computation Form (SCSCF; Kaufman & Kaufman, 2014): administration errors and clerical errors.

Administration errors

Administration errors included incorrect start points, incorrect end points, and incorrect reverse rules (i.e., failing to give previous items when basal was not established).

Clerical errors

Clerical errors were divided into subtest raw score errors, front page errors, and use of the incorrect table. Raw score errors included incorrect addition of points earned. Front page errors consisted of the miscalculation of the chronological age, incorrectly transposing raw scores, incorrect conversion of subtest raw scores to standard scores using the publisher’s tables, incorrect calculation of confidence intervals, incorrect percentile rank, incorrect grade and/or age equivalent, incorrect stanine scores, incorrect summation of composite scores, and incorrect conversion of composite raw scores to standard scores using the publisher’s tables. Use of incorrect table occurred when a student failed to use the table that corresponded to the child’s chronological age. These data, with the exception of chronological age, and subtest and composite comparisons were recorded from the SCSCF. Front page errors were not coded as errors on their specific subtests, as these errors were not conceptualized to be specific to characteristics of that subtest.

Procedures

Following the training, students were required to administer the Letter and Word Recognition, Reading Comprehension, Math Concepts and Applications, Math Computation, and Spelling subtests to general education students in third through eighth grades. The university students were required to complete all of the same tasks that they completed during the practice (“mock”) administration. Forty-two completed protocols from these general education student administrations were used in this study.

The protocols were evaluated for administration and clerical errors. Errors were coded separately by two raters. One rater served in the role of graduate assistant for the assessment courses for three semesters prior to the study and was a second-year school psychology graduate student with training and experience in administering the KTEA-3. The other rater was a doctoral-level trainer of school psychologists with 6 years of experience teaching norm-referenced assessment courses. Interrater agreement was 99% across all error coding. Discrepancies between raters were discussed among the researchers and consensus was reached regarding the correct coding.

Inferential Statistics Analysis Plan

All analyses were completed in SPSS v. 25. Nonparametric inferential statistics were used for the primary analyses. We made this decision because the distributions of data violated assumptions of the parametric statistics that could have been selected for the primary analyses. As the frequency distributions for administration and clerical errors were defined by significant positive skew (z > 3.29; Tabachnick & Fidell, 2013), the median is likely the most appropriate index of central tendency though means are also provided. Skewness was especially prevalent in clerical errors with four (9.5%) protocols accounting for 50.4% of observed errors. In addition, independent Mann–Whitney U tests that compared undergraduate and graduate students’ administration and raw score errors across all subtests indicated nonsignificant differences in error rates between groups (ps = .272-1.00; full results available upon request). Therefore, the undergraduate and graduate students’ data were pooled for the primary analyses.

First, we used one-sample Wilcoxon signed-rank tests to determine if the distribution of errors on each KTEA-3 subtest significantly deviated from a median of 0 errors. The significance level was adjusted to p < .005 (.05/10 comparisons) due to the multiple tests run.
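To make the procedure concrete, the following is a minimal sketch of a one-sample Wilcoxon signed-rank test against a median of 0, together with the adjusted significance threshold. It is not the SPSS implementation; the function name is ours, the normal approximation uses a tie-corrected variance with no continuity correction (an assumption), the count of 10 tests is read as 5 subtests by 2 error types, and the data are illustrative only.

```python
import math

def signed_rank_vs_zero(values):
    """One-sample Wilcoxon signed-rank test against a hypothesized median of 0.

    Returns (T, z): T is the sum of ranks of the positive values, and z is the
    tie-corrected normal approximation without a continuity correction.
    """
    nonzero = [v for v in values if v != 0]  # zeros carry no sign and are dropped
    magnitudes = sorted(abs(v) for v in nonzero)

    # Assign midranks so that tied magnitudes share the average of their positions.
    rank_of = {}
    i = 0
    while i < len(magnitudes):
        j = i
        while j < len(magnitudes) and magnitudes[j] == magnitudes[i]:
            j += 1
        rank_of[magnitudes[i]] = (i + 1 + j) / 2
        i = j

    t_plus = sum(rank_of[abs(v)] for v in nonzero if v > 0)
    n = len(nonzero)
    expected = n * (n + 1) / 4
    variance = sum(rank_of[abs(v)] ** 2 for v in nonzero) / 4
    z = (t_plus - expected) / math.sqrt(variance) if variance > 0 else float("nan")
    return t_plus, z

# Bonferroni-adjusted threshold for 10 tests (assumed: 5 subtests x 2 error types).
alpha_adjusted = 0.05 / 10

# Illustrative data only: 42 protocols, 9 with a single error and 33 with none.
illustrative_errors = [1] * 9 + [0] * 33
T, z = signed_rank_vs_zero(illustrative_errors)
print(T, round(z, 2), alpha_adjusted)
```

Because error counts cannot be negative, every nonzero observation contributes to T, so T depends only on how many protocols contain an error; the z score additionally depends on how the nonzero counts are tied.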

Following these tests, two related-samples Friedman’s two-way analyses of variance (Friedman’s ANOVAs) were completed to examine if there were differences in errors across KTEA-3 subtests. The Friedman’s ANOVA provides an omnibus significance test statistic ($χ_{F}^{2}$) and pairwise comparisons (T) to detect differences between conditions (i.e., subtests). The first Friedman’s ANOVA tested for differences in administration errors across subtests. The second Friedman’s ANOVA tested for differences in raw score errors across subtests. For both Friedman’s ANOVAs, the pairwise comparisons were evaluated for statistical significance using the Bonferroni adjusted p values for multiple comparisons.
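For readers unfamiliar with the omnibus statistic, the sketch below computes a Friedman chi-square by ranking each protocol's error counts across subtests. It uses a commonly cited tie-corrected form of the statistic; whether SPSS applies exactly this correction is an assumption on our part, the function names are hypothetical, and the data are illustrative rather than taken from the study.

```python
def midranks(values):
    """Rank values from 1..k, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        for pos in range(i, j):
            ranks[order[pos]] = (i + 1 + j) / 2
        i = j
    return ranks

def friedman_chi_square(rows):
    """Tie-corrected Friedman statistic; rows = one list of k scores per subject."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    sum_squared_ranks = 0.0
    for row in rows:
        for j, r in enumerate(midranks(row)):
            rank_sums[j] += r
            sum_squared_ranks += r * r
    correction = n * k * (k + 1) ** 2 / 4
    denominator = sum_squared_ranks - correction
    if denominator == 0:  # every subject tied on every condition
        return float("nan")
    numerator = (k - 1) * sum((R - n * (k + 1) / 2) ** 2 for R in rank_sums)
    return numerator / denominator

# Illustrative data only: 3 subjects, 3 conditions, identical ordering.
example = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
chi_square = friedman_chi_square(example)
degrees_of_freedom = len(example[0]) - 1
print(chi_square, degrees_of_freedom)
```

With five subtests, the statistic is referred to a chi-square distribution with 4 degrees of freedom, which matches the degrees of freedom reported in the Results.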

For the Wilcoxon signed-rank test and Friedman’s ANOVA pairwise comparisons, the effect size (r) is computed by dividing the z score associated with the test statistic by the square root of the number of observations. For the Friedman’s ANOVA, the effect size is related to the pairwise comparisons, rather than the omnibus test.
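The effect size formula can be illustrated with a few lines of code. This is a sketch only: the function name is ours, the z scores are the Reading Comprehension values reported in Table 5, and taking the number of observations to be the 42 protocols is our assumption.

```python
import math

def effect_size_r(z, n_observations):
    """Effect size r = z / sqrt(N) for a rank-based test."""
    return z / math.sqrt(n_observations)

N = 42  # number of protocols (assumed to be the number of observations)

r_administration = effect_size_r(3.62, N)  # Reading Comprehension, administration
r_raw_score = effect_size_r(3.00, N)       # Reading Comprehension, raw score
print(round(r_administration, 2), round(r_raw_score, 2))  # 0.56 0.46
```

Under this assumption, the computed values agree with the effect sizes reported for Reading Comprehension in the Results.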

Results

Descriptive Statistics

Descriptive statistics for total errors and administration and clerical errors are displayed in Table 1. A total of 129 errors were observed and errors on protocols ranged from 0 to 34 with a median of 2 (SD = 5.59). Furthermore, one or more errors were observed on 76.2% of protocols. Administration errors were noted on 47.6% of protocols, whereas clerical errors were observed on 64.3%.

Table 1.

Overall Errors.

	Protocols with errors		Errors per protocol
Error	n	%	Median	M	SD	Range
Total	32	76.20	2.00	3.07	5.59	0-34
Administration	20	47.61	0.00	0.79	1.07	0-4
Clerical	27	64.29	1.00	2.26	5.66	0-34

Specific error types are listed in Table 2. The most prevalent administration errors were end point errors that were noted on 38.1% of protocols followed by start point (14.3%) and reversal rule errors (9.5%). Clerical errors were made with some frequency; 40.5% of protocols contained a subtest raw score error on at least one subtest and one or more front page errors were observed on 31.0% of protocols. However, the median for all specific error types was zero. Table 3 lists the percentages of protocols with errors by subtest and type of errors. Reading comprehension was the subtest where participants made the most frequent errors with administration errors noted on 35.7%, and raw score errors observed on 21.4% of protocols. Table 4 shows the frequency of specific error types.

Table 2.

Specific Error Types.

Errors	Protocols with errors		Errors per protocol
Errors	n	%	M	SD	Range
Administration
Start point	6	14.29	0.14	0.35	0-1
End point	16	38.09	0.55	0.89	0-4
Reverse rule	4	9.52	0.10	0.30	0-1
Clerical
Subtest raw score	17	40.48	0.45	0.59	0-1
Front page	13	30.95	1.64	5.54	0-32
Incorrect table	8	19.05	0.19	0.40	0-1

Note. The median for all rows was zero.

Table 3.

Subtests: Percentage of Administrations Containing Administration Errors and Raw Score Errors.

Subtest	Administration errors		Raw score errors
Subtest	n	%	n	%
LWR	2	4.76	6	14.29
MCA	5	11.90	0	0
MC	2	4.76	1	2.38
RC	15	35.71	9	21.43
SP	4	9.52	3	7.14

Note. LWR = Letter and Word Recognition; MCA = Math Concepts and Applications; MC = Math Computation; RC = Reading Comprehension; SP = Spelling.

Table 4.

Administration Errors Versus Raw Score Errors.

Subtest	Administration			Raw score
Subtest	M	SD	Range	M	SD	Range
LWR	0.05	0.22	0-1	0.14	0.35	0-1
MCA	0.12	0.33	0-1	0.00	0.00	0
MC	0.05	0.22	0-1	0.02	0.15	0-1
RC	0.48	0.77	0-3	0.21	0.42	0-1
SP	0.10	0.30	0-1	0.07	0.26	0-1

Note. The median for all subtests was zero. LWR = Letter and Word Recognition; MCA = Math Concepts and Applications; MC = Math Computation; RC = Reading Comprehension; SP = Spelling.

Inferential Statistical Analyses

The results of the one-sample Wilcoxon signed-rank tests are presented in Table 5. After adjusting the significance level to p < .005, only the administration (r = .56; p < .001) and raw score (r = .46; p = .003) errors for Reading Comprehension were statistically significant. That is, the distribution of errors significantly deviated from a median of 0 for this subtest.

Table 5.

One-Sample Wilcoxon Signed-Rank Tests for Administration and Clerical Errors for Each Subtest.

Subtest	Administration errors		Clerical errors
Subtest	T	z	T	z
LWR	3	1.41	21*	2.45
RC	120***	3.62	45**	3.00
MCA	15*	2.24	0	–
MC	3	1.41	1	1.00
SP	10*	2.00	6	1.73

Note. LWR = Letter and Word Recognition; RC = Reading Comprehension; MCA = Math Concepts and Applications; MC = Math Computation; SP = Spelling.

*p ≤ .05. **p ≤ .01. ***p ≤ .001.

The Friedman’s ANOVA for differences in administration errors across subtests was significant, $χ_{F}^{2}$ (4) = 27.53, p < .001. After adjusting for multiple comparisons, all subtest comparisons were nonsignificant. Similarly, the Friedman’s ANOVA for differences in raw score errors across subtests was significant, $χ_{F}^{2}$ (4) = 15.22, p = .004, but all subtest comparisons were nonsignificant. Table 6 contains the pairwise comparisons.

Table 6.

Friedman’s ANOVA Pairwise Comparisons for Administration and Raw Score Errors Across Subtests.

Comparison	Administration errors				Raw score errors
Comparison	T	z	p	Adjusted p	T	z	p	Adjusted p
LWR-RC	−0.79	−2.28	.023	.228	−0.18	−0.52	.605	1.00
LWR-MCA	−0.18	−0.52	.605	1.00	−0.36	−1.04	.301	1.00
LWR-MC	0.00	0.00	1.00	1.00	−0.30	−0.86	.388	1.00
LWR-SP	−0.11	−0.31	.756	1.00	−0.18	−0.52	.605	1.00
RC-MCA	0.61	1.76	.078	.785	0.54	1.55	.121	1.00
RC-MC	0.79	2.28	.023	.228	0.48	1.38	.168	1.00
RC-SP	−0.68	−1.97	.049	.492	−0.36	−1.04	.301	1.00
MCA-MC	−0.18	−0.52	.605	1.00	−0.60	−0.73	.863	1.00
MCA-SP	−0.07	−0.21	.836	1.00	−0.18	−0.52	.605	1.00
MC-SP	0.11	0.31	.756	1.00	−0.12	−0.35	.730	1.00

Note. LWR = Letter and Word Recognition; RC = Reading Comprehension; MCA = Math Concepts and Applications; MC = Math Computation; SP = Spelling; ANOVA = analysis of variance.
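The relationship among the T, z, p, and adjusted p columns in Table 6 can be sketched as follows. The source does not state the standard error used for the pairwise comparisons; the formula sqrt(k(k + 1)/(6N)) is our assumption, chosen because it is the usual form for mean-rank differences after a Friedman test and because it gives a T-to-z ratio of about 0.345, which is what most rows of the table show. The function name is hypothetical.

```python
import math

def friedman_pairwise(mean_rank_diff, n_subjects, k_conditions, n_comparisons):
    """Pairwise follow-up to Friedman's test (assumed standard-error formula).

    Returns (z, p, p_adjusted), where p is two-sided and p_adjusted is the
    Bonferroni-adjusted value capped at 1.
    """
    standard_error = math.sqrt(k_conditions * (k_conditions + 1) / (6 * n_subjects))
    z = mean_rank_diff / standard_error
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p from the standard normal
    p_adjusted = min(1.0, p * n_comparisons)
    return z, p, p_adjusted

# LWR-RC administration comparison: T = -0.79, 42 protocols, 5 subtests,
# and 10 pairwise comparisons among 5 subtests.
z, p, p_adjusted = friedman_pairwise(-0.79, n_subjects=42, k_conditions=5, n_comparisons=10)
print(round(z, 2), round(p, 3), round(p_adjusted, 3))
```

Because the tabled T values are rounded to two decimals, the recomputed z and p values differ slightly from those in Table 6; the sketch is meant to show the structure of the adjustment, not to replace the reported values.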

Discussion

The purpose of this study was to examine whether teacher trainees make errors when administering the KTEA-3 and, if errors are made, to determine the frequency and types of errors made. In addition, we hoped to determine whether specific subtests and/or procedures were more prone to errors. This is because there is limited research on teacher competency for the administration of norm-referenced assessments (Harrison et al., 2018), which is in contrast to the robust literature on psychologists’ competency with norm-referenced assessment (Styck & Walsh, 2016). The limited research is problematic due to the high-stakes use of norm-referenced academic achievement tests (NASP, 2009).

The majority of participants (76.2%) made at least one error on the five subtests we examined. The most commonly observed errors were subtest raw score, end point, and front page errors. However, a median of two errors per protocol was found, indicating that while errors do frequently occur, they are not excessive across these five subtests. Of note, the median number of errors for each subtest was zero, as was the median for all specific error types. The median number of clerical errors made was one, and four participants accounted for approximately half of all observed errors. This suggests that errors do frequently occur during administration; however, the number of errors per trainee is fairly low with the exception of a few students. Although administration and raw score errors were mostly unobserved (median = 0), the Wilcoxon test indicated that the administration and scoring of the Reading Comprehension subtest led to a higher chance of errors than the median would suggest. This could be due to administration rules on Reading Comprehension, which are more complicated than those of the other core subtests. That is, they require using item set start points and decision stop points rather than general rules (e.g., discontinue after four consecutive scores of 0; Kaufman & Kaufman, 2014).

Comparison with Previous Research

We found fewer errors than previous studies found for measures of cognitive ability (i.e., Styck & Walsh, 2016). The lower frequency of errors is likely due to multiple factors. For instance, failure to query was a commonly made error in many studies of cognitive ability (e.g., LoBello & Holley, 1999; Loe et al., 2007; Mrazik et al., 2012; Slate et al., 1991; Slate et al., 1993). Multiple subtests of the WISC (i.e., Comprehension, Information, Similarities, and Vocabulary) require frequent querying based on examiner judgment (Flanagan & Alfonso, 2017); however, the KTEA-3 has limited queriable items. The queriable items are generally located on the Reading Comprehension and Listening Comprehension subtests (Breaux & Lichtenberger, 2016). As we did not examine Listening Comprehension, this left only four items with specific queries listed in the stimulus book (i.e., items 31, 67, 78, and 95 of Reading Comprehension; Kaufman & Kaufman, 2014). Due to the relatively low probability of these errors, we did not code them. In addition, awarding an incorrect number of points is an error type that has been noted in many studies of cognitive ability (e.g., LoBello & Holley, 1999; Loe et al., 2007; Slate et al., 1991); subjectively scored subtests were not analyzed in this study. If such a core subtest (i.e., Written Expression) had been examined in this study, we might have observed an increased number of errors due to the subjective scoring of this subtest.

Another reason we may have found fewer errors than previous studies was our decision not to include failure to record responses verbatim, which is the most commonly observed error type noted in many of the cognitive test studies (e.g., Alfonso et al., 1998; Loe et al., 2007). Coding omissions as errors may artificially increase the number of observed errors and make it appear that nearly 100% of protocols have errors (Styck & Walsh, 2016). Also, there is disagreement in the literature as to whether to consider failure to record as an error at all (Styck & Walsh, 2016). Although verbatim responses may provide important information, their omission does not, by itself, negatively affect the validity of test results. Due to this disagreement, and because we are interested in examining errors that will affect the validity of high-stakes decisions based on test scores, we chose not to count failure to record as an error. As previously noted, when failure to record errors is not coded, graduate students make errors on approximately 70.3% of protocols (Styck & Walsh, 2016), which is commensurate with our findings. In addition, graduate students make an average of 3.4 errors per protocol when failure to record errors is not coded (Styck & Walsh, 2016), which is commensurate with our findings of a mean of 3.1 and median of 2. Furthermore, our results were consistent with Mrazik et al. (2012), who found that the majority of clerical errors are made by a small proportion of students.

Our findings also indicated lower rates of errors than Harrison and colleagues’ (2018) study of norm-referenced academic achievement measures. Their study indicated errors on nearly all protocols. Also, they observed higher mean administration (range = 1.39-1.97) and clerical errors (M = 3.39). One possible reason for this difference is their inclusion of the written and the oral language composites, which accounted for 47% and 23% of errors, respectively. However, it is difficult to make direct comparisons with Harrison et al. as their manuscript did not provide average (mean or median) total errors, errors made per subtest, or indicate the percentage of KTEA-II protocols with errors (vs. the errors across other measures). An equally important consideration is that they examined the KTEA-II, whereas we focused on the KTEA-3.

Limitations

As with any study, there are limitations to consider. One constraint is our exclusion of Written Expression and the Oral Language subtests. We did not examine these subtests as we lacked the extant data to do so, because the teacher trainees were not required to administer them for their course. Our results likely underestimate the total number of errors that would be observed due to this limitation. However, as the most common types of SLD are in reading, math, and writing (Cortiella & Horowitz, 2014), the KTEA-3 Oral Language subtests are administered less frequently than the other subtests by special education teachers. In addition, as no research exists examining the current version of the KTEA, we believe our findings provide valuable preliminary information on this topic. However, we acknowledge that it would be beneficial for future research to examine trainee performance in these areas.

Another limitation of this study is that all participants were from the same training program, with instruction provided by one instructor. This may introduce systematic bias to the data due to instructor strengths or deficits. In addition, all examinees were general education students and were excluded from testing if they had a disability or experienced learning difficulties. These elementary school students are not representative of the children who are generally administered the KTEA-3, who typically have, or are suspected of having, a disability. Both of the above limitations could affect the generalizability of these findings. Future research should examine teacher trainees from a variety of training programs across geographic locations, as well as administration to school-age children with a wider range of academic proficiency.

Implications

Training

This study has implications for training aimed at preventing likely errors. Because clerical errors were the most numerous, it might be appropriate to have students use a checklist that requires them to check their work. In addition, errors could be addressed through interventions focused on (a) educating students about the most common errors, (b) providing individualized feedback to students based on their performance, (c) having students examine their classmates’ protocols for errors, and (d) having instructors view an entire test administration (Mrazik et al., 2012).

Furthermore, it is suggested that students and practitioners score their protocols by hand and then with scoring software (e.g., Q-global) as an error check. If both methods of deriving scores provide the same results, then administrators can be confident in these scores. If they do not, then test administrators are advised to check their work further. In addition, instructors must have a systematic plan to remediate the few students who are likely to make a disproportionate number of errors. For example, overcorrection could be used with these students by requiring additional administrations, extra modeling, and close monitoring of administration and scoring. Last, Reading Comprehension subtest administration was significantly more prone to errors than administration of the other subtests. Trainers might find it useful to provide additional training and practice on this subtest. This could be completed through mock administrations that require students to make difficult decisions regarding start and end points. These testing scenarios would be aimed at eliciting decision point deliberation so that instructors can provide affirming or corrective feedback.
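The hand-versus-software check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_score_discrepancies`, the subtest labels, and the score values are hypothetical, and the software-derived scores are assumed to have been typed in by the examiner (the sketch does not call Q-global or any other scoring product).

```python
def find_score_discrepancies(hand_scores, software_scores):
    """Return subtests whose hand-derived and software-derived scores differ.

    Both arguments are dicts mapping a subtest name to a standard score.
    A subtest missing from one source is reported with None for that source.
    """
    discrepancies = {}
    for subtest in sorted(set(hand_scores) | set(software_scores)):
        hand = hand_scores.get(subtest)
        software = software_scores.get(subtest)
        if hand != software:
            discrepancies[subtest] = (hand, software)
    return discrepancies


# Hypothetical values for illustration only (not data from this study).
hand = {"Letter & Word Recognition": 102, "Reading Comprehension": 95}
software = {"Letter & Word Recognition": 102, "Reading Comprehension": 98}

flagged = find_score_discrepancies(hand, software)
for subtest, (hand_score, software_score) in flagged.items():
    print(f"Recheck {subtest}: hand = {hand_score}, software = {software_score}")
```

An empty result means the two scoring methods agree; any flagged subtest signals that the examiner should recheck raw scores, basal and ceiling decisions, and table lookups before reporting.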

Research

There are a number of implications for future research. First, follow-up studies should examine all subtests of the KTEA-3. This is especially true of the Written Expression subtest and the oral language composite subtests, as they are more subjective in scoring and have been associated with the greatest percentage of errors (Harrison et al., 2018). Second, it is necessary to compare the performance of teacher trainees with that of teachers who are already in the field. Because investigations of school psychologists have indicated that practitioners make more than twice as many errors as trainees on norm-referenced measures (Mrazik et al., 2012), it is possible that a similar pattern would be observed in special educators. Third, an examination of the differences in the types and frequency of errors between school psychologists and special education teachers is needed. This would help to determine whether school psychologists, due to their extensive training in psychometrics and norm-referenced assessment, have greater assessment fidelity than their teacher counterparts. Finally, it is important for future research to look beyond errors on protocols. Standardized tests require several procedural steps that cannot be verified from the protocol (e.g., properly providing instructions; Breaux & Lichtenberger, 2016); evaluating these steps requires observations or video reviews (Harrison et al., 2018).

Much of the similar research on cognitive assessment (e.g., Loe, 2014; Ramos et al., 2009) has utilized means and parametric statistics instead of median values and nonparametric analyses. However, our research and that of others (e.g., Loe, 2014) indicated that assessment error data may violate the assumption of normality. For instance, our data were significantly positively skewed due to a few students accounting for the majority of observed errors. This may imply that researchers should examine the distribution of their error data before choosing a measure of central tendency and deciding between parametric and nonparametric analyses. Moreover, these data could suggest that research on intervening with error-prone students could be more fruitful than focusing on error types or specific subtests.
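The effect of a few error-prone examiners on the mean can be illustrated with a short Python sketch. The error counts below are hypothetical values chosen for illustration, not data from this study, and `sample_skewness` is a helper name introduced here; it computes the simple moment-based skewness coefficient (third central moment divided by the 1.5 power of the second central moment).

```python
from statistics import mean, median


def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (positive means a long right tail)."""
    n = len(values)
    center = mean(values)
    m2 = sum((v - center) ** 2 for v in values) / n
    m3 = sum((v - center) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical error counts for 10 trainees (illustration only).
# One error-prone trainee accounts for 22 of the 40 total errors.
error_counts = [0, 1, 1, 2, 2, 2, 3, 3, 4, 22]

print("mean:", mean(error_counts))
print("median:", median(error_counts))
print("skewness:", round(sample_skewness(error_counts), 2))
print("share of errors from top trainee:", max(error_counts) / sum(error_counts))
```

In this made-up sample the mean (4) is twice the median (2) because a single trainee contributes more than half of all errors, which is the kind of pattern that would favor reporting medians and using nonparametric tests.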

Conclusion

In summary, the findings from our study were similar to previous research on norm-referenced assessments, which suggests that errors are typical. However, we found that while most trainees do make some errors, they do not make a large number per person, with the exception of a few students. In addition, of the subtests we examined, Reading Comprehension was the most prone to errors in administration and scoring. Additional practice and feedback might be needed for this subtest, and additional prevention and remediation plans might be necessary for a subset of students. Further research should continue to examine error types and rates to aid in the development of evidence-based training.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Adam B. Lockwood

References

Alfonso, V. C., Johnson, Patinella, & Rader, D. E. (1998). Common WISC-III examiner errors: Evidence from graduate students in training. Psychology in the Schools, 35, 119-125.

Alfonso, V. C., LaRocca, Oakland, T. D., & Spanakos (2000). The course on individual cognitive assessment. School Psychology Review, 29, 52-64.

Benson, N. F., Floyd, R. G., Kranzler, J. H., Eckert, T. L., Fefer, S. A., & Morgan, G. B. (2019). Test use and assessment practices of school psychologists in the United States: Findings from the 2017 national survey. Journal of School Psychology, 72, 29-48.

Breaux, K. C., & Lichtenberger, E. O. (2016). Essentials of KTEA-3 and WIAT-III Assessment. Hoboken, NJ: John Wiley.

Castillo, Curtis, & Gelley (2012). Professional practice school psychology 2010-Part 2: School psychologists’ professional practices and implications for the field. Communiqué, 40(8), 4-6.

Cortiella, & Horowitz (2014). The state of learning disabilities: Facts, trends, and emerging issues (3rd ed.). New York, NY: National Center for Learning Disabilities.

Council for Exceptional Children. (2012). CEC initial level special educator preparation standards. Retrieved from https://www.cec.sped.org/~/media/Files/Standards/Professional%20Preparation%20Standards/Initial%20Preparation%20Standards%20with%20Elaborations.pdf

Flanagan, D. P., & Alfonso, V. C. (2017). Essentials of WISC-V Assessment. Hoboken, NJ: John Wiley.

Harrison, G. L., Goegan, L. D., & Macoun, S. J. (2018). Common examiner scoring errors on academic achievement measures. Canadian Journal of School Psychology, 34, 98-112.

Individuals With Disabilities Education Act, 20 U.S.C. § 1400 (2004).

Kaufman, A. S., & Kaufman, N. L. (2014). Kaufman Test of Educational Achievement (3rd ed., KTEA-3). Bloomington, MN: NCS Pearson.

Kaufman, A. S., Raiford, S. E., & Coalson, D. L. (2015). Intelligent testing with the WISC-V. Hoboken, NJ: John Wiley.

Levenson (2012). Boosting the quality and efficiency of special education. Thomas B. Fordham Institute. Retrieved from https://fordhaminstitute.org/national/research/boosting-quality-and-efficiency-special-education

LoBello, S. G., & Holley (1999). WPPSI-R administration, clerical, and scoring errors by student examiners. Journal of Psychoeducational Assessment, 17, 15-23.

Lockwood, Bohan, Loke, Sealander, Lanterman, & Lafoon (2019, February). Who administers norm-referenced academic achievement assessments? Poster session presented at the National Association of School Psychologists 51st Annual Convention, Atlanta, GA.

Lockwood, & Farmer (in press). The cognitive assessment course: Two decades later. Psychology in the Schools.

Loe, S. A. (2014). Examiner errors on the Reynolds Intellectual Assessment Scales committed by graduate student examiners. Psychology in the Schools, 51, 97-106.

Loe, S. A., Kadlubek, R. M., & Marks, W. J. (2007). Administration and scoring errors on the WISC-IV among graduate student examiners. Journal of Psychoeducational Assessment, 23, 237-247.

Mrazik, Janzen, T. M., Dombrowski, S. C., Barford, S. W., & Krawchuk, L. L. (2012). Administration and scoring errors of graduate students learning the WISC-IV: Issues and controversies. Canadian Journal of School Psychology, 27, 279-290.

National Association of School Psychologists. (2009). School psychologists’ involvement with assessment. Retrieved from https://www.nasponline.org/assets/Documents/Research%20and%20Policy/Position%20Statements/Involvement_in_Assessment.pdf

National Center for Education Statistics. (2018). Digest of education statistics, 2016 chapter 2. Retrieved from https://nces.ed.gov/programs/digest/d16/ch_2.asp

Oak, Viezel, K. D., Dumont, & Willis (2019). Wechsler administration and scoring errors made by graduate students and school psychologists. Journal of Psychoeducational Assessment, 37, 679-691.

Oakland, T. D., & Zimmerman, S. A. (1986). The course on individual mental assessment: A national survey of course instructors. Professional School Psychology, 1, 51-59.

Overton (2012). Assessing learners with special needs: An applied approach (7th ed.). Upper Saddle River, NJ: Prentice Hall.

Ramos, Alfonso, V. C., & Schermerhorn, S. M. (2009). Graduate students’ administration and scoring errors on the Woodcock–Johnson III tests of cognitive abilities. Psychology in the Schools, 46, 650-657.

Schrank, F. A., Mather, & McGrew, K. S. (2014). Woodcock–Johnson IV Tests of Achievement (WJIV-ACH). Rolling Meadows, IL: Riverside.

Sherrets, Gard, & Langner (1979). Frequency of clerical errors on WISC protocols. Psychology in the Schools, 16, 495-496.

Slate, J. R., Jones, C. H., & Murray, R. A. (1991). Teaching administration and scoring of the Wechsler Adult Intelligence Scale-Revised: An empirical evaluation of practice administration. Professional Psychology: Research and Practice, 22, 375-379.

Slate, J. R., Jones, C. H., Murray, R. A., & Coulter (1993). Evidence that practitioners err in administering and scoring the WAIS-R. Measurement and Evaluation in Counseling and Development, 25, 156-161.

Styck, K. M., & Walsh, S. M. (2016). Evaluating the prevalence and impact of examiner errors on the Wechsler scales of intelligence: A meta-analysis. Psychological Assessment, 28, 3-17.

Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics (6th ed.). Boston, MA: Pearson.

Waugh, K. C., & Gronlund, N. E. (2013). Assessment of student achievement (10th ed.). Needham Heights, MA: Allyn & Bacon.

Wechsler (2009). Wechsler Individual Achievement Test (3rd ed., WIAT-III). New York, NY: Pearson.