Abstract
Theories of reading and writing development suggest that the factor structure of achievement batteries could change across development. As a result, it is important to test achievement batteries for invariance across development. The purpose of these analyses is to determine whether the factor structure of reading, writing, and oral language measures is invariant across grade ranges in the Kaufman Test of Educational Achievement, 3rd Edition. Results suggest that multiple interpretational models demonstrate measurement and structural invariance. Implications for practice are discussed.
Models of reading and writing development suggest that relationships between component reading/writing skills vary across development (e.g., Berninger, 1999; Hoover & Gough, 1990). Despite this variability, achievement batteries’ measurement models may not account for these developmental changes. It is possible that constructs measured by achievement batteries change across development, and it is therefore important to establish age/grade-based measurement invariance.
Reading/writing competence depends on both language skills and decoding/spelling, respectively. The simple view of reading conceptualizes reading comprehension as the product of word reading and listening comprehension skills (Hoover & Gough, 1990). Catts, Hogan, and Adlof (2005) reported that these skills account for more than 70% of comprehension variance. Gough, Hoover, and Peterson (1996) initially demonstrated how the effects of these skills change across development. In their meta-analysis, word reading’s correlation with reading comprehension decreased from .61 in first grade to .39 in college, whereas listening comprehension increased from .41 to around .60. Garcia and Cain (2014) confirmed this developmental shift in a meta-analysis of 110 different studies. They reported that language skills begin to be better comprehension predictors than word reading around age 10, when students begin to demonstrate fluent decoding. Similarly, Berninger (1999) stressed that transcription skills (spelling/handwriting) demonstrate effects on essay composition performance, alongside language skills related to idea generation. The magnitude of these effects changes across development, because age correlates positively with decoding and spelling. In a series of cross-sectional studies, Berninger (1999) and her colleagues (Berninger, Cartwright, Yates, Swanson, & Abbott, 1994; Berninger, Whitaker, Feng, Swanson, & Abbott, 1996; Berninger et al., 1992) reported a decreasing influence for transcription skill on both composition fluency and quality across ages. Transcription accounted for 66% and 25% of fluency and quality variance, in early grades, but only 16% and 18% in junior high school students. Collectively, these findings suggest that decoding/comprehension may represent an appropriate reading composite at young ages, but not in older students. Writing may demonstrate the same change as students get older.
However, other analyses suggest that these developmental effects may not change the structure of achievement batteries across age/grade levels. Independent exploratory factor analyses (EFAs) of achievement batteries often do not demonstrate clean separations between decoding/spelling measures and reading comprehension/writing tasks. For instance, higher order analyses with the two most recent Woodcock–Johnson batteries described broad reading/writing and language/knowledge factors (Dombrowski, 2015; Dombrowski, McGill, & Canivez, 2018; Dombrowski & Watkins, 2013). However, the relatively large age ranges in these analyses may mask changes in age groups.
This lack of separation between decoding/spelling and language skills may occur because the skill areas are not necessarily independent of each other. Strong decoding skills allow readers to allocate more cognitive resources to comprehension (Garcia & Cain, 2014; Vellutino, Tunmer, Jaccard, & Chen, 2007). Similarly, transcription skills may constrain language effects on writing composition (Hayes & Berninger, 2009). When young writers are allowed to dictate their thoughts, removing the constraints of transcription skills, they tend to produce better text as more working memory is available for composing (Berninger, 1999; De La Paz & Graham, 1995). Along these lines, researchers reported indirect effects of language/crystalized intelligence skills on reading comprehension, mediated through decoding skills, via structural equation modeling (Floyd, Meisinger, Gregg, & Keith, 2012; Hajovsky, Reynolds, Floyd, Turek, & Keith, 2014).
Factor Structure of the Kaufman Test of Educational Achievement, 3rd Edition (KTEA-3)
The purpose of these analyses is to determine whether these developmental shifts occur within the KTEA-3 (Kaufman & Kaufman, 2014a) to a degree that affects measurement invariance across grade levels. Part of the KTEA-3’s construct validity evidence included a series of confirmatory factor analyses (CFAs; Kaufman & Kaufman, 2014b). Those analyses modeled one to four factors fitted to a sample of examinees between the ages of 6 and 25. A four-factor model best fit the normative sample. The oral/written language portion of this model (see Figure 1) included an oral language factor (listening comprehension, oral expression, associational fluency), a reading factor (word reading and comprehension), and a written language factor (spelling and written expression). This model also required residual correlations between spelling/word reading, reading/listening comprehension, and written/oral expression measures to fit the sample well. These residual correlations reflect supplemental composites that are also included in the KTEA-3. Reading/listening comprehension measures reflect the comprehension composite, whereas written/oral expression skills represent the expression composite. A model based on these supplemental composites has not yet been tested.

Figure 1. KTEA-3 latent variable models.
Because the manual analysis included such a large age range, it might mask the aforementioned developmental changes. The manual model may fit younger age groups well, when reading comprehension is highly related to word reading, but misrepresent the academic skills of older examinees. Alternatively, the supplemental composites may not describe the skills of younger students as well as they might for older students, because listening and reading comprehension skills are more distinct in earlier grades (Garcia & Cain, 2014).
To determine whether these developmental trends affect the KTEA-3 factor structure, these analyses test measurement and structural invariance of KTEA-3 factors across grades. Tests of measurement invariance can determine whether the KTEA-3 factor structure accurately assesses its constructs to the same degree across grade levels, whereas tests of structural invariance can demonstrate whether aspects of measured constructs might differ across grade levels (Keith, 2015). These analyses included three different models. The first model was described in the KTEA-3 manual, whereas the second represents its inverse. In the second model, the manual model’s residual covariances are modeled as factors, whereas its factors are modeled as residual covariances. Finally, because the inverse model operationalizes the KTEA-3 supplemental composites slightly differently than the manual, a model with supplemental composites defined consistent with the manual is also included.
Method
Participants
Participants included examinees from the KTEA-3 age-based standardization sample in grades kindergarten through 12 (n = 1,727). The sample was stratified on demographic variables consistent with the U.S. Census Bureau’s 2012 estimate. Detailed information is provided in the battery’s technical manual (Kaufman & Kaufman, 2014b). Approximately half of these participants completed Form A (n = 865) and half completed Form B (n = 862), which were linked by equating studies during the standardization process (Kaufman & Kaufman, 2014b). The K-12 data set was subdivided into four grade bands: K-2 (n = 482), Grades 3 to 5 (n = 457), Grades 6 to 8 (n = 384), and Grades 9 to 12 (n = 404). These ranges were selected to create relatively large samples of participants in groups generally consistent with the structure of schooling in the United States (e.g., elementary, middle, high school).
Measures
These analyses included the reading, writing, and language measures in the core battery (Kaufman & Kaufman, 2014b). The Letter and Word Recognition subtest measures examinees’ ability to recognize letters as well as regular and irregular words. It demonstrates an average split-half reliability of .97 across pre-K through 12. The Reading Comprehension subtest requires young examinees to match words and short sentences to pictures. It also requires examinees to read passages and answer literal and inferential questions. It demonstrates an average split-half reliability of .88. The Written Expression subtest demonstrates an average split-half reliability of .86. It requires examinees to respond to a number of prompts for various writing skills presented in a storybook format. It also requires examinees to write an essay. The Spelling measure requires examinees to write words with regular and irregular patterns and has a split-half reliability of .95. Listening Comprehension requires examinees to listen to passages of formal speech and answer comprehension questions. Its format is similar to the Reading Comprehension task. Listening Comprehension demonstrates an average split-half reliability of .85. The Oral Expression measure requires examinees to describe a photograph and may specify target words examinees must include in their description. It demonstrates an average split-half reliability of .81. The Associational Fluency subtest requires examinees to quickly name category exemplars. It displays an average split-half reliability of .62. The Nonsense Word Decoding subtest measures examinees’ ability to read words that have no meaning, but conform to regular phonics patterns in the English language. It demonstrates an average split-half reliability of .96.
Model Development, Analyses, and Evaluation
These analyses included three models (Figure 1). The first was the same as the language-based portions of the four-factor model provided in the test manual (Kaufman & Kaufman, 2014b, Figure 2.1). The second represented its inverse. This model specified the manual model’s residual covariances as factors, and its factors as residual covariances. The Associational Fluency subtest was specified as part of the Expression factor, though it is important to note that it is not part of that factor as defined by the test manual (Kaufman & Kaufman, 2014b). The third factor included Spelling and Letter and Word Recognition, labeled Decoding (though this is not the same Decoding factor as in the test manual, which does not include Spelling). Because it uses the same subtests, the inverse model can be compared with the manual model, but it does not exactly reflect the Decoding and Expression composites provided by the KTEA-3. Thus, the third model removed the Associational Fluency subtest and replaced Spelling with the Nonsense Word Decoding subtest. These modifications allowed for a model that included the Comprehension, Expression, and Decoding composites provided by the battery.
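The relationship between the manual and inverse models can be sketched as a small data structure. The Python snippet below is only an illustration, not the author's script: it writes the manual model (as described in the introduction) and the inverse model's factors in lavaan-style syntax, using the subtest abbreviations from the table notes. The text does not enumerate the inverse model's residual covariances, so the helper `inverse_residual_covs` applies an assumed rule: every pair of subtests that shared a factor in the manual model becomes a residual covariance unless the pair already shares a factor in the inverse model. All function and variable names are hypothetical.

```python
from itertools import combinations

# Manual model: factors and residual covariances as described in the text.
MANUAL_FACTORS = {
    "Lang": ["LC", "OE", "AF"],
    "Read": ["LW", "RC"],
    "Write": ["SP", "WE"],
}
MANUAL_RESIDUAL_COVS = [("SP", "LW"), ("RC", "LC"), ("WE", "OE")]

# Inverse model: the manual model's residual covariances become factors.
INVERSE_FACTORS = {
    "COMP": ["RC", "LC"],
    "EXP": ["WE", "OE", "AF"],
    "DECODE": ["LW", "SP"],
}


def within_factor_pairs(factors):
    """Return every unordered pair of subtests that load on the same factor."""
    pairs = set()
    for indicators in factors.values():
        for a, b in combinations(indicators, 2):
            pairs.add(frozenset((a, b)))
    return pairs


def inverse_residual_covs(old_factors, new_factors):
    """ASSUMPTION (not stated in the article): pairs that shared a factor in
    the old model become residual covariances, unless the pair already
    shares a factor in the new model."""
    remaining = within_factor_pairs(old_factors) - within_factor_pairs(new_factors)
    return sorted(tuple(sorted(pair)) for pair in remaining)


def to_lavaan(factors, residual_covs):
    """Render factors (=~) and residual covariances (~~) as lavaan-style text."""
    lines = [f"{name} =~ {' + '.join(inds)}" for name, inds in factors.items()]
    lines += [f"{a} ~~ {b}" for a, b in residual_covs]
    return "\n".join(lines)


manual_syntax = to_lavaan(MANUAL_FACTORS, MANUAL_RESIDUAL_COVS)
inverse_covs = inverse_residual_covs(MANUAL_FACTORS, INVERSE_FACTORS)
inverse_syntax = to_lavaan(INVERSE_FACTORS, inverse_covs)

print(manual_syntax)
print()
print(inverse_syntax)
```

Under the assumed rule, the Reading and Writing factors of the manual model reappear as the `LW ~~ RC` and `SP ~~ WE` residual covariances. How the three-subtest oral language factor was carried over is a guess and should be checked against Figure 1.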
Data were analyzed with the lavaan package (Rosseel, 2012) in R (R Development Core Team, 2015), and included age-based standard scores. As an initial test of equivalency across grade levels, the subtest means, variances, and covariances were constrained to be equal across grade ranges, similar to Box’s M test (Keith, 2015). Next, the models were fitted to each grade range individually. Then a multigroup model was estimated to test configural invariance. Subsequent models sequentially constrained (a) unstandardized factor loadings, (b) manifest variables’ intercepts, and (c) residuals and residual covariances. To assess structural invariance, (a) factor variances, (b) covariances, and (c) means were constrained to be equal sequentially.
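The sequence of constraints described above is cumulative: each step keeps the equality constraints of the previous step and adds new ones. The Python sketch below lays out that ladder. The constraint labels are the keywords that lavaan accepts in its `group.equal` argument, but the mapping of each step to those keywords is an assumption of this sketch, because the author's actual lavaan calls are not shown in the article. The step names and the helper `invariance_ladder` are hypothetical.

```python
# Each tuple is (step name, constraints newly added at that step).
# Labels follow lavaan's `group.equal` keywords; the step-to-keyword
# mapping is an assumption, not the author's documented script.
MEASUREMENT_STEPS = [
    ("configural", []),
    ("metric (loadings)", ["loadings"]),
    ("intercepts", ["intercepts"]),
    ("residuals", ["residuals", "residual.covariances"]),
]
STRUCTURAL_STEPS = [
    ("factor variances", ["lv.variances"]),
    ("factor covariances", ["lv.covariances"]),
    ("factor means", ["means"]),
]


def invariance_ladder(steps):
    """Accumulate constraints so every model is nested in the one before it."""
    ladder = []
    active = []
    for name, new_constraints in steps:
        active = active + new_constraints
        ladder.append((name, list(active)))
    return ladder


ladder = invariance_ladder(MEASUREMENT_STEPS + STRUCTURAL_STEPS)
for name, constraints in ladder:
    print(f"{name:20s} group.equal = {constraints}")
```

Because each model is nested in its predecessor, adjacent rungs can be compared directly with a chi-square difference or a change in CFI. A partially invariant model would release selected constraints at one rung before moving to the next.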
Fit statistics were interpreted consistent with Keith’s (2015) guidelines. They include a chi-square test, where a high p value suggests adequate model fit; the root mean square error of approximation (RMSEA), where values lower than .08 reflect an adequate fit and lower than .05 reflect an excellent fit; the comparative fit index (CFI), where values higher than .95 reflect appropriate fit; and the standardized root mean square residual (SRMR), where values less than .08 suggest an appropriate fit. Nested models can be compared via a Δχ² as well as a ΔCFI, where a value less than .01 reflects no significant change in fit. Nonnested models can be compared via the Akaike information criterion (AIC), which favors the model with the lower value.
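As a rough illustration of how these indexes and cutoffs operate, the Python sketch below computes RMSEA and CFI from chi-square values using their standard single-group formulas and applies the cutoffs listed above. It is not the computation lavaan performs for multigroup models, which may apply additional adjustments. The chi-square inputs are made-up numbers for demonstration only, and all function names are hypothetical.

```python
import math


def rmsea(chi2, df, n):
    """Single-group RMSEA point estimate from chi-square, df, and sample size."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index relative to a baseline (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base


def label_rmsea(value):
    """Apply the RMSEA cutoffs described in the text (.05 and .08)."""
    if value < 0.05:
        return "excellent"
    if value < 0.08:
        return "adequate"
    return "poor"


def cfi_change_ok(cfi_less_constrained, cfi_more_constrained, cutoff=0.01):
    """True when adding constraints lowers CFI by less than the cutoff."""
    return (cfi_less_constrained - cfi_more_constrained) < cutoff


# Illustrative inputs only; these are NOT values reported in the article.
example_rmsea = rmsea(chi2=30.0, df=10, n=401)
example_cfi = cfi(chi2_model=30.0, df_model=10, chi2_base=1021.0, df_base=21)

print(round(example_rmsea, 3), label_rmsea(example_rmsea))
print(round(example_cfi, 3), example_cfi > 0.95)
print(cfi_change_ok(0.980, 0.975))  # small drop in CFI: constraints tenable
print(cfi_change_ok(0.980, 0.960))  # larger drop: invariance not supported
```

The ΔCFI comparison applies only to nested models, such as adjacent steps in the invariance sequence. Nonnested comparisons, such as the manual model against the inverse model, rely on AIC instead.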
Results
Data Screening and Missing Values
In the K-12 sample, values for subtests’ skewness and kurtosis indicated that each distribution was generally normal, ranging from –.49 to .13, and .25 to .90, respectively. Standard deviations (SDs) ranged from 14.58 to 15.49. There were 12 missing values in the K-12 sample (0.007%). Little’s (1988) MCAR test, as implemented by the BaylorEdPsych package (Beaujean, 2012), indicated that data were missing completely at random. Missing data were handled via full information maximum likelihood (FIML), as it makes use of all available data from each participant when estimating model parameters (Beaujean, 2014).
Model Testing
Table 1 lists fit statistics for the test of covariance equivalence across grade levels. Although strong fit values would indicate that the KTEA-3 measures similar constructs to a similar degree across grade ranges, this initial test for model fit was equivocal. The chi-square test was significant, a sign of poor model fit, and though chi-square may be influenced by the large sample size, the SRMR was also greater than the .08 cutoff for adequate model fit. Alternatively, at .078 and .956, the RMSEA and CFI suggested an adequate, though not excellent, fit.
Table 1. Manual and Inverse Model Fit Statistics.
Note. RMSEA = root mean square error of approximation; CFI = comparative fit index; AIC = Akaike information criterion; SRMR = standardized root mean square residual.
Manual model, partial residual invariance: released variance constraints on Reading Comprehension, Letter and Word Recognition, and Spelling subtests across grade ranges.
Inverse model, partial residual invariance: released variance constraints on Listening Comprehension, Reading Comprehension, and Letter and Word Recognition subtests across grade ranges.
Manual model
The manual model fit varied across grade ranges according to the fit indexes in Table 1. It fit the third- to fifth- and sixth- to eighth-grade ranges extremely well, and the K-2 and Grades 9 to 12 ranges adequately. Table 2 contains the factor loadings, variances, and covariances for the manual model across grade levels. A review of coefficients suggested a high level of consistency across groups. The residual variances of Letter and Word Recognition, Spelling, and Reading Comprehension appeared to vary across grade ranges, however, based on a review of unstandardized coefficients.
Table 2. Manual Model Factor Loadings, Covariances, and Variances Across Grade Ranges.
Note. Est = unstandardized estimate; Std = standardized; Read = Reading Composite; RC = Reading Comprehension; LW = Letter/Word Recognition; Write = Writing Composite; WE = Written Expression; SP = Spelling; Lang = Language Composite; LC = Listening Comprehension; OE = Oral Expression; AF = Associational Fluency. Bolded values represent p > .05.
The resulting model fit statistics for tests of invariance are also included in Table 1. The model demonstrated configural, metric, and intercept invariance across grade ranges, based on the pattern of fit and ΔCFI. Of course, intercept invariance would be expected, given that the KTEA-3 is normed so that the mean of each age/grade level is the same. The model was not invariant across grade ranges when subtest residuals and residual covariances were constrained across grade ranges. Because the variances of Letter and Word Recognition, Spelling, and Reading Comprehension appeared to differ across grade ranges, they were released to test a partially invariant model. Removing these constraints achieved partial residual invariance (see online supplement for additional results). It is important to highlight that some methodologists do not consider residual invariance to be a critical part of measurement invariance testing (Keith, 2015; Widaman & Reise, 1997). Models constraining latent factor variance, covariance, and means were all within acceptable fit.
Inverse model
The inverse model fit also varied across grade ranges according to fit indexes in Table 1. The pattern was the same as with the manual model. It fit well in the third- to fifth-grade and sixth- to eighth-grade ranges, and adequately in Grades K-2 and 9 to 12. Comparing manual/inverse model AIC values, it is interesting to note that the manual model fit better in Grades K-2, 3 to 5, and 6 to 8 ranges, but the inverse model fit better in the ninth- to 12th-grade range. Based on the coefficients provided in Table 3, in this model, Listening Comprehension’s loading on the Comprehension factor appeared to vary substantially across grade ranges. The residual variances for Reading Comprehension, Listening Comprehension, and Letter and Word Recognition also appeared to vary, based on a review of unstandardized coefficients.
Table 3. Inverse Model Factor Loadings, Covariances, and Variances Across Grade Ranges.
Note. Est = unstandardized estimate; Std = standardized; COMP = Comprehension Composite; RC = Reading Comprehension; LC = Listening Comprehension; EXP = Expression Composite; WE = Written Expression; OE = Oral Expression; AF = Associational Fluency; DECODE = Decoding Composite; LW = Letter/Word Recognition; SP = Spelling. Bolded values represent p > .05.
Model fit statistics for the inverse model’s tests of invariance also appear in Table 1. Just like the manual model, this model demonstrated configural, metric, and intercept invariance. Despite the apparent variability in Listening Comprehension’s factor loadings across grade ranges, constraining them to be equal within the metric invariance model did not degrade fit, according to the ΔCFI. The inverse model did not demonstrate invariance of residuals and residual covariances. Releasing the variances of Reading Comprehension, Listening Comprehension, and Letter and Word Recognition across groups resulted in partial residual invariance. Models constraining latent factor variance, covariance, and means were all within acceptable fit. Interestingly, comparing the fit measures of the manual and inverse models suggests that the manual model demonstrated a slightly better fit for measurement invariance tests, whereas the inverse model displayed a better fit for structural tests, though these differences appear minimal. See the online supplement for model factor loadings.
Supplemental model
The fit statistics for the supplemental model are listed in Table 4, and model coefficients are listed in Table 5. The Comprehension factor is the same as in the inverse model, and Listening Comprehension demonstrated the same apparent factor loading variability across grade levels. Listening Comprehension’s variance also demonstrated the greatest apparent variability across grades. In this model, the covariance between Letter and Word Recognition and Reading Comprehension (i.e., the Reading factor in the manual model) was nonsignificant in all grade ranges.
Table 4. Supplemental Model Fit Statistics.
Note. RMSEA = root mean square error of approximation; CFI = comparative fit index; AIC = Akaike information criterion; SRMR = standardized root mean square residual.
Partial residual invariance: released constraints on Listening Comprehension variance across grade ranges.
Partial structural invariance: released constraints on the covariance between Comprehension and Decoding across grade ranges.
Table 5. Supplemental Model Factor Loadings, Covariances, and Variances.
Note. Est = unstandardized estimate; Std = standardized; COMP. = comprehension composite; RC = reading comprehension; LC = listening comprehension; EXP. = expression composite; WE = written expression; OE = oral expression; DECODE = decoding composite; NWD = nonsense word decoding; LW = letter/word recognition; SP = spelling. Bolded values represent p > .05.
The supplemental model demonstrated configural, metric, and intercept invariance. As with the other models, its residuals and residual covariances were not invariant across grade level, based on the ΔCFI. This model required a release of the equality constraints on the variance of Listening Comprehension. Structurally, this model did not demonstrate invariance in covariances between factors. Releasing constraints on the covariance between Comprehension and Decoding allowed for an acceptable model fit. For this component of the model, the K-2 grade range demonstrated the greatest association between the two constructs, whereas the sixth- to eighth-grade range displayed the smallest association (see online supplement).
Discussion
Developmental effects within reading and writing domains (e.g., Berninger, 1999; Hoover & Gough, 1990) suggest that factors measured by achievement batteries could vary across age/grade ranges. For instance, because decoding is more strongly associated with comprehension in younger students, and language skills are more strongly associated with comprehension in older students, loading of reading comprehension and word reading performance (or reading and listening comprehension) on a common factor could change across grade levels. The purpose of these analyses was to determine whether there were developmental effects within the KTEA-3 by assessing measurement and structural invariance across grade levels for three different models. These models included the oral/written language portions of the manual model, its conceptual inverse, and a model that included the KTEA-3 comprehension, expression, and decoding supplemental factors.
Although there were potentially small fit differences across grade ranges, collectively, these results suggest that the measurement properties of the KTEA-3 are generally invariant across grade ranges. Consistent with the reviewed changes in reading and writing development, the inverse model demonstrated a slightly better fit in the ninth- to 12th-grade range, and the supplemental model fit that grade range extremely well. However, when equality constraints were added across grade levels, this difference in model fit did not produce noninvariance. Generally, the only developmental differences observed in these analyses involved subtests’ residuals. As Keith (2015) explained, residual invariance is not considered critical in measurement invariance.
Structurally, both the manual and inverse models were also invariant. The factors described in these models demonstrated the same amount of variance across grade levels and are associated with each other to the same degree across groups. Factor means were also invariant, though this should be expected, as the KTEA-3 is normed explicitly so that mean scores are the same across age/grade. The supplemental model demonstrated variability in the relationship between the decoding and comprehension factors across grade ranges.
Implications for Practice
Clinicians can be comfortable using the KTEA-3 across grade levels and interpreting its core academic composites, its supplemental cross-domain composites related to comprehension and expression, and its decoding composite. These results indicated that the composites reflect the same skills across grade levels.
These analyses not only suggest that both sets of composites might be useful to interpret in clinical practice but also indicate that interpretational caution is needed. The small fit difference between the models and the need for correlated residuals indicate that subtests could contain systematic variance for multiple abilities. Reading Comprehension performance may reflect both a general reading ability and a comprehension ability. Spelling may reflect both general writing skill and decoding. These additional abilities, reflected in the supplemental composites, underscore Berninger and colleagues’ (2006) insight into the artificial distinction between language, reading, and writing. Clinicians may find it challenging to determine the degree to which these abilities affect examinee performance on a measure. A model of reading assessment, such as that offered by Kilpatrick (2015), might capitalize on multiple abilities.
Limitations and Future Directions
These findings should be interpreted in light of a number of limitations. First, only three models were tested. Other models may also provide a strong fit to the KTEA-3 standardization sample. For instance, the large correlations between latent factors suggest the presence of a higher order factor. Second, these models only include a subset of the subtests included with the KTEA-3. These results might change if additional subtests, such as Reading Vocabulary, were also included in the model. These additional subtests were omitted here because they were not part of the manual model. Third, results may differ in other samples, such as a sample of students with disabilities. If these analyses were replicated with students with word reading disabilities, fit differences between these models may be greater across development. Fourth, as an anonymous reviewer hypothesized, if grade bands were grouped based on the raw scores where item types change, or where there are large differences between one grade and another, it might alter these results. These natural breaks in raw scores may be masked when converted to standard scores.
Next steps for research could include alternative analyses of the abilities measured by the KTEA-3, and replication in diverse groups. Because the KTEA-3 includes multiple new subtests, many of which were not included in the manual factor analysis, and because these subtests appear to measure multiple abilities, an EFA may provide additional insight into the abilities measured by the KTEA-3.
Conclusion
The KTEA-3 is a strong measure of academic achievement. The results presented here suggest that a subsection of its reading, writing, and oral language measures may reflect multiple abilities, though their measurement is invariant across grade ranges. Clinicians can feel comfortable that the interpretation of these scores is similar across development. This information could be bolstered by additional exploratory analyses that include the full battery of subtests.
Footnotes
Author’s Note
Jason R. Parkin, Department of Educational Psychology, University of Washington. Standardization data from the Kaufman Test of Educational Achievement, 3rd Edition (KTEA-3). Copyright © 2014 NCS Pearson, Inc. Used with permission. All rights reserved.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplementary material is available for this article online.