Abstract
This study examines the structural validity of the 2019 Teacher MKT assessment using empirical data from a sample of 645 elementary educators in Florida (USA). The examination of structural validity was supported through a series of complementary analyses, including evaluation of the dimensionality of the underlying latent trait, item performance, reliability, and the standard error of measurement. Evidence from dimensionality analysis pointed to a borderline unidimensional structure, implying that one factor or dimension may be sufficient for measuring the underlying construct. Based on the empirical results, the 2019 Teacher MKT appears to have adequate reliability and provides useful information about a relatively wide range of teacher abilities.
Keywords
Scholars have long conjectured that teachers’ subject-matter knowledge supports effective teaching. Often split into the conceptually distinguishable constructs of content knowledge (CK) and pedagogical content knowledge (PCK), teachers’ subject-matter knowledge garners much attention in education research (Shulman, 1986). Applying the concepts of CK and PCK to mathematics, Ball et al. (2008) developed a widely accepted theoretical framework for conceptualizing teacher subject-matter knowledge in mathematics, which they named Mathematical Knowledge for Teaching (MKT). MKT features prominently in the theory of change for many mathematics teacher professional development (PD) programs (Banilower & Smith, 2006; Clarke & Hollingsworth, 2002; Schoen et al., 2024). As such, teacher MKT is often featured as an outcome of interest in evaluations of teacher PD programs.
The Knowledge for Teaching Early Elementary Mathematics (K-TEEM) test was developed to identify the aspects of MKT most relevant to teaching mathematics at the early elementary level and to measure this knowledge in ways that may relate to how teachers put it into effect in the classroom (Schoen et al., 2017). Over the past decade, the K-TEEM has been administered to thousands of teachers as an outcome measure in evaluations of the impact of teacher education programs for both practicing and prospective teachers.
Flake et al. (2017) outline a framework for conceptualizing types of validity evidence for construct validation wherein evidence can be organized into three phases of construct validation: substantive, structural, and external. Using empirical data from a sample of 645 elementary educators, we describe an approach taken to examine the structural validity of the K-TEEM as part of the ongoing examination of the validity argument. We allude to substantive and external aspects of validity in this brief report; however, the present study focuses on aspects of structural validity. Specifically, this investigation focuses on item analysis, dimensionality analysis, and reliability.
Background
MKT
In their conceptualization of MKT, Ball et al. (2008) subdivided the two major domains of subject-matter knowledge and PCK into six subdomains. PCK contains three subdomains: knowledge of content and students, knowledge of content and teaching, and knowledge of content and curriculum. Knowledge of subject matter also contains three subdomains: common content knowledge, specialized content knowledge, and horizon content knowledge. Using test forms developed as part of the Learning Mathematics for Teaching (LMT; 2004) project, scholars have reported results of empirical studies that found positive associations among educator MKT and participation in professional development opportunities (Hill & Ball, 2004), MKT and instructional quality (Hill et al., 2012), and MKT and student learning (Charalambous et al., 2020; Hill et al., 2005). Investigations into the dimensionality of the latent trait(s) measured by the LMT test forms and other measures of MKT have yielded different conclusions regarding dimensionality (Charalambous et al., 2020; Hill, Schilling, & Ball, 2004a, 2004b; Izsák et al., 2019).
Why Focus on Early Elementary Mathematics?
Numerous empirical studies have shown that early mathematics achievement is a robust predictor of later mathematics achievement. Consequently, education policies in the United States are shifting to focus more on teacher preparation and certification systems for pre-kindergarten through third-grade teachers. Ambitious teaching, which aims for teachers to engage students in problem solving, reasoning, and sensemaking with multiple representations of mathematical ideas, classroom discourse, responsiveness to student thinking, and more, places a higher bar on teacher MKT than the traditional approach to teaching. Early elementary teachers are expected to teach young children about counting, the full range of word problems, place value in the base-ten number system, and the fundamental laws of the four basic mathematical operations and equality. Early elementary teachers are not expected to teach differentiation or integration of functions, polynomial equations, proportional reasoning, or even operations on fractions. Most of the existing measures of MKT focus on content that is taught in the intermediate and middle grades, such as fractions and proportional reasoning. It is reasonable to think that measures of MKT for teaching at the early elementary level should focus on the content that is most proximal to the content taught at those grade levels. For these reasons and more, there is a need for measures of MKT at the early elementary level and validation studies of measures designed to measure that type of knowledge.
K-TEEM
Table 1. Test Blueprint for the 2019 K-TEEM Test Form and Final Scale
Note. Abbreviations for domains and subdomains are provided in parentheses. Individual items are named using these abbreviations with the form: subdomain_item#. One selected-response item (ISS_1) was removed during the data analysis phase.
Methods
Sample and Setting
Self-Reported Characteristics of Participating Educators (n = 645) in the 2019 First Administration K-TEEM Field Test
Note. Proportions may not sum to 1 due to rounding.
a The Other category includes mathematics specialists and coaches, resource teachers, teachers of multiple grade levels, and 1 examinee coded as “School Administrators.”
Data Inclusion
The K-TEEM developers established a criterion requiring examinees to complete at least 70% of the assessment for their scores to be considered valid. Consequently, any participant with more than 30% missing responses was excluded from the dataset. For the remaining cases, missing item-level responses were treated as incorrect and coded as zero. Four of the 649 participants in the original data file responded to fewer than 25% of the 32 items in the K-TEEM test. The records for those participants were removed in accordance with this protocol, leaving the analytic sample of 645.
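The inclusion rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' replication code: the function names, the use of `None` for a missing response, and the toy records are all hypothetical.

```python
from typing import Optional

MIN_COMPLETION = 0.70  # criterion stated in the text: at least 70% of items answered


def completion_rate(responses: list[Optional[int]]) -> float:
    """Share of items with a recorded response (None marks a missing response)."""
    answered = sum(1 for r in responses if r is not None)
    return answered / len(responses)


def apply_inclusion_rule(
    records: dict[str, list[Optional[int]]],
) -> dict[str, list[int]]:
    """Drop examinees below the completion criterion; code remaining missing items as 0."""
    kept: dict[str, list[int]] = {}
    for examinee_id, responses in records.items():
        if completion_rate(responses) >= MIN_COMPLETION:
            kept[examinee_id] = [0 if r is None else r for r in responses]
    return kept


# Toy records with 10 items each (hypothetical, not study data).
toy = {
    "t01": [1, 0, 1, 1, 0, 1, 1, None, None, None],  # 70% complete: retained
    "t02": [1, 1, None, None, None, None, None, None, None, None],  # 20%: dropped
}
cleaned = apply_inclusion_rule(toy)
```

Treating omitted responses as incorrect is a scoring decision rather than a statistical necessity, so the recoding is kept in one place where it is easy to change.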
Instrument
Beginning in 2014, multiple K-TEEM test forms have been designed, developed, and used to measure teacher MKT (Schoen et al., 2017). The 2019 K-TEEM form was used in the current study. The 2019 K-TEEM form was similar to the 2016 K-TEEM form (Schoen et al., 2019), differing only in that items not contributing to the final score for the 2016 K-TEEM test form were omitted when constructing the 2019 test form.
The 2019 K-TEEM test form consisted of 32 dichotomously coded items. The test form consisted of 29 selected-response items and three constructed-response items (Table 1). One selected-response item (ISS_1) was removed during data analysis, resulting in a total of 31 items contributing to the final score (see the Results section for more details).
Three items (i.e., EE_1, SMW_6, and ES_3) were item sets, each consisting of a single item stem and several related subitems. Responses from the subitems in each item set were collapsed into a single dichotomous rating of correct or incorrect for each of the three items using a pre-determined scoring guide. For example, item EE_1 included five dichotomous subitems. Examinees who responded correctly to all five subitems were judged to have responded correctly, which resulted in a value of 1 in the variable for EE_1. If they responded incorrectly to any of the five subitems, they were assigned a 0 for the corresponding value of the EE_1 variable for that examinee. The same all-or-nothing scoring procedure was used for the other two item sets (i.e., SMW_6 and ES_3).
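The all-or-nothing rule for item sets amounts to a logical AND over the subitems. The sketch below assumes subitems are already coded 0/1; the function name is hypothetical and the scoring guide itself is not reproduced here.

```python
def score_item_set(subitem_scores: list[int]) -> int:
    """Collapse dichotomous subitems into one score: 1 only if every subitem is correct."""
    if not subitem_scores:
        raise ValueError("an item set needs at least one subitem")
    return int(all(s == 1 for s in subitem_scores))


# EE_1 has five subitems according to the text; these response patterns are made up.
ee_1_all_correct = score_item_set([1, 1, 1, 1, 1])
ee_1_one_wrong = score_item_set([1, 1, 0, 1, 1])
```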
The 2019 K-TEEM was administered through the web-based Qualtrics platform. Responses to individual items were required before advancing to the next item. Examinees were not permitted to return to previously seen items.
Analysis
The structural validity was examined through a series of complementary analyses. To examine the dimensional structure, both parallel analysis (PA) and exploratory factor analysis (EFA) were conducted. We conducted the EFA using weighted least squares mean and variance adjusted (WLSMV) estimation based on tetrachoric correlations. An oblique geomin rotation was applied to permit correlated factors, which is the more realistic structure in most social science research. Multiple criteria were used to decide the number of factors to retain, including model-fit indices, the PA scree plot, the RMSEA difference between factor solutions, the proportion of the total variance explained by the eigenvalues, and the interpretability of the factor solution. Tetrachoric correlation coefficients were also estimated to assess the degree of interrelatedness among items.
Item-level performance was evaluated using indices derived from both classical test theory (CTT) and item response theory (IRT), allowing for a multifaceted understanding of item functioning. Reliability estimates were reported for both summed scores—using coefficient alpha and nonlinear structural equation modeling reliability—and for response pattern scores via marginal reliability. The analyses did not strictly adhere to a one-way sequence, as the findings from each analytic phase were intended to inform and refine interpretations across phases. These analyses were conducted in Mplus, flexMIRT, and R. The source data, replication code, and related output for these analyses are available at https://osf.io/ga5zc/.
Results
Initial Analysis
Item ISS_1 was removed from the final test due to its negative correlation with several other items and its low corrected item-total correlation.
Dimensionality
Parallel analysis (PA) indicated the presence of two components; however, the scree plot revealed that the first component was clearly distinct from the others, suggesting a single dominant factor, while the second component appeared to be borderline (Figure 1). In addition, 23% of the total variance was explained by the first eigenvalue, while the second eigenvalue accounted for 6% of the total variance.
Figure 1. Horn’s parallel analysis
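The retention logic behind parallel analysis and the variance proportions reported above can be stated compactly. This is a generic statement of Horn's procedure for a correlation matrix of p items, offered as a sketch; the symbols are introduced here for illustration, and the reference quantile used in the study is not stated in the text.

```latex
\[
\text{retain component } k \iff \lambda_j^{\text{obs}} > \lambda_j^{\text{ref}} \ \text{ for all } j \le k,
\qquad
\text{proportion of variance}_k = \frac{\lambda_k^{\text{obs}}}{p}
\]
```

Here \(\lambda_k^{\text{obs}}\) is the k-th eigenvalue of the observed correlation matrix and \(\lambda_k^{\text{ref}}\) is the mean (or a chosen upper percentile) of the k-th eigenvalue across random datasets with the same number of examinees and items. The proportion follows because the eigenvalues of a p × p correlation matrix sum to p.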
The one-factor and two-factor solutions from the EFA were examined. The two-factor solution did not clearly align with the MKT theoretical framework used in the test blueprint. The root mean square error of approximation (RMSEA) was 0.035 for the one-factor solution and 0.027 for the two-factor solution, yielding a difference of only 0.008. The standardized root mean square residual (SRMR) was 0.075 for the one-factor solution and 0.065 for the two-factor solution. For both solutions, the two indices fall within the commonly accepted cut-off values of 0.060 for RMSEA and 0.080 for SRMR proposed by Hu and Bentler (1999), indicating acceptable fit for both models. However, given that these thresholds are guidelines rather than absolute standards, further evidence was sought to substantiate the appropriateness of the factor structure.
According to Finch’s (2020) criterion, which considers a difference of ≥0.015 in RMSEA as indicative of meaningful model improvement, the observed difference of 0.008 in RMSEA for the one- and two-factor solutions does not justify the added complexity of a two-factor solution. The results of PA and EFA collectively support the retention of the more parsimonious one-factor model.
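The comparison reduces to a single subtraction against a threshold. The sketch below uses the RMSEA values and the 0.015 criterion as reported in the text; the function names are hypothetical, and rounding to three decimals is an assumption made to match the reported precision.

```python
RMSEA_DIFF_THRESHOLD = 0.015  # criterion attributed to Finch (2020) in the text


def rmsea_improvement(rmsea_fewer_factors: float, rmsea_more_factors: float) -> float:
    """Drop in RMSEA when one factor is added, rounded to the reported precision."""
    return round(rmsea_fewer_factors - rmsea_more_factors, 3)


def extra_factor_justified(
    rmsea_fewer_factors: float,
    rmsea_more_factors: float,
    threshold: float = RMSEA_DIFF_THRESHOLD,
) -> bool:
    """True when the improvement meets or exceeds the threshold."""
    return rmsea_improvement(rmsea_fewer_factors, rmsea_more_factors) >= threshold


diff = rmsea_improvement(0.035, 0.027)          # one-factor vs. two-factor RMSEA
keep_two = extra_factor_justified(0.035, 0.027)
```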
Item Performance and Test Reliability
In the one-factor model, standardized factor loadings for the K-TEEM test items ranged from 0.225 to 0.629, and all the loadings were statistically significant. The magnitude and significance of these loadings provide some empirical support for the construct validity of the items in capturing the underlying latent dimension.
Further analysis, utilizing a two-parameter logistic (2PL) item response theory (IRT) model, revealed a satisfactory model fit without convergence issues. Item discrimination parameters ranged from 0.411 to 1.353, while difficulty estimates spanned from −2.477 to 1.988, indicating a diverse range of item functioning across the ability spectrum. The marginal reliability of the IRT-based ability estimates was calculated as 0.810. According to the test information function (Figure 2), the assessment yielded the greatest precision for individuals whose ability levels were centered around the mean.
Figure 2. Test information and conditional standard error of measurement
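To show how discrimination and difficulty parameters translate into test information and conditional standard error, the sketch below implements the textbook 2PL formulas. It assumes the discrimination-difficulty parameterization with no scaling constant, which may differ from the parameterization reported by the estimation software. The item parameters are hypothetical values chosen inside the reported ranges, not the estimated K-TEEM parameters.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of one 2PL item: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def total_information(theta: float, items: list[tuple[float, float]]) -> float:
    """Test information is the sum of the item informations."""
    return sum(item_information(theta, a, b) for a, b in items)


def conditional_sem(theta: float, items: list[tuple[float, float]]) -> float:
    """Conditional standard error of measurement: 1 / sqrt(test information)."""
    return 1.0 / math.sqrt(total_information(theta, items))


# Hypothetical (a, b) pairs for illustration only.
toy_items = [(0.5, -2.0), (1.0, 0.0), (1.3, 1.5)]
sem_near_mean = conditional_sem(0.0, toy_items)
sem_upper_tail = conditional_sem(4.0, toy_items)
```

Because each item is most informative near its own difficulty, a form whose difficulties cluster around the mean will show larger conditional standard errors in the tails, which is the pattern described for Figure 2.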
Complementary item-level analyses based on classical test theory were also conducted. Item difficulty indices ranged from 0.170 to 0.810, and corrected item-total correlations fell between 0.152 and 0.413, suggesting an adequate range of item performance. Coefficient alpha for the 2019 administration of the K-TEEM test was estimated at 0.800, with a corresponding standard error of measurement of 2.476. Given the well-documented limitations of coefficient alpha, particularly its reliance on the assumptions of tau-equivalence and uncorrelated errors, the inclusion of the model-based reliability index adds robustness to the reliability evidence of the test.
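Coefficient alpha and the classical standard error of measurement can be computed directly from a scored response matrix. The sketch below uses the standard formulas with sample variances and a made-up 4 × 3 response matrix; the function names are hypothetical, and the implied standard deviation assumes the reported SEM was obtained from the classical formula SD × sqrt(1 − reliability), which the text does not confirm.

```python
import math
from statistics import variance


def cronbach_alpha(scores: list[list[int]]) -> float:
    """Coefficient alpha from rows of item scores (one row per examinee)."""
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)


def standard_error_of_measurement(sd_total: float, reliability: float) -> float:
    """Classical SEM: SD of summed scores times sqrt(1 - reliability)."""
    return sd_total * math.sqrt(1.0 - reliability)


# Made-up responses: 4 examinees by 3 dichotomous items.
toy_scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
alpha = cronbach_alpha(toy_scores)

# Summed-score SD implied by the reported alpha (0.800) and SEM (2.476),
# under the assumption stated above: roughly 5.5 raw-score points.
implied_sd = 2.476 / math.sqrt(1.0 - 0.800)
```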
Discussion
The 2019 K-TEEM test appears to have adequate reliability and provides useful information about a relatively wide range of teacher abilities, at least for this sample of 645 elementary educators. Standard error of measurement and test information provide valuable information about the range of abilities wherein the instrument provides reasonably precise ability estimates, and that range appears to align to more than 95% of the sample.
The 2019 K-TEEM has less precision in the tails of the distribution (e.g., for educators with ability levels more than two standard deviations below the mean or more than 1.5 standard deviations above the mean). If the intent of K-TEEM was to generate precise measures of educator abilities for those in the upper tail of the distribution, the K-TEEM test form may need to be modified, possibly by adding more items with higher difficulty estimates.
Program evaluators and researchers using K-TEEM and other measures to study MKT and its correlates must make practical decisions to balance considerations about theory, measurement quality, and practical applications. Every assessment containing multiple items has some amount of multidimensionality. Determining when the multidimensionality is so extensive that it cannot be ignored in the data modeling process can be challenging in practice. In practical measurement applications, such as when assessments are used to estimate the impact of educational interventions, some of the big questions include (1) is it reasonable (and useful) to assume unidimensionality of the latent construct being measured by the assessment, and (2) how much multidimensionality can be tolerated before it impacts the scores and their interpretation?
Using various model diagnostics, we examined whether it is reasonable to model the 2019 K-TEEM data using a unidimensional or multidimensional model. The evidence from dimensionality analyses suggested a borderline unidimensional structure. Guided by the principle of parsimony and driven by the need to make a practical decision, we conclude that it is reasonable to assume that the K-TEEM measures an essentially unidimensional latent trait. We also acknowledge that this decision ultimately relies on clinical judgment and leaves us with a sense of unease.
Limitations and Future Directions
The findings reported here are limited in that they are most relevant for this one K-TEEM test form and this sample of educators. Next steps in evaluating the validity argument for K-TEEM may include further investigating these elements of structural validity with a larger sample and examining measurement invariance. Future research with K-TEEM is also needed to investigate aspects of external validity, such as whether the K-TEEM scores can be used to detect treatment effects or are associated with instructional practice or student learning outcomes.
This finding of borderline unidimensionality adds yet another example to the ongoing discussion about the MKT theoretical framework and whether the latent traits measured by various assessments of MKT are multidimensional or unidimensional (Charalambous et al., 2020; Copur-Gencturk et al., 2018; Hill et al., 2004a, 2004b; Izsák et al., 2019). More empirical study is needed to further explore and reconcile the MKT theoretical framework with the dimensionality of the latent trait(s) measured by K-TEEM (and other measures of MKT).
Footnotes
Acknowledgments
In her role as project manager, Amanda Tazaz coordinated the informed consent and data collection process. Kristy Farina and Nancy Donahue assisted with data cleaning. Gizem Solmaz-Ratzlaff assisted with preparing the data and related files for sharing through the data repository. The authors are grateful to the educators who participated in the study and made this research possible.
Ethical Considerations
The study involving human participants was reviewed and approved by the Institutional Review Board of Florida State University (FWA No. 00000168, STUDY00000009).
Consent to Participate
Written informed consent to participate in this study was provided by the participants.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Data were collected with financial support from the U.S. Department of Education Supporting Effective Educator Development program through grant award number U423A180115. The opinions expressed are those of the authors and do not represent the views of the U.S. Department of Education.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The data that support the findings of this study are openly available in Open Science Framework at https://doi.org/10.17605/OSF.IO/GA5ZC (Schoen et al., 2022).
