Abstract
Improving teacher selection is an important strategy for strengthening the quality of the teacher workforce. As districts adopt commercial teacher screening tools, evidence is needed to understand these tools’ predictive validity. We examine the relationship between Frontline Education’s TeacherFit instrument and newly hired teachers’ outcomes. We find that a 1 SD increase on an index of TeacherFit scores is associated with a 0.06 SD increase in evaluation scores. However, we also find evidence that teachers with higher TeacherFit scores are more likely to leave their hiring schools the following year. Our results suggest that TeacherFit is not necessarily a substitute for more rigorous screening processes that are conducted by human resources officials, such as those documented in recent studies.
Introduction
An extensive literature documents substantial variation in teacher quality, and evidence shows that teacher quality impacts students’ long-run outcomes (see Jackson et al., 2014 for a review). An important strategy for strengthening the quality of the teacher workforce is to identify and hire the applicants who will be most effective in the classroom. Recent studies suggest that information gathered during a district’s application-screening process could be compiled to make predictions about teacher effectiveness. Goldhaber et al. (2017) find that scores from teacher selection rubrics used to rate applicants in Spokane Public Schools predict teacher value-added to test scores and teacher retention. Using data from Washington, D.C., Jacob et al. (2018) find that applicants’ background measures (e.g., undergraduate GPA) and scores on screening measures are predictive of teachers’ evaluation scores. Bruno and Strunk (2019) use data from the Los Angeles Unified School District to show that scores from the district office’s standardized screening system are predictive of teacher impacts on test scores, evaluation scores, and attendance. Sajjadiani et al. (2019), applying machine learning techniques to data from the Minneapolis Public School District, find that the relevance of applicants’ work experience and their attributions for leaving past jobs predict teacher performance (student evaluations, observation scores, and value-added to test scores) and turnover.
These recent studies demonstrate promising ways in which information from the time of application can be harnessed to select better teachers. However, in these studies, districts’ central office human resources officials play a large role in prescreening and scoring applications. This systematic collection and scoring of applicant information for use in hiring decisions may be a barrier in some school districts. Prior research suggests that district central offices vary in how they support schools in recruiting and screening teacher applicants, and principals vary in the degree to which they use data in their hiring decisions (Cannata et al., 2017). The teacher hiring process may even be rushed and information-poor (Liu & Johnson, 2006).
To overcome this limitation, school districts across the United States are turning to the private sector for new teacher screening tools that claim to use “big data” to help identify better teachers from the pool of teacher applicants (Flanigan, 2016; Simon, 2014). Commercial screening instruments, such as the Haberman Star Teacher PreScreener and Gallup’s Teacher Perceiver Interview, have existed for decades, and limited evidence suggests that scores from these two tools are modestly related to teacher performance (Metzger & Wu, 2008; Rockoff et al., 2011; see Supplementary Appendix A in the online version of the journal for additional background). However, additional commercial screening instruments have arrived on the market, boasting data-driven screening scores (Simon, 2014). Many districts currently pay firms to use these screening instruments, which typically include assessments that applicants take while completing online teacher job applications (Simon, 2014). Despite the growing popularity of such commercial screening tools, there is limited evidence on whether these tools are effective in predicting teacher performance.
To extend this literature, we study the extent to which applicants’ scores from a “big data” commercial screening tool, Frontline Education’s TeacherFit instrument, predict teacher outcomes in a large U.S. school district. Specifically, we ask: Are the results from the TeacherFit screening tool predictive of teachers’ evaluation scores, absences, retention in the same school/district, and impacts on student test scores?
We use data from the Wake County Public School System (WCPSS) in North Carolina, the 14th largest school district in the nation (de Brey et al., 2021). Beginning in January 2016, WCPSS required teacher applicants to take the TeacherFit assessment to submit their applications. We study the predictive validity of applicants’ TeacherFit scores among the new teacher applicant pool for WCPSS in school years 2016–2017 and 2017–2018. We find that a one standard deviation increase on an index of TeacherFit scores is associated with a 0.06 standard deviation increase in the evaluation scores that teachers receive from principals. In addition, we find evidence that teachers with higher TeacherFit scores are more likely to leave their hiring schools after the first year. We do not find a significant relationship between TeacherFit scores and value-added to either math or English Language Arts (ELA) test scores. To alleviate concerns of bias from sample selection, we estimate a Heckman selection model and find that sample selection–corrected estimates are similar to our estimates without corrections.
Setting, Data, and Measures
Frontline Education is a school administration software provider whose broad portfolio of products reaches over 10,000 clients (Frontline Education, 2021). WCPSS began using Frontline’s Web-based recruiting platform, Frontline Recruiting and Hiring, to manage job postings and applications in the summer of 2015. Beginning in January 2016, WCPSS required teacher applicants to take Frontline’s screening assessment, TeacherFit, to submit their applications via the district’s site on Frontline’s online platform. Frontline Education’s website states that their assessments are “driven by university backed research” and “leverage ‘machine learning’” (Grunwell, 2016). According to a sales webinar, the assessment creators developed the items based on interviews with subject matter experts, analysis of job descriptions, and reviews of research, followed by testing of the items on teacher and school employees (Reese, 2018). “Hundreds of thousands” of applicants have taken the assessment since it became available in 2008 (Frontline Education, 2022). However, to our knowledge, the extent to which scores from the TeacherFit instrument can identify effective teachers has not yet been documented in peer-reviewed literature.
The TeacherFit assessment, which claims to help “identify outstanding teachers” (Grunwell, 2016), takes approximately 20 to 30 minutes, does not allow for blank responses, and does not have obvious “correct” or “incorrect” answers. Rather, the items attempt to assess applicants’ attitudes, beliefs, habits, and personality traits by requiring applicants to address situational prompts and attitudinal statements by selecting Likert-type scale responses. Following the assessment, Frontline constructs the applicants’ scores and makes them available to WCPSS administrators and school principals on the online hiring dashboard. Each candidate receives an overall score and separate scores on each of the six dimensions: Fairness & Respect, Concern for Student Learning, Adaptability, Communication & Persuasion, Planning & Organizing, and Cultural Competence. Each score falls on a 1-to-9 scale. (See Supplementary Appendix A in the online version of the journal for additional background on TeacherFit.)
While applicants’ scores were made available to administrators on the hiring platform, school principals did not receive strict or explicit guidance from the WCPSS Human Resources office on how to use or interpret the scores, in keeping with a tradition of decentralized hiring in WCPSS. While principals received communication that TeacherFit scores could help identify strong candidates, they also received messaging that the TeacherFit assessment is only one part of the hiring process, and they were free to pursue candidates who did not score well. (See Supplementary Appendix B in the online version of the journal for additional background on teacher hiring.)
We use application and administrative data to study the new teacher applicant pool for school years 2016–2017 and 2017–2018. We link teacher applicants’ scores and application information to WCPSS administrative data, which includes teacher characteristics, assignments, evaluation scores, absences, and links to students. Individuals who are not previously observed as WCPSS teachers are included in the new teacher applicant pool if they (a) submitted teacher applications in the calendar years 2016 and 2017 and/or (b) are newly hired teachers in 2016–2017 and 2017–2018. The applicant pool includes 12,548 individuals, of whom 2,367 are observed as newly hired teachers in either 2016–2017 or 2017–2018. However, TeacherFit scores are missing for 8% of the applicant pool. Therefore, 11,491 individuals, of whom 2,104 are observed as new hires across 184 schools, are eligible for inclusion in the analyses below.
Table 1 provides summary statistics of the TeacherFit scores for the individuals in the teacher applicant pool with non-missing scores (N = 11,491). The TeacherFit overall score (1–9 scale) has a mean of 6.04 (SD = 1.80), while the means of the scores on the six dimensions range from 5.23 (SD = 2.08) in Cultural Competence to 6.12 (SD = 1.82) in Adaptability. In the analyses below, we use a TeacherFit index score, which we construct by summing the scores for the six dimensions and then standardizing to have a mean of 0 and unit standard deviation.
Table 1. Summary Statistics of Scores for the New Teacher Applicant Pool
Notes. The new teacher applicant pool includes 12,548 individuals who submitted teacher applications in the calendar years 2016 and 2017 and/or are newly hired teachers in 2016–2017 and 2017–2018.
**p < .01. ***p < .001.
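The index construction described before Table 1 (summing the six dimension scores, then standardizing) can be sketched as follows. This is a minimal illustration, not the authors' code: the dictionary keys, the three applicants, and the choice of the population standard deviation are all assumptions, since the text does not specify them.

```python
from statistics import mean, pstdev

# Hypothetical field names for the six TeacherFit dimensions.
DIMENSIONS = (
    "fairness_respect",
    "concern_for_student_learning",
    "adaptability",
    "communication_persuasion",
    "planning_organizing",
    "cultural_competence",
)

def teacherfit_index(applicants):
    """Sum the six dimension scores per applicant, then z-score the sums."""
    totals = [sum(a[d] for d in DIMENSIONS) for a in applicants]
    mu = mean(totals)
    sigma = pstdev(totals)  # assumption: population SD; the text does not say which
    return [(t - mu) / sigma for t in totals]

# Made-up scores on the 1-to-9 scale for three hypothetical applicants.
applicants = [
    dict(zip(DIMENSIONS, (5, 6, 7, 5, 6, 4))),
    dict(zip(DIMENSIONS, (7, 7, 8, 6, 7, 6))),
    dict(zip(DIMENSIONS, (3, 4, 5, 4, 5, 3))),
]
index = teacherfit_index(applicants)
```

By construction the resulting index has mean 0 and standard deviation 1 within the sample it is computed on, so regression coefficients on it read as changes per one standard deviation of the summed score.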
Teachers’ Outcome Measures
Our primary outcome of interest is teachers’ evaluation scores. In the North Carolina Teacher Evaluation Process, teachers must be reviewed annually by their principals or a similar designated evaluator. To construct teachers’ annual evaluation scores, we fit a Graded Response Model (GRM) on the ratings that teachers receive on each element of their Summary Rating Forms from the North Carolina evaluation regime (Kraft et al., 2020). GRMs belong to the family of Item Response Theory (IRT) models that are commonly used in educational and psychological assessment. They were developed for ordered categorical items, such as the five-category scale on the Summary Rating Forms in the NC Teacher Evaluation Process (Samejima, 1968). We then standardize the GRM scores within year to have a mean of 0 and a standard deviation of 1.
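To make the GRM concrete, the sketch below computes the probability of each rating category for one item under Samejima's model, in which the probability of a rating at or above category k is a logistic function of the latent trait. The discrimination and threshold values are hypothetical; the article does not report item parameters, and fitting them would require dedicated IRT software.

```python
import math

def grm_category_probs(theta, discrimination, thresholds):
    """Graded response model: P(rating = k | theta) for k = 0..K-1.

    thresholds must be strictly increasing; K categories need K-1 thresholds.
    """
    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Cumulative probabilities P(rating >= k), with P(>= 0) = 1 and P(>= K) = 0.
    cumulative = [1.0]
    cumulative += [logistic(discrimination * (theta - b)) for b in thresholds]
    cumulative.append(0.0)
    # Each category probability is the difference of adjacent cumulative terms.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]

# Five rating categories, as on the Summary Rating Forms; parameters are made up.
probs = grm_category_probs(theta=0.5, discrimination=1.5,
                           thresholds=[-2.0, -1.0, 0.0, 1.5])
```

A teacher's GRM score is the latent trait value that best explains the full pattern of ratings across all elements, which is why it can then be standardized like any continuous measure.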
We also examine additional teacher outcome measures, including the number of days a teacher is absent, retention in the same school and same district in the following year, and impacts on math and ELA test scores, when available. Retention in the same school (district) is a binary indicator of whether a teacher returns to teach in the same school (in WCPSS) in the following school year. By definition, those who remain teaching in the same school in the following year also remain teaching in the same district, WCPSS, in the following year. However, those who remain teaching in WCPSS in the following year are not necessarily teaching in the same school in the following year. To calculate teachers’ impacts on test scores, we estimate value-added models for math and ELA teachers of fourth- through eighth-grade students (see Supplementary Appendix C in the online version of the journal).
Empirical Strategy
To estimate whether TeacherFit scores are predictive of teacher-level outcomes, we use ordinary least squares to estimate
$$Y_{jkt} = \delta_0 + \delta_1 \mathrm{Score}_j + T_j + S_{jkt} + \delta_t + \delta_h + \varepsilon_{jkt} \quad (1)$$
where $Y_{jkt}$ indicates the outcome of teacher $j$ in school $k$ at time $t$, $\delta_0$ is an intercept, and $\varepsilon_{jkt}$ is an error term. $\mathrm{Score}_j$ refers to teacher $j$’s standardized TeacherFit index score. The coefficient of interest, $\delta_1$, is the expected change in the outcome $Y$ associated with a one standard deviation increase in an applicant’s TeacherFit index score. $T_j$ represents a vector of indicators for the teacher characteristics of race, gender, and experience. In theory, including controls for teacher experience may account for variation in the outcome that would instead be attributable to differences in TeacherFit scores in the absence of experience controls. However, we include these to address whether and to what extent TeacherFit scores can provide additional predictive information, above and beyond what is already known from resumes at the time of hiring, such as teacher experience.
$S_{jkt}$ represents a vector of annual school characteristics of teacher $j$’s school $k$, including student gender, race, Limited English Proficient (LEP) status, and special education status, aggregated to the school level, along with mean school-level prior test scores and school size. $\delta_t$ represents year indicators, and $\delta_h$ represents indicators for the number of years since being newly hired. Standard errors are clustered at the teacher level.
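Clustering at the teacher level matters because a teacher can contribute more than one year of data. The sketch below shows the idea for a deliberately simplified version of the model with only an intercept and the TeacherFit index (no teacher, school, year, or tenure controls). The function name, the data, and the small-sample correction are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def slope_with_clustered_se(x, y, cluster_ids):
    """OLS slope of y on x (with intercept) and its cluster-robust SE."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    x_dev = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in x_dev)
    slope = sum(d * (yi - y_bar) for d, yi in zip(x_dev, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Sum x_dev * resid within each cluster before squaring, which allows
    # arbitrary correlation among the errors of the same teacher.
    cluster_scores = defaultdict(float)
    for d, u, g in zip(x_dev, resid, cluster_ids):
        cluster_scores[g] += d * u
    meat = sum(s * s for s in cluster_scores.values())

    n_clusters = len(cluster_scores)
    k = 2  # parameters estimated: intercept and slope
    # Assumption: a commonly used finite-sample adjustment.
    correction = (n_clusters / (n_clusters - 1)) * ((n - 1) / (n - k))
    se = (correction * meat) ** 0.5 / sxx
    return slope, se

# Made-up teacher-year records: four hypothetical teachers, two years each.
teacher_id = [1, 1, 2, 2, 3, 3, 4, 4]
score = [-1.0, -1.0, -0.2, -0.2, 0.4, 0.4, 1.1, 1.1]
evals = [-0.3, -0.1, 0.0, -0.2, 0.2, 0.1, 0.3, 0.4]
slope, se = slope_with_clustered_se(score, evals, teacher_id)
```

With the full set of controls the same sandwich formula applies in matrix form, and standard regression packages provide it directly.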
To alleviate concerns of bias stemming from the possibility that teachers with higher (or lower) TeacherFit scores are systematically sorting into schools or job assignments (e.g., third grade, middle/secondary math, middle/secondary ELA) that enable teachers to have better performance measures, we also estimate these models with (a) school fixed effects in place of school-level characteristics, and (b) job assignment fixed effects. In these models, the identifying variation comes from applicants who are hired into the same school and applicants who are hired into the same job assignment.
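One way to see what school fixed effects do is the within transformation: subtracting each school's mean from both the predictor and the outcome, so that only comparisons among teachers in the same school remain. The sketch below covers school fixed effects only (not the additional job assignment fixed effects) and uses made-up data and hypothetical function names.

```python
from collections import defaultdict

def demean_within(values, groups):
    """Subtract each group's mean from its members (the within transformation)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        totals[g] += v
        counts[g] += 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]

def within_school_slope(score, outcome, school):
    """Slope of outcome on score using only within-school variation."""
    x = demean_within(score, school)
    y = demean_within(outcome, school)
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Hypothetical data: school B has both higher scores and higher outcomes.
school = ["A", "A", "A", "B", "B", "B"]
score = [-1.0, 0.0, 1.0, 0.5, 1.5, 2.5]
outcome = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
slope = within_school_slope(score, outcome, school)
```

In this toy example the within-school slope is 0.1, whereas a pooled regression that ignores schools would return a larger slope because it also picks up the difference in levels between the two schools. That sorting-driven component is what the fixed effects are meant to remove.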
A limitation worth noting is the potential for bias stemming from the possibility that administrators may—subconsciously or otherwise—reward higher TeacherFit scores, which they observed during the hiring process, with higher evaluation scores. This could bias our estimate of the relationship between TeacherFit and evaluation scores upward. However, we suspect that this would not be a large source of bias in this specific context as administrators were not advised or guided to put much stock in TeacherFit scores.
Sample Selection Correction
Although we can observe the outcomes of interest for hired applicants, we lack information on how non-hired applicants would have performed had they been hired. In other words, we cannot examine the relationship between TeacherFit scores and teacher performance for the full range of applicants. Given that the individuals making hiring decisions in WCPSS are likely trying to select applicants whom they perceive to be of higher quality, the hiring process may introduce selection bias into our estimates. Specifically, we are concerned that low-scoring individuals who end up hired as new teachers in WCPSS, in spite of their low scores, are particularly impressive in ways that are (a) unobservable and (b) correlated with their performance as teachers. To alleviate concerns of bias from sample selection, similar to Goldhaber et al. (2017), we estimate sample selection–corrected models using a Heckman selection model (Heckman, 1979). We identify the model using a function of the school size growth among the set of schools to which applicants submit applications (see Supplementary Appendix D in the online version of the journal). The extent of school size growth among an applicant’s set of schools is predictive of the likelihood they become a newly hired teacher in WCPSS but is unrelated to on-the-job performance as a teacher. To preview, we find that sample selection–corrected estimates are similar to our estimates without corrections, alleviating concerns of large bias from sample selection.
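As a rough illustration of the correction term in a Heckman-type model: in the two-step version of the estimator, a first-stage probit models the probability of being hired, and the inverse Mills ratio evaluated at each hired applicant's fitted probit index is added as a regressor in the outcome equation. The text does not say whether the two-step or maximum likelihood version was used, and the index values below are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, SD 1

def inverse_mills_ratio(z):
    """lambda(z) = pdf(z) / cdf(z) for the standard normal distribution."""
    return _STD_NORMAL.pdf(z) / _STD_NORMAL.cdf(z)

# Hypothetical fitted first-stage indices for four hired applicants.
# A lower index means the hire was less likely given observables.
hire_index = [-1.5, -0.5, 0.0, 1.0]
selection_terms = [inverse_mills_ratio(z) for z in hire_index]
```

The ratio is larger for applicants whose hiring was less likely given their observables, which is the group the text is most concerned about: low scorers who were hired anyway. The excluded variable (school size growth) is what lets the first stage vary independently of the outcome equation.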
Results
We present estimates from Equation 1 in Table 2. The odd-numbered columns include controls for school characteristics, while the even-numbered columns replace the school characteristics controls with school fixed effects and add job assignment fixed effects. As shown in Columns (1) and (2), we find that scores from the TeacherFit screening tool significantly predict teachers’ standardized evaluation scores. A one standard deviation increase in the TeacherFit index is associated with a 0.08 SD increase in evaluation scores in our baseline model. After including school and job assignment fixed effects, this estimate attenuates slightly but remains statistically significant at 0.06 SD. This magnitude is about 16% of the estimated within-teacher returns to experience after 1 year of teaching (0.38 SD, estimated with WCPSS data from 2015–2016 through 2017–2018). Columns (3) and (4) examine the relationship between the TeacherFit scores and teacher absences. The coefficients are positive and small, but the results are not statistically significant—null results that are consistent with those reported by both Goldhaber et al. (2017) and Rockoff et al. (2011).
Table 2. Relationship between TeacherFit Scores and Teacher Outcomes
Notes. Clustered standard errors at the teacher-level are in parentheses. All regressions include a vector of indicators for the teacher characteristics of race, gender, and experience. School characteristics include student gender, race, Limited English Proficient (LEP) status, and special education status, aggregated to school-level, along with mean school-level prior test scores and school size. VA = value-added; FE = fixed effects.
*p < .05. **p < .01. ***p < .001.
In columns (5) through (8), we examine the relationship between TeacherFit scores and teacher retention. Surprisingly, in our baseline models, we find that a one standard deviation increase in the TeacherFit index is associated with a 3.4 percentage point decrease in the likelihood of remaining as a teacher in the same school in the following year (column [5]) and a 2.4 percentage point decrease in the likelihood of remaining in WCPSS in the following year (column [7]). After including school and job assignment fixed effects, the results attenuate slightly. These retention results provide some evidence that TeacherFit scores are negatively associated with within-school and within-district teacher retention.
These findings are contrary to the retention results from Goldhaber et al. (2017), who find that higher scores on a screening rubric, which is completed by human resources hiring officials, predict an increase in district retention, as well as results from Jacob et al. (2018), who find that higher screening scores predict a higher likelihood of remaining in the hiring school. Jacob et al. (2018), however, do find that teachers with better academic background scores are more likely to leave their hiring school and more likely to leave District of Columbia Public Schools (DCPS) after their first year.
Columns (9) through (12) present results of the relationship between TeacherFit scores and value-added to math and ELA test scores of students in Grades 4 through 8, measured in student-level test score SDs. All our estimates are relatively close to 0 and are not statistically significant. The point estimates for math value-added scores, 0.005 (column 9) and 0.004 (column 10), are equivalent to 0.025 SD and 0.020 SD, respectively, in teacher-level SDs. The point estimates for ELA value-added scores, 0.002 (column 11) and −0.002 (column 12), are equivalent to 0.014 SD and −0.014 SD, respectively, in teacher-level SDs.
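The conversion between student-level and teacher-level units can be checked with simple arithmetic. The text does not report the standard deviation of teacher value-added directly, so the values below are backed out from the reported pairs of estimates and are approximate because the published figures are rounded.

```python
def implied_teacher_level_sd(estimate_student_units, estimate_teacher_units):
    """SD of teacher value-added (in student-level SDs) implied by a reported pair."""
    return estimate_student_units / estimate_teacher_units

# Reported pairs: (student-level estimate, teacher-level equivalent).
math_sd = implied_teacher_level_sd(0.005, 0.025)  # column 9
ela_sd = implied_teacher_level_sd(0.002, 0.014)   # column 11

# Re-derive the remaining reported equivalents from the implied SDs.
math_col10 = 0.004 / math_sd
ela_col12 = -0.002 / ela_sd
```

Under these inferred values, the teacher-level SD of value-added is about 0.20 student-level SDs in math and about 0.14 in ELA, and the column 10 and column 12 conversions reproduce the reported 0.020 and −0.014.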
Selection-Corrected Estimates
Table 3 presents our selection-corrected estimates of the relationship between TeacherFit scores and our outcomes of interest. Here, the sample of hired teachers is smaller than that included in Table 2, as it is limited to applicants who (a) submitted teacher applications in the same calendar year in which they were hired, and (b) applied to schools where we can measure the change in school size between the prior year and the time of application. We also include only newly hired teachers’ first year in the data, and the model is fit at the person level. In this table, we present estimates from our baseline model (i.e., including school characteristics, and absent school and job assignment fixed effects), displaying results without and with the sample correction in the odd and even columns, respectively. The magnitudes of the estimates appear substantively similar without and with the sample selection correction, alleviating concerns of large bias introduced by sample selection. Given these results, we prefer the estimates presented in Table 2 from our larger and less restrictive sample.
Table 3. Sample Selection-Corrected Estimates
Notes. Clustered standard errors at the teacher-level are in parentheses. All regressions include a vector of indicators for the teacher characteristics of race, gender, and experience, as well as student gender, race, Limited English Proficient (LEP) status, and special education status, aggregated to school-level, along with mean school-level prior test scores and school size. VA = value-added.
p < .10. *p < .05. **p < .01. ***p < .001.
Conclusion
We find that scores from the TeacherFit instrument have some capacity to predict teacher performance as measured by evaluation scores from principals. We find that a one standard deviation increase on an index of TeacherFit scores is associated with a 0.06 standard deviation increase in evaluation scores. However, we do not find a significant relationship between TeacherFit scores and teacher impacts on test scores. Furthermore, we find some evidence that teachers with higher TeacherFit scores are more likely to leave their hiring schools after the first year.
These results suggest that the TeacherFit commercial screening tool is not necessarily a substitute for the promising screening processes that are conducted by human resources officials as described in the studies by Bruno and Strunk (2019), Goldhaber et al. (2017), and Jacob et al. (2018). The screening scores from these more elaborate screening processes appear to be stronger predictors of desirable teacher outcomes, and investing in these screening systems may have higher payoffs than investing in commercial screening tools that may be cheaper and easier to implement. These TeacherFit results are also more modest than those documented in Rockoff et al.’s (2011) examination of Haberman PreScreener scores, though the differences in teacher characteristics between studies (WCPSS new hires vs. New York City elementary/middle math teachers) make comparison difficult. Nevertheless, our results focus on just one example of a “big data” commercial screening instrument. As districts adopt and/or continue using commercial screening tools, researchers and practitioners should monitor the predictive validity of these tools to ensure that scores from these tools contain information that can be used to improve teacher selection and retention.
Supplemental Material
Supplemental material, sj-pdf-1-epa-10.1177_01623737221131547 for Can a Commercial Screening Tool Help Select Better Teachers? by Olivia L. Chi and Matthew A. Lenard in Educational Evaluation and Policy Analysis
Footnotes
Acknowledgements
We thank Martin West, Eric Taylor, David Deming, seminar participants at Harvard and Boston University, conference participants at the Association for Education Finance and Policy (AEFP), as well as the editors and anonymous reviewers for their valuable comments and feedback. We are especially grateful to the Wake County Public School System for providing data. All remaining errors are our own.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research reported here was supported, in part, by the Institute of Education Sciences, U.S. Department of Education, through grant R305B150010 to Harvard University. The opinions expressed are those of the authors and do not represent the views of the Institute or the U.S. Department of Education. Additional support came from the Multidisciplinary Program in Inequality and Social Policy at the Harvard Kennedy School.
Authors
OLIVIA L. CHI, PhD, is an assistant professor at Boston University’s Wheelock College of Education and Human Development. Her research focuses on the economics of education, teacher labor markets, and measures of teacher quality.
MATTHEW A. LENARD, MA, is a PhD candidate at the Harvard Graduate School of Education. His research focuses on the causal impacts of K–12 education programs and policies.
References
