Abstract
Using data from a large southwestern school district, Bui, Craig, and Imberman investigated the effects of gifted and talented programming on middle school students’ achievement and behavior (attendance and discipline) through two avenues. For the first set of analyses, the authors used a regression discontinuity design that took advantage of a discontinuity in eligibility requirements at the cut-score; the second set of analyses used data from randomized lotteries of two oversubscribed Gifted and Talented (GT) magnet schools. Findings from both sets of analyses revealed no discernible impact on student outcomes, with the exception of science achievement for the students attending the GT magnet schools. This commentary highlights both the methodological and practical considerations of the study’s findings and the importance of basing policies and procedures on high-quality evidence sources.
History of the Topic
As noted by Preskill (2008), there has been a “social epidemic of evaluation” (p. 1229) across the country, with educational policymakers renewing their demand for evidence that the monies spent on programs are in fact worth the investment. Likewise, there remains a continued call from leaders in the field of gifted education (e.g., Callahan, 2013; Robinson, Shore, & Enersen, 2007; VanTassel-Baska, 2006) for school district personnel to engage in evaluating their gifted programs, so that services for advanced learners can be improved or maintained at a high level. In 2014, the U.S. Department of Education commissioned researchers (Callahan, Moon, & Oh, 2014) to carry out a national survey of local education agencies’ (LEAs) practices regarding serving gifted students, with one of the areas focusing on “evaluation and program improvement.” Callahan et al. (2014) reported that more than 50% of the responding districts did not have any program evaluation requirements or strategic plans in place to monitor program activities. Of those that reported having an evaluation requirement, in the majority of cases the requirement was fulfilled by an evaluation of limited scope carried out by internal program personnel. Furthermore, roughly half of the respondents indicated no plans for an evaluation of their gifted programs within the next 12 to 18 months.
Several researchers in the field have offered reasons why LEAs do not systematically evaluate their gifted programs. For example, Callahan, Bland, Hunsaker, Tomlinson, and Moon (1997) found that many districts lacked personnel with adequate knowledge and skills for conducting evaluations; Tomlinson, Bland, Moon, and Callahan (1994) found that LEA personnel failed to disseminate evaluation findings when an evaluation had been conducted. Adelson, McCoach, and Gavin (2012) suggested that the lack of empirical evidence regarding gifted programming effectiveness is due to the difficulty of controlling variables that might affect outcomes, a difficulty that stems from the inability to randomly assign gifted students to programs. As Adelson et al. note, Vaughn, Feldhusen, and Asher (1991) described the reality of establishing control groups as “the ethical problem of withholding service from qualified students” (p. 92). While this ethical issue remains today (and most likely will persist), some work (Adelson et al., 2012; Bui, Craig, & Imberman, 2011) has been conducted using “ex post facto” designs with archived data to investigate potential programming effects.
Background of the Study
Using administrative records from the academic years 2007-2008 through 2009-2010, Bui et al. (2011) applied a regression discontinuity (RD) design to estimate the effects of program participation on student achievement, attendance, and disciplinary infractions. The design compared the performance of students who were just “barely” eligible (i.e., met the cutoff) for selection into the district’s middle school gifted program with that of students who scored just below the cutoff. In addition, the researchers took advantage of the randomized lotteries that determined who attended two oversubscribed GT magnet programs to estimate program effects on achievement and attendance. Specifically, the authors compared students who were selected by lottery to attend one of the GT magnet schools with students who were eligible but lost the lottery and instead attended a neighborhood gifted program, another magnet school based on a different specialty area, or a charter school. The district implemented a universal fifth-grade screening and identification process based on a matrix score that combined achievement scores, grades, a nonverbal assessment score, teacher recommendations, and obstacle points (e.g., socioeconomic status).
Methods
Using archived data from a large southwestern urban school district, Bui et al. (2011) investigated gifted program treatment effects on students’ academic achievement and behavior (i.e., attendance and discipline). Two approaches were used to address different but related questions. The first approach was based on a “fuzzy” RD design with approximately 2,600 “barely eligible” seventh-grade middle school students and 5,500 students in two sixth-grade cohorts (defined as students who scored within a 20-unit band around the district’s established cutoff: 10 units above and 10 units below). As noted by Bui and colleagues, the “fuzzy RD” model (Lee & Lemieux, 2010) was employed because matrix scores do not map directly onto actual placement in the GT program, owing to a variety of factors (e.g., alternative tests, appeals). However, the authors note that using a 10-point band above and below the cutoff (a 20-point bandwidth) captured a fairly linear relationship between matrix scores in that range and GT status. Given this bandwidth, according to the authors, the resulting interquartile ranges ran from the 69th to the 89th national percentile rank (NPR) in reading and from the 81st to the 94th NPR in mathematics.
The second approach took advantage of randomized lotteries that determined who was admitted to two oversubscribed GT magnet schools within the district (542 students, 394 of whom were offered admission). Specifically, the second set of analyses compared students who lost the lottery and attended either a neighborhood GT program or another magnet school with a different focus area to students who attended one of the GT magnet schools.
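The logic of a fuzzy RD estimate can be illustrated with a minimal sketch in Python on simulated data. Everything in it is an assumption made for illustration: the cutoff of 50, the placement rates of 85% above and 10% below the cutoff, the true effect of 0.20 SD, and the function names are hypothetical and are not taken from Bui et al. (2011), whose models also include covariates. The sketch fits a separate line on each side of the cutoff within a 10-point band and then divides the jump in the outcome at the cutoff by the jump in the probability of GT placement.

```python
import numpy as np


def jump_at_cutoff(running, y, cutoff, bandwidth):
    """Local linear estimate of the discontinuity in y at the cutoff."""
    x = running - cutoff
    keep = np.abs(x) <= bandwidth
    x, y = x[keep], y[keep]
    above = (x >= 0).astype(float)
    # Columns: intercept, above-cutoff indicator, slope, slope change above cutoff.
    design = np.column_stack([np.ones_like(x), above, x, x * above])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]  # coefficient on the above-cutoff indicator


def fuzzy_rd_estimate(running, treated, outcome, cutoff, bandwidth):
    """Ratio of the outcome jump to the treatment-probability jump."""
    first_stage = jump_at_cutoff(running, treated.astype(float), cutoff, bandwidth)
    reduced_form = jump_at_cutoff(running, outcome, cutoff, bandwidth)
    return reduced_form / first_stage, first_stage


# Simulated data only; all values below are hypothetical.
rng = np.random.default_rng(0)
n = 400_000
CUTOFF = 50.0
BANDWIDTH = 10.0
TRUE_EFFECT = 0.20  # in SD units of the outcome

score = rng.uniform(30.0, 70.0, n)  # stand-in for a matrix score
# "Fuzzy" assignment: placement is likely, but not certain, above the cutoff.
p_placed = np.where(score >= CUTOFF, 0.85, 0.10)
treated = rng.random(n) < p_placed
outcome = 0.03 * (score - CUTOFF) + TRUE_EFFECT * treated + rng.normal(0.0, 1.0, n)

estimate, first_stage = fuzzy_rd_estimate(score, treated, outcome, CUTOFF, BANDWIDTH)
print(f"first-stage jump in placement: {first_stage:.3f}")
print(f"fuzzy RD estimate of the effect: {estimate:.3f}")
```

Because placement does not jump from 0 to 1 at the cutoff, dividing by the first-stage jump rescales the outcome discontinuity into an effect for students whose placement was determined by crossing the cutoff.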
Results and Major Implications
Results from the first investigation (RD) indicated that the effects of GT program participation for barely eligible students were near 0 in four of the subject areas (0.06 standard deviations [SD] for mathematics, 0.09 SD in reading, 0.15 SD in language, 0.13 SD in social studies), with science showing a small positive effect (0.23 SD). In other words, after 1.5 years of participation in a middle school gifted program, there was no difference in achievement in reading, mathematics, and social studies between students who had participated in the program and their peers who had barely missed the established cutoff. Although the results were not statistically significant, there was also a decline in the achievement scores of participating students in these three areas. When the authors examined specific student subgroups (i.e., females, males, free/reduced lunch program or not, Black, Hispanic, White, GT participation in fifth grade or not), they found little evidence of impact.
For the lottery analysis of admission into a GT magnet middle school, the authors used a two-stage least-squares approach comparing “winners” and “losers” and reported findings similar to those of the RD analysis. Again, evidence indicated that attending the GT magnet school had little impact on students’ academic achievement in four of the five subject areas. The one subject area where a positive effect was found was science, where scores were about 0.25 SD higher for students attending the oversubscribed magnet schools. The authors note, however, that these estimates are imprecise, given the small sample size involved in the analysis.
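In the simplest case, with a single binary lottery offer and no covariates, the two-stage least-squares estimate reduces to a ratio of two differences in means (the Wald estimator). The notation below is introduced here for illustration and is not the authors’ specification, which may include additional controls: Z indicates winning the lottery, D indicates attending a GT magnet school, and Y is the outcome.

```latex
\hat{\beta}_{\text{2SLS}}
  = \frac{\bar{Y}_{Z=1} - \bar{Y}_{Z=0}}{\bar{D}_{Z=1} - \bar{D}_{Z=0}}
```

The numerator is the difference in mean outcomes between lottery winners and losers; the denominator is the difference in their magnet attendance rates. Both differences are estimated with sampling error, so the ratio is imprecise when the sample is small.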
Caveats and Limitations
As is always the case with research, study limitations warrant acknowledgment because they influence the interpretation and generalization of a study’s findings. Furthermore, it is important to understand contextual or environmental variables that may affect the quality of data and thus the interpretation and generalization of findings. From a methodological standpoint, Bui et al. (2011) carried out analyses to verify several important aspects of the RD design to minimize the violation of the statistical assumptions upon which the RD design is built. For example, the authors used a “fuzzy” RD design to account for the possibility that some “identified” students might not have received treatment and some “comparison” students (i.e., crossovers) actually received treatment. They investigated the potential manipulation of the forcing variable (i.e., GT qualification) by providing statistical tests and graphical displays of the discontinuities. Specifically, the authors found that the differences in density around the discontinuity point were similar to changes throughout the distribution; that there was no relationship between student demographic characteristics (race/ethnicity, gender, limited English proficiency, fifth-grade GT status, special education status, or federal lunch program) and GT status; and that there was no relationship between GT status and discipline or school attendance. The authors also investigated the potential for teacher manipulation of recommendations, which would affect the forcing variable, and in some situations created “synthetic matrix points” to accommodate instances where teachers might have had influence on the forcing variable. The potential for missing data and its impact on the findings were also investigated, with no significant effects on manipulation uncovered. Overall, the authors concluded that manipulation of the forcing variable was unlikely. Each of these steps indicates that the authors were attentive to methodological concerns.
However, every coin has two sides, and in this study additional considerations bear directly on the findings and should be weighed before making broad generalizations about gifted programming.
Practical Concerns
While the authors demonstrated attentiveness to the specific statistical procedures carried out in the study and followed acceptable statistical practice (i.e., checking statistical test assumptions), there is always debate about how to handle instances where data do not meet statistical assumptions. For this particular study, the authors appear to have fully disclosed all decisions that they made in terms of data manipulation, such that one could replicate the study given access to the data. However, the main concern, which should have been addressed prior to the RD design analyses, is the use of the district matrix scores themselves without considering their validity in terms of measuring “giftedness” and their alignment with the district’s definition of giftedness. While the matrix scores allowed for group assignment (eligible for services or not), research has documented the invalidity of grades (e.g., Cizek, 1996; Cizek, Rachor, & Fitzgerald, 1995; Cross & Frary, 1996; Frary, Cross, & Weber, 1993; Friedman & Frisbie, 1995, 2000) and the biases of teachers in making recommendations for gifted programs (e.g., Siegle & Powell, 2004), both of which were components of the school district’s matrix score process.
Although the use of multiple measures (akin to those used in a matrix score process) to increase reliability has been widely supported by professional organizations (e.g., American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 2014), there is also the challenge of how to logically combine different measures (see Henderson-Montero, Julian, & Yen, 2003, for a full discussion). Furthermore, much has been written in the field of gifted education regarding the inappropriate use of matrix scores for gifted identification (e.g., Callahan & McIntire, 1994; Feldhusen & Baska, 1989; Passow & Frasier, 1994; Moon, 2013; VanTassel-Baska, 2007) as well as the use and combination of multiple measures for high-stakes decisions (e.g., Chester, 2003; McBee, Peters, & Waterman, 2014). For districts that persist in the use of the matrix process for gifted identification, the typical procedure of combining measures to arrive at a matrix score involves assigning points based on performance on each measure without consideration of a number of factors, three of which are (1) the validity of the individual components that make up the matrix score in terms of defining “giftedness”; (2) the relative contribution of each assessment, if it is a valid indicator of giftedness (see #1), in measuring the construct of “giftedness” as defined by the district; and (3) whether a systematic process exists for establishing predefined cut-scores or assigning scores to certain levels of performance. Taken together, these issues with matrix scores increase the potential for mislabeling students; that is, some students are identified to receive services when in fact they should not receive those services, and many who should have been identified are not. Thus, the issues of false positives and false negatives have to be taken into consideration when drawing any definitive conclusions based upon the district data.
The use of a “fuzzy” RD design in this study was an acknowledgment that, within the realities of a gifted identification process, there are instances where students who made the cutoff opt not to receive services (atypical) and instances where students who did not make the cutoff (and in theory should have been in the comparison group; referred to in RD design as “crossovers”) are actually placed in a gifted program, often because of political persuasion (e.g., strong parent advocacy). Because this practice occurs in many programs across the country, and assuming that this district is not an anomaly, it calls into question the real suitability of students for the services being provided.
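The consequence of measurement error in a composite score for identification decisions can be sketched with a small simulation in Python. The setup is entirely hypothetical and does not use the district’s data or its matrix: a latent standardized score stands in for the construct the district intends to measure, the observed composite equals that score plus random error, and the top 10% on each are compared. The function name, the 10% threshold, and the reliability values are assumptions chosen for illustration. The sketch addresses only unreliability; problems of validity, such as components that do not measure the intended construct, are not modeled.

```python
import numpy as np


def misclassification_rates(reliability, n=100_000, top_fraction=0.10, seed=0):
    """Return (false positive, false negative) shares of all simulated students.

    `reliability` is the assumed ratio of true-score variance to
    observed-score variance for the hypothetical composite.
    """
    rng = np.random.default_rng(seed)
    true_score = rng.normal(0.0, 1.0, n)
    # Error SD chosen so that var(true) / var(observed) equals `reliability`.
    error_sd = np.sqrt((1.0 - reliability) / reliability)
    observed = true_score + rng.normal(0.0, error_sd, n)

    # Students who should qualify versus students the composite identifies.
    should_qualify = true_score >= np.quantile(true_score, 1.0 - top_fraction)
    identified = observed >= np.quantile(observed, 1.0 - top_fraction)

    false_positive = np.mean(identified & ~should_qualify)
    false_negative = np.mean(~identified & should_qualify)
    return false_positive, false_negative


for reliability in (1.0, 0.9, 0.7):
    fp, fn = misclassification_rates(reliability)
    print(f"reliability={reliability:.1f}  "
          f"false positives={fp:.3f}  false negatives={fn:.3f}")
```

Under these assumptions, a perfectly reliable composite misclassifies no one, and the share of mislabeled students grows as reliability falls, which is why the composition and weighting of a matrix score matter for any analysis that treats the cutoff as meaningful.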
The lack of actual experience in the field of gifted education on the part of Bui et al. (2011) may in fact undermine the applicability of some of the study’s findings to practice. Again, in an examination of Figure 1 of Bui et al. (see manuscript Gifted and Talented Matrix for GT Entry), several questions warrant investigation to ensure that the final matrix score really identified students who need opportunities beyond what the regular classroom curricula provide.
A practical consideration is why the matrix scoring scheme awards more points for superior performance in reading and mathematics than for the same level of performance in science and social studies. Is this because the gifted program focuses predominantly on reading and mathematics? This scoring scheme seems to ignore the overlap of skills between reading and social studies (or science) and between mathematics and science (or social studies). From a practical position, one would expect to see little program effect in less emphasized content areas. Specifically, why does a student receive 12 points for achieving a 95th NPR in reading or mathematics but only 8 points in science or social studies? Why are the ranges of performance for the ability tests not of equal width, and what impact does this unequal distribution have on the likelihood of a student being identified? To what degree does the awarding of obstacle points (e.g., socioeconomic status, English language proficiency) create a situation where there may be a misalignment between students’ needs and the services provided or a confounding with other services (e.g., special education)? What role does grading bias on the part of teachers play in students’ report cards, particularly given that in this district’s process a fairly significant number of report card points can be awarded (up to almost one third of the points needed for GT qualification)? What is the real contribution of teachers, given that they not only provide recommendations but also directly control grades, and is this contribution greater than that of other pieces of data that are more psychometrically sound (i.e., achievement test scores)? These questions, and others, come from having an understanding of education in general, of gifted programs in particular, and of the often political nature of the gifted identification process.
Conclusion
It is important to recognize that the issues raised above are not issues that Bui and colleagues (2011) had any control over; rather, they are concerns regarding district policies and procedures and their implementation. However, given these practical concerns, it is imperative that research investigations in gifted education consider the direct impacts that the field’s practices have on students. It is also important that such investigations acknowledge the limited generalizability of their findings. In this particular study’s (Bui et al., 2011) examination of programming effects, generalizations about an entire field appear to be suggested on the basis of only one district’s gifted program data. Broadly speaking, however, this study’s findings call for the broader field to ensure that all programs are based upon policies and procedures that provide rigor and challenge rather than simply “privilege,” and that those policies and procedures are carried out in a responsible manner.
From a programmatic perspective, the Bui et al. (2011) study can serve as the beginning of a dialogue among practitioners, policymakers, and researchers about the real meaning of best practices and how imprecision in decisions about programming, identification, and services has a direct impact on students whose needs are not or will not be addressed by the regular classroom curricula.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
