Abstract
Even as the importance of replication research has become more widely understood, the field of gifted education is almost completely devoid of replication studies. An area in which replication is a particular problem is in student identification research, since instrument validity is a necessary prerequisite for any sound psychometric decision. To begin to address this issue, our study sought to replicate the internal validity structure of three teacher rating instruments. The goal was to determine whether data gathered using these instruments fit their published internal validity structures. Results indicated all three instruments failed to meet traditional fit criteria, but to varying degrees, and that further replication or instrument revision is needed before these instruments can be used with confidence.
Introduction
A key tenet of research in any field is that for any theory to be true, it must be supported by multiple, ideally independent replications. If any field is to make advancements while discarding errant or spurious findings, replication is essential. Unfortunately, replication is not happening. In a 2012 paper by Makel, Plucker, and Hegarty, the authors found that just over 1% of all research published in the top five impact-factor journals in psychology actually engaged in replication. Given the scope of academic publishing and the sheer number of papers published annually, this lack of replication translates to many papers being published that are erroneous at best or outright fraudulent at worst (John, Loewenstein, & Prelec, 2012). Authors such as Ioannidis (2005) have gone as far as to claim that the majority of published research findings are false. To be clear, replication research is not only important because of unethical behavior. The very nature of Null Hypothesis Significance Testing means that some papers will be published showing significant findings, even when these findings are due completely to chance. This is why Makel and colleagues (2012), Ioannidis (2005), and McBee and Matthews (2014) in their editorial guidance for this journal have called for more replication research as well as a wider acceptance of replication as just as important as original research findings. To quote a seminal paper by Makel and Plucker (2014), “facts are more important than novelty” (p. 304), and facts require replication.
Literature Review
The field of gifted education is no better with regard to the prevalence of replication research. In 2015, Makel and Plucker found that only four of 1,945 total articles published in Gifted Child Quarterly between 1957 and 2012 engaged in replication—a rate of .21%. Journal for the Education of the Gifted was only slightly better at .92%. Within the body of work known as gifted education, student identification stands out as one of the top four topics most often published about in scholarly articles (Dai, Swanson, & Cheng, 2011). Given the wide variation in identification practices across states (National Association for Gifted Children [NAGC], 2015) and some recent work suggesting just how problematic common identification systems are (McBee, Peters, & Miller, 2016; McBee, Peters, & Waterman, 2014), the identification domain of inquiry appears prime for replication work.
The NAGC’s (2015) most recent State of the States Report presented “multiple measures” as the most common form of student identification across the United States. However, the report also noted that the most common instance in which these multiple measure identification systems are implemented is following a nomination by a teacher or parent. Similarly, a recent national survey of elementary gifted education programs found that 86% of responding districts relied on teacher nominations in their identification protocols (Callahan, Moon, & Oh, 2013). Clearly, some form of teacher input is a significant variable for student identification in a large number of districts.
Research by McBee et al. (2016) showed that three criteria are necessary for successful two-stage gifted identification systems (those that utilize a screening/nomination and a confirmation phase): (a) a strong correlation between phases or measures, (b) high instrument reliability, and (c) inclusive cut scores on the nomination phase. In their analyses of some common identification practices, large percentages of students were missed, even under generous assumptions. However, all of their analyses assumed the instruments or phases in question were of high quality—strong construct validity evidence, a strong theoretical basis, and that they were being used appropriately for their purpose. No imperfection in these components was included in their models. If any of the instruments used turned out to not be measuring what they are supposed to be measuring (i.e., have content validity problems), then the flaws in an identification system will be exacerbated. If any such content validity problems do exist, then research or practice based on these instruments is also likely to be flawed, further impeding the pursuit of science and the larger development of talent.
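The interplay of these three criteria can be illustrated with a small simulation. The sketch below is not McBee et al.'s model; it is a minimal illustration under stated assumptions: a standard-normal true score, "gifted" defined as the top 10% of true scores, and each phase's observed score equal to the true score plus independent error scaled to a chosen reliability. The function name and all parameter values are hypothetical.

```python
import numpy as np

def two_stage_sensitivity(nom_reliability, conf_reliability,
                          nom_cut_pct, conf_cut_pct=90.0,
                          n_students=100_000, seed=0):
    """Share of truly gifted students who pass both phases (simulated)."""
    rng = np.random.default_rng(seed)
    true_score = rng.standard_normal(n_students)

    def observe(reliability):
        # Observed score with unit variance and the requested reliability.
        error = rng.standard_normal(n_students)
        return np.sqrt(reliability) * true_score + np.sqrt(1.0 - reliability) * error

    nomination = observe(nom_reliability)
    confirmation = observe(conf_reliability)

    # Assumption: "truly gifted" means a true score at or above the confirmation percentile.
    truly_gifted = true_score >= np.percentile(true_score, conf_cut_pct)
    nominated = nomination >= np.percentile(nomination, nom_cut_pct)
    confirmed = confirmation >= np.percentile(confirmation, conf_cut_pct)
    identified = nominated & confirmed
    return identified[truly_gifted].mean()

# Hypothetical reliabilities; only the nomination cut score differs.
strict = two_stage_sensitivity(0.80, 0.95, nom_cut_pct=90.0)
inclusive = two_stage_sensitivity(0.80, 0.95, nom_cut_pct=70.0)
print(f"strict nomination cut:    sensitivity = {strict:.2f}")
print(f"inclusive nomination cut: sensitivity = {inclusive:.2f}")
```

Because the same simulated students are used in both calls, lowering the nomination cut can only add students to the nominated pool, so sensitivity under the inclusive cut is at least as high as under the strict cut.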
Teacher nominations are one place where the field of gifted education has seen replication research applied to positive effect. In 1959, Pegnato and Birch found that teachers were not efficacious or efficient when it came to nominating students as mentally gifted. In this case, the confirmation assessment was a score in the top 1% of an individual intelligence test. Pegnato and Birch’s article has been cited more than 260 times, often as evidence of the flaws in using teacher ratings or nominations for the purposes of identification. However, in 1994, Gagné published a reanalysis of the Pegnato and Birch (1959) data and found that (a) the authors had used inappropriate methods in their original analyses; and (b) that upon reexamination, teachers came across no worse than the other measures analyzed in the original Pegnato and Birch study. When replication and this kind of work checking does not happen, erroneous findings are assumed to be correct and are used as the basis for additional research, often for decades, all of which is then based on a flawed premise.
Of the more than 30 existing teacher rating instruments, the four highest quality instruments (based on published psychometric information) are the Scales for Rating the Behavioral Characteristics of Superior Students (SRBCSS; Renzulli et al., 2010), the Gifted Rating Scales (GRS; Pfeiffer & Jarosewich, 2003), the Scales for Identifying Gifted Students (SIGS; Ryser & McConnell, 2004), and the HOPE Teacher Rating Scale (HOPE Scale; Gentry, Peters, Pereira, McIntosh, & Fugate, 2015). Both the SRBCSS and the SIGS were also noted in the report by Callahan and colleagues (2013) as being two of the most commonly used across the country.
Purpose
The purpose of this study was to evaluate the degree to which data collected with three teacher rating instruments (the HOPE Scale, the SIGS, and the GRS) replicated their published models via confirmatory factor analysis (CFA) methods. An instrument that cannot yield data corresponding to its intended theoretical structure cannot yield valid data on the larger construct that it is supposed to measure, resulting in decreases to identification system sensitivity as well as increases in the incorrect identification rate.
Hypothesis
Data collected using three published teacher rating scales will meet established criteria for model fit in support of prior research on each instrument.
Overall, this study set out to replicate past findings regarding the internal structure of three published teacher rating scales in response to calls from Makel (2014) and Makel and Plucker (2015) for more replication research in gifted education.
Method
Instruments
The rating scales used in this study were the SIGS, the GRS, and the HOPE Scale. Although the SRBCSS is also a long-standing and popular teacher rating instrument, because of its length and the time required for teachers to complete it, the school district used in our data collection refused to include it. Although the SIGS, GRS, and HOPE Scale fall under the class of assessments known as teacher rating scales, which use a student’s teacher as the data source, they differ in their theoretical foundation, conceptual definitions of “giftedness” measured, length, and body of previous research. Below is a brief overview of each instrument’s research base and background.
GRS
The GRS consists of Preschool (GRS-P) and School (GRS-S) forms for use with students ages 4 years to 6 years 11 months and 6 years to 13 years 11 months, respectively. The GRS-S includes 72 items across six subscales (Intellectual Ability, Academic Ability, Creativity, Artistic Talent, Leadership Ability, and Motivation), with 12 items per subscale. The GRS subscales were selected having the federal report National Excellence: A Case for Developing America’s Talent (U.S. Department of Education [USDOE], 1993) as a basis and also through review of other rating scales, input from gifted education experts, and a review of the gifted education literature. The standardization sample for the GRS-S consisted of 600 students. Internal consistency estimates ranged from .97 to .99, test–retest reliability ranged from .83 to .97, and interrater reliability ranged from .64 to .79 (Pfeiffer & Jarosewich, 2003, 2007). Although the authors suggest that a factor analysis and evaluation of bias were conducted (Pfeiffer & Jarosewich, 2003), results of these analyses were not included in the test manual. A study separate from the test manual, but conducted by the GRS authors (Pfeiffer & Jarosewich, 2007), found “no age or race/ethnicity differences on any of the scales” (p. 39) based on follow-up analyses conducted on the instrument’s standardization sample of 592 students, 379 of whom were Caucasian. These findings were based on mean score comparison methods (i.e., ANOVA)—methods that French and Finch (2006) noted are not appropriate for assessing across-group equivalency. As to prevalence, Callahan et al. (2013) found that approximately 9% of elementary gifted programs nationwide utilized the GRS in student identification. Ward (2005) reviewed the GRS and although she indicated that the validity information provided in the technical manual is generally adequate, she also suggested that including results of factor analyses would be recommended.
The HOPE Scale
The HOPE Scale (Gentry, Peters, et al., 2015) differs from the other two instruments included in this study in that it is much shorter (11 items) and more general in that it focuses on broad academic and social components of giftedness as opposed to content-specific factors such as mathematics or science. Information regarding reliability and validity evidence is reported in the HOPE Scale Manual (Gentry, Pereira, Peters, McIntosh, & Fugate, 2015) and in two articles (Peters & Gentry, 2010, 2012) and includes information related to student subgroup comparisons and internal validity structure. Peters and Gentry (2012) found no differential item functioning (DIF) related to student race, ethnicity, or family income in a sample of 1,700 K-12 students. However, DIF was present when gender groups were compared. Racial- or ethnic-specific CFAs of internal validity found comparative fit index/Tucker–Lewis index (CFI/TLI) values of .90 to .96 with higher values reported for data from Caucasian students and the lowest values for Hispanic students. Chi-square difference tests were not significant for equal form, factor loading, or latent mean tests. For income group comparisons, CFI/TLI values were .96/.95 for paid lunch students and .93/.91 for free or reduced-price lunch (FRPL) students. However, chi-square difference tests were not significant for equal form and equal factor loading tests. They were significant for equal latent mean tests indicating low-income students did receive lower average scores than their paid lunch peers, but it was unclear whether this was due to actual bias or simply to actual differences in observed levels of the measured behaviors. As with the GRS studies reported above, research on the HOPE Scale has been conducted solely by the instrument’s developers or affiliated colleagues. 
Sandilos (2017) and Sullivan (2017) reviewed the HOPE Scale and commended the HOPE Scale authors on the brevity of the instrument and on the evidence of validity reported in the technical manual and peer-reviewed articles using the scale; however, both reviewers indicated that additional reliability evidence (especially test–retest and interrater reliability) should have been reported.
SIGS
The SIGS includes 84 items across seven factors (General Intellectual, Language Arts, Mathematics, Science, Social Studies, Creativity, and Leadership). The seven factors were designed to reflect the federal definitions of giftedness (USDOE, 1993). The SIGS technical manual (Ryser & McConnell, 2004) provides information on the development and standardization of the SIGS, including the development of norms and reliability evidence. Average alpha internal consistency estimates are presented for general and gifted samples and ranged from .90 to .98. As a measure of interrater reliability, Ryser and McConnell calculated the correlation between ratings from teachers and parents, and those ranged from .43 to .59. Test–retest reliability ranged from .58 to .93. Ryser and McConnell provide information on four types of validity: convergent; discriminant; differential item functioning for ethnicity, race, gender, and gifted status; and interfactor correlation. No information is provided on factor analyses conducted on the SIGS or how well the data collected for initial validation of the instrument fit the theoretical model. The manual does reference the analysis of DIF within the scales comparing African American versus non-African American, Hispanic versus non-Hispanic, and male versus female students. Using a form of logistic regression, the authors tested an earlier version of the instrument and then removed any items that showed significant differences between the two groups being tested. The statement is then made that “the SIGS is considered to be an unbiased measure of students’ strengths” (p. 32). No further details of this evaluation are provided, nor is information provided regarding income group comparisons or a more general evaluation of internal validity structure beyond alpha reliability levels.
We were unable to locate any peer-reviewed articles evaluating the SIGS, although Matthews (2007) and Ward (2007) published reviews of the SIGS in The Seventeenth Mental Measurements Yearbook (Geisinger, Spies, Carlson, & Plake, 2007). Both reviewers indicated that the validation of SIGS involved comprehensive statistical analyses, but also that results are not presented in enough detail in the test manual. The reviewers expressed concerns regarding the sample used in the validation of SIGS, which included information only on three categories for race (i.e., White, African American, and Other) and only approximately 5% of gifted students in the “other” category. The same national survey referenced above showed that 9% of schools nationwide use the SIGS for gifted student identification (Callahan et al., 2013).
Sample
According to recommendations by Lohman (2006), gifted identification should be based on local norms and should not be concerned with national norm comparisons. Following this philosophy, we recruited a single, diverse school district based on size (approximately 25,000 K-12 students), diversity (53% non-Caucasian), and family income (49% on Free or Reduced Price Meals). This racial, ethnic, and income make-up closely mirrored the demographic representation of the United States. The large number of schools in the district (n = 32 elementary schools) was also important for a range of contexts to be sampled. Once district administration agreed to participate and institutional review board approval was received, eight elementary schools were purposively selected for inclusion. Elementary grades were the focus because these were the common grade levels shared by all three instruments. Schools were selected if they were more diverse or had more low-income students than the district average (labeled diverse schools) or if they had less than the district average (nondiverse schools). The idea was to create matched pairs for each instrument such that each instrument was evaluated in diverse and nondiverse settings with instruments randomly assigned to schools for use and students randomly assigned to have instruments completed about them. Some of the schools had lower enrollment than their respective pair. To address this imbalance, two additional schools were added to the first six to make a total of eight participating schools. The idea was for the original sampling pools for each instrument to be as similar as possible. In the case of all three instruments, more low-income students or students from ethnic groups that have been traditionally underrepresented in gifted programs (i.e., Hispanic, African American) were rated than were students from high income or Asian or Caucasian families. Table 1 presents information related to the schools and their students. 
Note that the GRS sample included some students who were not included in the analyses because they were rated with the GRS-P in too small of numbers to be analyzed as a separate group. Thus, the analyses conducted in this study related solely to the GRS-S.
Table 1. Participating Schools Size and Instrument Assignment.
Note. FRPL = free or reduced-price lunch; GRS = Gifted Rating Scales; SIGS = Scales for Identifying Gifted Students.
In addition to the rationale presented above, schools were also purposively cluster sampled as opposed to randomly sampled at the student or teacher levels due to the nature of the school setting and how such ratings would typically be used. It would have been unrealistic to assign some teachers within a school one instrument while assigning a different instrument to the teacher next door. Having teachers at each school use only one instrument avoided this issue and mirrored real-world application. All general education classroom teachers at each of the eight schools (n = 139) were involved in the study. A total of five students were randomly selected from each of the K-5 classes at each of the buildings, and copies of the assigned instrument were then sent to the classroom teacher along with instructions. This assignment process was used to imitate the use of these rating scales as a universal screening tool. Although some schools may use these instruments after a generic nomination (and as part of a confirmation phase), McBee et al. (2014, 2016) noted that this practice is highly problematic and should be avoided. Teachers were not given any special training beyond a brief cover letter asking them to answer each of the instrument’s questions based on the assigned students. Instruments were required to stand on their own without additional instruction or training to best imitate a typical school setting in which few teachers have training or expertise in gifted education but are often asked to rate or nominate students for gifted services (NAGC, 2015). The above-described methods led to a final sample size of 735 K-5 students (229 GRS, 302 HOPE, and 204 SIGS) rated by 139 teachers across eight buildings. See Figure 1 for a visual representation of the sample used in this study.

Figure 1. Actual sample used in analyses.
Completed rating scales were then de-identified and sent to the first author who entered the data into separate databases for further analysis as described below.
Data Analysis
To evaluate the internal validity structures, a CFA was conducted on each instrument followed by group-specific comparisons for family income and ethnicity. This was done to closely mirror common past published analyses for all three instruments based on what limited data were available in each instrument’s respective test manual or outside publications. All analyses were conducted using Mplus Version 7.4. The published instrument models evaluated came directly from each instrument’s test manual. For the HOPE Scale, a two-factor model was specified across the 11 items (six and five items each on the academic and social scales; Peters & Gentry, 2012). For the SIGS, a seven-factor model was specified for the 84 items (12 items on each subscale; Ryser & McConnell, 2004). The GRS-S model was specified with six factors for its 72 items (12 items on each subscale; Pfeiffer & Jarosewich, 2003). The specific item factor models can be found in each instrument’s test manual. Based on recommendations by Finney and DiStefano (2006) regarding structural equation models of Likert-type/ordered categorical data, maximum likelihood estimation was used for all analyses after skewness and kurtosis values were checked for relative normality (see Tables 2, 4, and 6). Degrees of freedom for each model are presented in Tables 3, 5, and 7.
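As a rough guide to how model size drives degrees of freedom, the sketch below counts them for the three specified structures. It assumes a simple-structure parameterization that the text does not confirm: each item loads on exactly one factor, factor variances are fixed to 1, all factor covariances are free, and there are no correlated residuals. The function name is hypothetical, and the values are not taken from Tables 3, 5, and 7; a different parameterization would give different counts.

```python
def cfa_degrees_of_freedom(n_items, n_factors):
    """df for a simple-structure CFA with correlated factors (assumed setup)."""
    # Unique variances and covariances among the observed items.
    unique_moments = n_items * (n_items + 1) // 2
    loadings = n_items                      # one free loading per item
    residual_variances = n_items            # one free residual variance per item
    # Factor variances fixed to 1, so only factor covariances are estimated.
    factor_covariances = n_factors * (n_factors - 1) // 2
    free_parameters = loadings + residual_variances + factor_covariances
    return unique_moments - free_parameters

# (items, factors) as described for each instrument in the text.
models = {"HOPE Scale": (11, 2), "GRS-S": (72, 6), "SIGS": (84, 7)}
for name, (items, factors) in models.items():
    print(name, cfa_degrees_of_freedom(items, factors))
```

Under these assumptions the SIGS and GRS-S models have thousands of degrees of freedom against a few hundred rated students each, which illustrates the small ratio of cases to parameters that motivates the bootstrap approach.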
Table 2. Descriptive Statistics for the GRS-S (n = 189).
Note. GRS-S = Gifted Rating Scales–School.
Correlation is significant at the .01 level (two-tailed).
Table 3. Fit Indices for General and Subgroup CFA for the GRS-S.
Note. All analyses with bootstrapping and maximum likelihood estimation. CFA = confirmatory factor analysis; GRS-S = Gifted Rating Scales–School; CFI = comparative fit index; TLI = Tucker–Lewis index; RMSEA = root mean square error of approximation; CI = confidence interval; SRMR = standardized root mean square residual; FRPL = free or reduced-price lunch.
Underrepresented groups included students from African American, Latino/a, and Mixed Race Families.
Overrepresented groups included students from Caucasian and Asian American families.
Because of the large number of items present in the SIGS and GRS and the resulting small ratio of students rated to parameters, Bollen and Stein’s (1992) Modified Bootstrap estimation was used in all of the CFAs. Although bootstrapping has been shown to yield fit indices closer to actual population parameters (Bollen & Stein, 1992; Fan & Thompson, 1998), it also tends to underestimate the magnitude of most fit statistics, particularly when sample sizes are small, a fact that will be discussed later. Because of this, fit indices such as CFI, TLI, and root mean square error of approximation (RMSEA) gained from bootstrapped analyses will show artificially worse fit. For the present study, this was the case for all fit indices except chi-square and standardized root mean square residual (SRMR), which Fan and Thompson found to suggest artificially better fit.
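The core step of a model-based (modified) bootstrap is to rescale the data so that the model-implied covariance matrix holds exactly in the resampling population, and then to resample rows from the rescaled data. The sketch below illustrates only that step with NumPy; it is not the Mplus procedure used in the study, the function names are hypothetical, the implied covariance matrix comes from an invented one-factor model, and the model refit on each resample is omitted.

```python
import numpy as np

def sym_matrix_power(matrix, power):
    """Raise a symmetric positive-definite matrix to a real power."""
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)
    return (eigenvectors * eigenvalues ** power) @ eigenvectors.T

def rescale_to_implied_cov(data, implied_cov):
    """Transform centered data so its sample covariance equals implied_cov."""
    centered = data - data.mean(axis=0)
    sample_cov = np.cov(centered, rowvar=False)
    return centered @ sym_matrix_power(sample_cov, -0.5) @ sym_matrix_power(implied_cov, 0.5)

def draw_bootstrap_samples(transformed, n_draws, seed=0):
    """Yield row-resampled copies (with replacement) of the transformed data."""
    rng = np.random.default_rng(seed)
    n_rows = transformed.shape[0]
    for _ in range(n_draws):
        yield transformed[rng.integers(0, n_rows, size=n_rows)]

# Hypothetical one-factor model for four items: loadings of .7, residual variances of .51.
loadings = np.full((4, 1), 0.7)
implied_cov = loadings @ loadings.T + np.diag(np.full(4, 0.51))

# Toy ratings standing in for real data (random values, for illustration only).
toy_data = np.random.default_rng(1).normal(size=(200, 4))
transformed = rescale_to_implied_cov(toy_data, implied_cov)
resamples = list(draw_bootstrap_samples(transformed, n_draws=3))
print(np.allclose(np.cov(transformed, rowvar=False), implied_cov))
```

In a full analysis, the model would be refit to each resample and the resulting chi-square values would form the reference distribution against which the observed chi-square is judged.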
In addition to conducting general confirmatory factor analyses on the full samples, single income group comparisons (FRPL and non-FRPL) and ethnic/racial group comparisons (i.e., groups traditionally overrepresented and underrepresented in gifted programs) were conducted. For the purposes of this study, underrepresented racial/ethnic groups were operationalized as Hispanic and African American and overrepresented as Caucasian and Asian (Yoon & Gentry, 2009). Again, this was done because some level of bias or group-specific comparisons were included in each instrument’s test manual.
Incremental (e.g., CFI, TLI) and absolute (e.g., RMSEA, SRMR) fit indices were used to determine the degree of fit for each instrument’s model to its respective data to test our hypothesis. Although chi-square values are also presented in the CFA tables, they were not the primary means of comparison due to the wide variation in instrument and data complexity (i.e., the SIGS and GRS have far more complex models and data than the HOPE Scale) and sample sizes. Overall, the closer that RMSEA and SRMR values are to zero, the better the data fit the intended theoretical model, with values <.08 being indicative of “good” fit (Hu & Bentler, 1999). Furthermore, CFI and TLI values closer to 1.0 indicate stronger fit to the data with values above .90 indicative of acceptable model fit (Brown, 2006; Hu & Bentler, 1999). The SRMR fit index was especially important in the present study because, unlike many other measures of fit, it includes no penalty for model complexity and thus allows for clearer comparisons across the various instruments. The one important downside to the SRMR value—especially in the case of small sample sizes and bootstrapped estimates—is that it is positively biased (Kenny, 2015) and therefore likely to yield inflated values in the present study.
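For readers less familiar with how these indices relate to the chi-square statistic, the sketch below computes RMSEA, CFI, and TLI from a model chi-square and a baseline (independence model) chi-square using their standard textbook formulas. The input values are hypothetical and are not taken from this study's tables. Software packages differ in small details, such as whether RMSEA uses N or N - 1, so the results are illustrative rather than a reproduction of Mplus output.

```python
import math

def rmsea(chi2, df, n):
    """Root mean square error of approximation (N - 1 convention assumed)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index relative to the baseline model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base

def tli(chi2_model, df_model, chi2_base, df_base):
    """Tucker-Lewis index, which compares chi-square/df ratios."""
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    return (ratio_base - ratio_model) / (ratio_base - 1.0)

# Hypothetical values for an 11-item, two-factor model rated on 300 students.
chi2_model, df_model = 150.0, 43
chi2_base, df_base = 1500.0, 55
n_students = 300

print(f"RMSEA = {rmsea(chi2_model, df_model, n_students):.3f}")
print(f"CFI   = {cfi(chi2_model, df_model, chi2_base, df_base):.3f}")
print(f"TLI   = {tli(chi2_model, df_model, chi2_base, df_base):.3f}")
```

With these invented inputs, CFI and TLI fall just above .90 while RMSEA falls above .08, showing how incremental and absolute indices can point in different directions for the same model.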
Results
Results are presented for each instrument separately. In addition, whenever possible, results for the current study are compared with results from previous studies, including those referenced in the technical manuals for each of the instruments. This is imperfect because not every instrument conducted the same analyses or included the same level of detail in their test manuals.
GRS
Table 2 includes descriptive statistics and interfactor correlations for the six subscales of the GRS-S. Interfactor correlations for the six subscales on the GRS-S ranged from .606 to .962, which are considered moderate to strong positive correlations. Alpha internal consistency estimates for the GRS-S ranged from .966 to .994. These values are comparable with the internal consistency estimates reported in the GRS manual, which ranged from .97 to .99 (Pfeiffer & Jarosewich, 2003).
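The alpha internal consistency estimates reported here and for the other instruments follow the usual Cronbach's alpha formula, which compares the sum of the item variances with the variance of the summed scale score. The sketch below shows that calculation on simulated ratings; the data are randomly generated for illustration (12 items and 189 students, loosely mirroring one GRS-S subscale), and the function name is hypothetical.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a students-by-items array of ratings."""
    items = np.asarray(item_scores, dtype=float)
    n_items = items.shape[1]
    sum_item_variances = items.var(axis=0, ddof=1).sum()
    total_score_variance = items.sum(axis=1).var(ddof=1)
    return n_items / (n_items - 1) * (1.0 - sum_item_variances / total_score_variance)

# Simulated ratings: one common factor plus a small amount of item-specific noise.
rng = np.random.default_rng(0)
common_factor = rng.normal(size=(189, 1))
ratings = common_factor + 0.3 * rng.normal(size=(189, 12))
print(round(cronbach_alpha(ratings), 3))
```

When items are this strongly driven by a single common factor, alpha approaches 1, which is the pattern seen in the very high estimates reported for the teacher rating subscales.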
Table 3 presents the fit indices for the GRS-S. General CFA results with the full sample of 189 students indicated poor model fit, with CFI and TLI estimates below .90. However, RMSEA (.09) indicated passable model fit, and SRMR (.03) indicated good model fit. Fit indices for the FRPL students (CFI = .84, TLI = .84, RMSEA = .010) and non-FRPL students (CFI = .67, TLI = .66, RMSEA = .057) indicated poor fit. SRMR for the FRPL and non-FRPL students indicated poor to acceptable model fit. Comparison of the underrepresented and overrepresented groups yielded similar goodness-of-fit results with CFI and TLI estimates well below .90, indicating poor model fit and RMSEA estimates greater than .111 for both groups. One interesting finding is that, despite a larger sample size (108 vs. 81), the chi-square values were actually smaller for the underrepresented group analysis than the overrepresented group analysis (5810 vs. 6509)—a finding which again points to race or ethnicity as less significant than would be expected given the history of underrepresentation in the field. This suggests that race was not much of a factor at all with regard to how well the data fit the GRS-S intended theoretical model and that the data better replicated the model for underrepresented students than for those from overrepresented racial/ethnic groups. Because the GRS technical manual did not include CFA results, comparisons with CFA results of validation were not possible. However, the poor model fit across all groups used in our analyses is reason for concern.
HOPE Scale
The HOPE Scale includes only two subscales (Academic and Social), and the interfactor correlation for the sample used in this study was .792. That result is similar to previous studies using the HOPE Scale (Pereira, 2011; Peters & Gentry, 2010). Estimates of internal consistency of .94 for the Academic subscale and .897 for the Social subscale were slightly lower than the estimates from previous studies (Pereira, 2011) and the HOPE Scale technical manual (Gentry, Pereira et al., 2015). These data are presented in Table 4.
Table 4. Descriptive Statistics for the HOPE Scale (n = 296).
Note. HOPE Scale = HOPE Teacher Rating Scale.
Correlation is significant at the .01 level (two-tailed).
Table 5 presents the results of the general CFAs as well as the CFAs for each income group and ethnic/racial group. The HOPE Scale CFAs yielded goodness-of-fit estimates that are generally in the range of acceptable (i.e., CFI > .9, TLI > .9) to good (SRMR ≤ .05) model fit. It is clear that the fit indices for the HOPE Scale were closer to traditional criteria (Brown, 2006; Hu & Bentler, 1999) as measured by the incremental fit indices (CFI = .93, TLI = .91). These model fit estimates are comparable to the results reported in the HOPE Scale technical manual (RMSEA = .101, CFI = .96, goodness-of-fit index [GFI] = .91), though in places still do not meet traditional thresholds for goodness of fit (e.g., RMSEA).
Table 5. Fit Indices for General and Subgroup CFA for the HOPE Scale.
Note. All analyses with bootstrapping and maximum likelihood estimation. CFA = confirmatory factor analysis; HOPE Scale = HOPE Teacher Rating Scale; CFI = comparative fit index; TLI = Tucker–Lewis index; RMSEA = root mean square error of approximation; CI = confidence interval; SRMR = standardized root mean square residual; FRPL = free or reduced-price lunch.
Underrepresented groups included students from African American, Latino/a, and Mixed Race Families.
Overrepresented groups included students from Caucasian and Asian American families.
Results of the income group comparison for the HOPE Scale show slightly better fit with the non-FRPL student data. CFIs were .94 for the non-FRPL group versus .91 for the FRPL group, and the TLI of .89 for the FRPL group indicates a lack of fit. The HOPE Scale showed acceptable fit for the underrepresented-specific CFAs but also showed extremely high RMSEA values. Some of this lack of fit could be due to the negative biasing of CFI/TLI fit statistics and positive biasing of SRMR fit statistics that can occur when bootstrapping is used (Bollen & Stein, 1992; Fan & Thompson, 1998). Overall, the GFIs for the HOPE Scale were acceptable and fairly similar to the fit statistics reported in previous studies. The exceptions were the RMSEA estimates, which were well above the cutoff for good model fit, and the TLI (.89) and SRMR (.06) for the FRPL group, which failed to meet the criteria for acceptable fit.
SIGS
Table 6 includes descriptive statistics for the SIGS. Interfactor correlations ranged from .677 to .881, which were generally stronger than the correlations reported by Ryser and McConnell (2004) in the SIGS technical manual for the general normative sample (.49 to .81). Internal consistency estimates for the SIGS ranged from .948 to .981 for the sample used in the present study, and Ryser and McConnell reported alphas ranging from .93 to .97 for the general normative sample.
Table 6. Descriptive Statistics for SIGS (n = 204).
Note. SIGS = Scales for Identifying Gifted Students.
Correlation is significant at the .01 level (two-tailed).
Unfortunately, all of the CFI, TLI, and RMSEA values for the SIGS for the different groups included in our analyses failed to meet traditional criteria for acceptable model fit. Table 7 includes results of general and group comparison CFAs. Comparison of SIGS CFAs for the FRPL and non-FRPL groups showed similarly poor model fit across the two groups: neither was clearly worse than the other, and both failed to meet traditional fit criteria. CFI and TLI estimates ranged from .58 to .69, and RMSEA estimates were .111 for FRPL students and .138 for non-FRPL students. SRMR of .06 (FRPL) and .07 (non-FRPL) were the only GFIs considered acceptable. CFI for underrepresented and overrepresented students (i.e., race/ethnicity comparison) were .59 and .71, respectively. Comparable values were found for other fit indices (e.g., TLI, RMSEA), and SRMR was .06 for both the underrepresented and overrepresented groups.
Table 7. Fit Indices for General and Subgroup CFA for the SIGS.
Note. All analyses with bootstrapping and maximum likelihood estimation. SIGS = Scales for Identifying Gifted Students; CFI = comparative fit index; TLI = Tucker–Lewis index; RMSEA = root mean square error of approximation; CI = confidence interval; SRMR = standardized root mean square residual; CFA = confirmatory factor analysis; FRPL = free or reduced-price lunch.
Underrepresented groups included students from African American, Latino/a, and Mixed Race Families.
Overrepresented groups included students from Caucasian and Asian American families.
Ryser and McConnell (2004) did not report factor analysis results in the SIGS technical manual, so direct comparison of GFIs is not possible.
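As a rough illustration of the screening logic applied to the SIGS subgroup results, the following Python sketch checks the income-group values quoted in the text against a set of conventional cutoffs. The cutoffs (CFI of at least .90, RMSEA and SRMR of at most .08) and all names in the code are assumptions made for this example; they are not necessarily the exact criteria or procedure used in the study.

```python
# Assumed cutoffs for illustration only; the study's exact criteria may differ.
CUTOFFS = {
    "CFI": (">=", 0.90),
    "RMSEA": ("<=", 0.08),
    "SRMR": ("<=", 0.08),
}

# SIGS income-group values as quoted in the text.
SIGS_FIT = {
    "FRPL": {"CFI": 0.69, "RMSEA": 0.111, "SRMR": 0.06},
    "non-FRPL": {"CFI": 0.60, "RMSEA": 0.138, "SRMR": 0.07},
}


def meets_cutoff(index, value):
    """Return True if a fit index value satisfies its assumed cutoff."""
    direction, threshold = CUTOFFS[index]
    return value >= threshold if direction == ">=" else value <= threshold


def acceptable_indices(fit):
    """List the fit indices in `fit` that satisfy their assumed cutoffs."""
    return [name for name, value in fit.items() if meets_cutoff(name, value)]


for group, fit in SIGS_FIT.items():
    print(group, acceptable_indices(fit))
```

Under these assumed cutoffs, SRMR is the only index that passes for either income group, which matches the pattern described for the SIGS.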
Discussion
General Model Fit
With regard to this paper’s hypothesis, data gathered from all three instruments failed to reach established criteria for fit to their respective models. This means that the theoretical models published with each instrument were not reproduced by the data gathered. Some of these findings could be partially explained by the use of relatively small samples and bootstrap methods that can yield negatively biased fit indices (Bollen & Stein, 1992; Fan & Thompson, 1998). However, given how far some of the SIGS and GRS results were from meeting traditional fit criteria, sample size and bootstrapping were unlikely to be the sole cause of the lack of fit. In the case of the SRMR values, the positive bias associated with small samples and bootstrapping may have made the difference, as many of the values were much closer to the criteria for acceptable fit (Hu & Bentler, 1999).
It is important to note that although the HOPE Scale data better fit its intended theoretical model, it also includes only two factors with a total of 11 items. This is important as many traditional fit indices are influenced either directly or indirectly by model complexity (CFI values include a penalty for every parameter estimated). As CFA evaluates how well an instrument’s accompanying model fits the data obtained when that instrument is used empirically, model complexity is a factor. Unfortunately, regardless of the cause, a lack of model fit to the data, which was indicated to various degrees for all three instruments, means the instruments did not function as designed and therefore have internal validity issues that could prevent them from yielding valid data regarding gifted identification decisions (Brown, 2006). Put simply, the results mean that, to varying degrees, the instruments do not measure their intended constructs.
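To make the role of model complexity concrete, the sketch below computes RMSEA, CFI, and TLI from chi-square statistics using commonly cited textbook formulas, in which the model degrees of freedom (and therefore the number of estimated parameters) enter each index. The formulas are the standard ones, but software packages differ in details (for example, using N rather than N − 1 in RMSEA), and the input values are hypothetical numbers chosen for illustration, not results from this study.

```python
import math


def rmsea(chi2_model, df_model, n):
    """RMSEA from the model chi-square, its degrees of freedom, and sample size n."""
    return math.sqrt(max(chi2_model - df_model, 0.0) / (df_model * (n - 1)))


def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index relative to a baseline (independence) model."""
    numerator = max(chi2_model - df_model, 0.0)
    denominator = max(chi2_model - df_model, chi2_base - df_base, 0.0)
    return 1.0 if denominator == 0 else 1.0 - numerator / denominator


def tli(chi2_model, df_model, chi2_base, df_base):
    """Tucker-Lewis index, based on chi-square/df ratios of both models."""
    base_ratio = chi2_base / df_base
    model_ratio = chi2_model / df_model
    return (base_ratio - model_ratio) / (base_ratio - 1.0)


# Hypothetical inputs (NOT taken from the study), for illustration only.
example = {"chi2_model": 100.0, "df_model": 40, "chi2_base": 1000.0, "df_base": 55}

print("RMSEA:", round(rmsea(example["chi2_model"], example["df_model"], n=101), 3))
print("CFI:  ", round(cfi(**example), 3))
print("TLI:  ", round(tli(**example), 3))
```

Because degrees of freedom appear in every formula, two models with the same raw chi-square misfit can receive different index values depending on how many parameters they estimate.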
Model Fit by Income and Racial/Ethnic Groups
In addition to the general CFA results for each instrument, we presented model fit estimates for income (FRPL vs. non-FRPL students) and racial/ethnic groups (students from underrepresented vs. overrepresented populations), because each instrument had some level of prior research base related to subgroup analyses. For the HOPE Scale, fit indices comparing the two income or racial/ethnic groups showed very similar results to past research. Comparing the present results with those from past studies, for the two racial/ethnic groups, SRMR values were both .05, TLI values were both .90, and RMSEA values were equally problematic at .124 and .133. The SIGS and GRS did not show the same similarity of fit across groups. For example, for the SIGS, the fit statistics for overrepresented students were far better (CFI = .71, RMSEA = .107) than those for underrepresented students (CFI = .59, RMSEA = .138). The GRS showed the opposite pattern, with underrepresented groups showing better overall fit. Unfortunately, a lack of comparable fit across groups is still a problem regardless of which group fit better.
The HOPE Scale also had TLI estimates indicating inadequate model fit (i.e., <.90) for the FRPL group. This particular lack of similarity of fit due to income is consistent with previous research on the HOPE Scale (Peters & Gentry, 2012) and makes sense given research regarding gifted and talented students from low-income families (Wyner, Bridgeland, & DiIulio, 2009). CFI and TLI estimates for both the SIGS and GRS indicated poor model fit. For example, for the SIGS, CFI estimates were .60 for the non-FRPL group versus .69 for the FRPL group, and for the GRS, CFI values were .67 versus .84 for high-income and low-income students, respectively. Similar disparities in fit were observed for the other fit indices. For the SIGS and the GRS, these results pointed to the data from the low-income students fitting the instruments’ models better than did the data from the high-income students. This same finding was not observed with the HOPE Scale. However, for income group comparisons, all three instruments failed to reach acceptable fit criteria for RMSEA values and were mixed for SRMR values. The lack of fit of the SIGS contradicts previous research on the SIGS, which stated the instrument was bias free (Ryser & McConnell, 2004). We know of no prior research on the GRS that evaluated income group differences. The consistency in these results across all three instruments points to income as a more influential factor in teacher ratings than race or ethnicity, at least with regard to data fit. The degree to which this indicates a form of DIF versus students demonstrating different levels of gifted-related observable behaviors because of fewer educational opportunities remains unknown.
These findings could indicate a type of bias on the part of all three instruments, or they could simply mean that students from low-income homes, due to fewer educational opportunities (Peters & Engerrand, 2016), demonstrate fewer behaviors related to giftedness than do their high-income peers and that those observable behaviors (or lack thereof) translated into different scores on published teacher rating scales.
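The direction of the income-group differences discussed above can be tabulated with a short Python sketch. It uses only the CFI values quoted in the text (reading the .60 versus .69 pair as the SIGS values, which is consistent with the SIGS range reported in the results); the variable and function names are hypothetical, and a simple difference in CFI is a descriptive summary, not a formal test of measurement invariance.

```python
# CFI by income group as quoted in the text:
# (low-income / FRPL, higher-income / non-FRPL)
CFI_BY_INCOME = {
    "HOPE": (0.91, 0.94),
    "SIGS": (0.69, 0.60),
    "GRS": (0.84, 0.67),
}


def cfi_gap(low_income, high_income):
    """Positive values mean the low-income data fit the model better."""
    return round(low_income - high_income, 2)


gaps = {name: cfi_gap(*pair) for name, pair in CFI_BY_INCOME.items()}
better_for_low_income = [name for name, gap in gaps.items() if gap > 0]

for name, gap in gaps.items():
    print(f"{name}: CFI gap (low-income minus higher-income) = {gap:+.2f}")
print("Better fit for low-income group:", better_for_low_income)
```

The sign of each gap reproduces the pattern described in the text: the low-income data fit better for the SIGS and GRS, whereas the HOPE Scale shows a small difference in the other direction.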
The small lack of fit similarity across race/ethnicity and income groups is consistent with what was found with the HOPE Scale (Peters & Gentry, 2012) and differs from research on the GRS (Pfeiffer & Jarosewich, 2007), which found no race or ethnic differences. Although the current study did not conduct mean-difference tests, as they are an inappropriate method for bias evaluation involving latent constructs (French & Finch, 2006; Thompson & Green, 2006), the lack of similarity of fit would be a substantial barrier to making valid inferences. The present study also contradicts research presented in the SIGS technical manual (Ryser & McConnell, 2004), which reported that biased items had been removed before publication of the version used in this study.
Conclusion
Data from teachers are the most common catalyst through which students are identified for gifted education programs or services (NAGC, 2015). Given this fact, the importance of evaluating any instruments or procedures related to teacher nominations or ratings would seem to be self-evident. For any instrument to yield valid data regarding its intended constructs, there must be evidence of internal validity, this validity must hold across student subgroups, and it must be independently replicated in various contexts to yield high levels of confidence. What the results of this study have shown is that even some of the long-standing and most rigorously developed instruments still fail to meet traditional guidelines for model fit, and this could limit the ability of these instruments to yield valid data regarding who is in need of gifted and talented interventions. If the results from the present study are accurate, then the use of any of the three studied instruments would contribute to decreased identification accuracy and increased rates of false negatives and incorrect identifications (McBee et al., 2014). This study is an example of the kind of replication research recently called for in psychology (Makel, 2014) and in gifted education (Makel & Plucker, 2015) to continue to test and further refine the existing knowledge base regarding psychological concepts and phenomena. A wider evaluation of across-group validity evidence as well as further refinement of each instrument’s model is warranted before the instruments used in this study are further utilized for gifted student identification.
Limitations
This study includes a few important limitations. First, one school district was used as a data source. Although this school district was chosen to be as representative of the general U.S. population as possible, this is still a limitation, and further replication is necessary. The size of the sample was also problematic. Although random sampling was used to ensure that the sample was closely representative of the schools included in the sampling frame, ideally, structural equation modeling methods should use large samples. The size of the samples used in this study, even after bootstrapping was used to correct for some of the bias, could have yielded less accurate results. With further regard to the sample, this study only looked at specific groups that are often underrepresented. It did not include English Language Learners, students with disabilities, or Native American students (referenced earlier). Finally, as noted earlier, the three instruments measure different constructs or conceptual models of giftedness even though they are all meant to help identify gifted and talented students. This fact, as well as different degrees of model complexity, may have contributed to the results of the study.
Footnotes
Authors’ Note
The raw data for this paper may be obtained from the authors’ Open Science Framework webpage at: osf.io/x39c5
Declaration of Conflicting Interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: The authors of this study were part of the team that developed the HOPE Scale and have published prior research on it.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Funding for this study came from the University of Wisconsin System Institute on Race and Ethnicity.
