Abstract
This article describes the statistical methods used in quantitative and mixed methods articles between 2006 and 2010 in five gifted education research journals. Results indicate that the most commonly used statistical methods are means (85.9% of articles), standard deviations (77.8%), Pearson’s r (47.8%), χ² (32.2%), ANOVA (30.7%), t tests (30.0%), and MANOVA (23.0%). Approximately half (53.3%) of the articles included reliability reports for the data at hand; Cronbach’s alpha was the most commonly reported measure of reliability (41.5%). Some discussions of best statistical practice and implications for the field of gifted education are included.
From time to time, it is necessary for those in a scientific field to take stock of the recent research in order to assess its content and to better direct future studies. Researchers in the field of gifted education seem to be particularly disposed to this internal assessment in recent years. For example, Dai, Swanson, and Cheng (2011) analyzed more than 1,200 empirical studies, published over the course of 12 years, to determine the strength of the link between theory and research and to identify major trends in research among researchers of the gifted. Similarly, Coleman, Guo, and Dabbs (2007) examined 124 nominally qualitative studies published in gifted education journals between 1985 and 2003. Surprisingly, the authors found that only 40—just under one third—were fully consistent with the qualitative paradigm.
Another review of gifted education articles was performed by Parker and her colleagues in which they analyzed articles published in four major gifted education journals between 2001 and 2006 (Parker, Jordan, Kirk, Aspiranti, & Bain, 2010). They found that almost half of the articles (46%) were narratives (e.g., a description of a successful gifted program). The next largest group of articles was quantitative research reports, which comprised 35% of published work. About 16% of articles were qualitative in nature and 3% fell into multiple categories. The authors also found that, of the quantitative research articles, most were nonexperimental group comparisons or correlational studies.
A fundamental recent review of gifted education research was conducted by Matthews et al. (2008), in which they evaluated the reporting of effect sizes in five gifted education journals from 1996 to 2005. In their study, Matthews et al. found that effect size reporting had been increasing in recent years. Nevertheless, effect sizes were never reported in more than 60% of quantitative articles in a given year. This proportion may continue to increase because in 2006 the then-new editors of the Journal of Advanced Academics began requiring authors to report effect sizes for studies, where appropriate (Matthews et al., 2008; McCoach & Siegle, 2009). Since the publication of their assessment of effect size reporting, other researchers have written works aimed at gifted education researchers in strong support of reporting effect sizes (e.g., Fidler, 2010; Gentry & Peters, 2009).
Current Review of Literature
In this article, we undertook a similar review of published journal articles in gifted education to determine (a) which statistical methods are used in gifted education research, (b) whether some statistical methods are more common in certain journals, and (c) whether there are discernible trends in the use of various statistical methods over time. In a way, this review is an extension of the Matthews et al. (2008) article. Instead of concentrating solely on effect sizes, however, we have considered all types of statistical procedures to better understand the current state of the field’s use of statistics.
Reviews of statistical procedures are published periodically. The earliest we have been able to find was by Edgington (1964), who examined statistical methods used in six leading psychology journals from 1948 to 1962. Edgington found that, in that time, the use of analysis of variance (ANOVA) increased sharply—from 11% to 55%—and the use of the t test declined from 51% to 19%. Correlational methods decreased somewhat from 42% to 24%, whereas nonparametric and factor analysis methods increased roughly fivefold. In a follow-up article published a decade later, Edgington (1974) found that the use of ANOVA had continued to increase until it was found in 71% of articles, whereas the t test continued its decline and was found in only 12% of articles published in 1972. Nonparametric methods declined in the early 1970s, whereas the use of other methods stayed approximately constant.
After Edgington’s efforts, other authors began to examine the statistical methods used in other journals in education and psychology. Willson (1980) examined research techniques published from 1969 to 1978 in the American Educational Research Journal. Similar to Edgington’s results, Willson found that ANOVA was the most common statistical technique, although in his review it was found in only 56% of articles. For Willson, Pearson’s r correlation was also the second most common technique, followed by multiple regression.
More recent reviews have found that the use of classical statistics such as ANOVA and correlation has declined, likely due to the explosion of statistical techniques and the ease of access to personal computers, which make performing complex statistical procedures much easier than it was in Edgington’s time. Bangert and Baumberger (2005), for example, found that correlation was the most common statistical method used in the Journal of Counseling & Development between 1990 and 2001, yet correlations were found in only 12% of articles. Multiple regression was the second most common method and was found in 8% of articles. MANOVA/MANCOVA, which occurred in 7% of research studies, was the third most common method. Similarly, Kieffer, Reese, and Thompson (2001) evaluated articles from the American Educational Research Journal and Journal of Counseling Psychology from 1988 to 1997 and found that the most common methods were (in descending order) ANOVA, correlation, multiple regression, factor analysis, and MANOVA.
These and other reviews of statistical procedures were themselves reviewed and combined into one large review of techniques published by Skidmore and Thompson (2010), which spanned 12,012 articles published between 1948 and 2001. The authors found that the use of statistical procedures tends to wax and wane with time and that there is rarely a constant upward or downward trend for any particular statistical method. For example, in education articles, the use of factor analysis was increasing during the 1960s and 1970s, but decreased in the 1980s before increasing again in the 1990s. Skidmore and Thompson also found a large degree of variability in the use of any particular statistical technique from year to year. Despite these caveats, it was apparent that factor analysis was increasing in use in psychology and that the use of t tests was in decline in both education and psychology.
Conducting reviews of methodological techniques is an important exercise for a variety of reasons. First, such reviews help researchers in a specific field learn what methodological training is necessary for a person to understand the quantitative research conducted in that field. Second, reviews shed light on subtle methodological trends that may be occurring in a field over the span of years. Finally, statistical reviews permit researchers to evaluate how well their field compares with the accepted standards of statistical reporting.
Procedures
Before examining the statistical methods used in articles, it was necessary for us to establish what was important to code. For guidance, we turned to several sources. The first was previous reviews of statistical methods in education and psychology (i.e., Bangert & Baumberger, 2005; Baumberger & Bangert, 1996; Edgington, 1964, 1974; Elmore & Woehlke, 1988; Kieffer et al., 2001; Skidmore & Thompson, 2010; Willson, 1980). We also consulted published standards of statistical reporting from the American Psychological Association (2010; APA Publications and Communications Board Working Group on Journal Article Reporting Standards, 2008; Wilkinson & the Task Force on Statistical Inference, 1999) and the American Educational Research Association (AERA; 2006). Finally, we consulted works on meta-analysis, which often provide important insight on the type of information that should be reported in original articles (e.g., Cooper, 2010; Lipsey & Wilson, 2001).
Because it would be exceptionally difficult (and unwieldy) to code all the information suggested in the previous sources, we focused on statistical methods (including effect sizes) and reliability. For the statistical methods, we recorded every statistical procedure that the original article’s author(s) carried out during the course of the study. Statistical methods were not recorded if they were mentioned during the course of the author’s literature review or if they were merely suggested in the discussion or conclusion sections.
Because many statistical procedures used in gifted education research are interrelated through the General Linear Model (Cohen, 1968a; Thompson, 2006), it is possible to code some in multiple categories. For example, a regression with a dichotomous dependent variable could be coded as “logistic regression” or “multinomial regression” because subjecting the same data to both procedures would produce identical results. In coding statistical procedures, we always classified a method under the most specific name possible. Readers should also note that a particular statistical method was only classified once. For example, this regression with a dichotomous dependent variable was only classified as “logistic regression.” However, many authors used multiple statistical procedures within the same article; all these were coded as being used in the same article. Finally, if a statistical method was used more than once in a particular article, it was counted once in our results.
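The counting rule described above (a method is counted at most once per article, regardless of how often it was used) can be sketched as follows. This is only an illustration: the article identifiers and method lists are hypothetical and are not data from this review.

```python
from collections import Counter

def count_articles_per_method(coded_articles):
    """Count how many articles used each method, counting a method once per article."""
    counts = Counter()
    for methods in coded_articles.values():
        # A set removes repeated uses of the same method within one article.
        counts.update(set(methods))
    return counts

# Hypothetical coding sheet: article id -> every procedure the authors carried out.
coded = {
    "article_1": ["mean", "t test", "t test", "Pearson's r"],
    "article_2": ["mean", "logistic regression"],
    "article_3": ["mean", "ANOVA", "Tukey test", "ANOVA"],
}

counts = count_articles_per_method(coded)
n_articles = len(coded)
for method, n in sorted(counts.items()):
    print(f"{method}: {n} article(s), {100 * n / n_articles:.1f}%")
```

Because each article contributes a set of methods, the percentages across methods can sum to more than 100%.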
A wide variety of reliability information could potentially be coded during the review of statistical procedures. However, we decided to focus on a few of the most important aspects of reliability, namely, whether the authors reported reliability information for their own data and the type of reliability statistic reported. We believed that recording whether the authors reported their own reliability information was important because previous research had shown that the practice of reliability induction—where a researcher takes a reliability statistic from one sample and applies it to another—is distressingly common (e.g., Deditius-Island & Caruso, 2002; Kieffer & Reese, 2002; Shields & Caruso, 2004; Vacha-Haase, Kogan, & Thompson, 2000). Warne (2011) called reliability induction “inherently erroneous” (p. 675) and said that the practice was “as logically defensible as would be an attempt to take the mean from one sample and apply it to data collected from a completely different sample” (p. 676). We attempted to code instances of reliability induction, but found that the judgments were too subjective and that there was little agreement among coders. Therefore, we instead recorded whether the author(s) of an article made it clear that the reliability statistics they reported applied to the data at hand (and not some other sample, such as the standardization sample from the test manual).
We examined articles from five major journals that frequently publish research on giftedness and gifted education. The journals were (in alphabetical order) Gifted Child Quarterly (GCQ), High Ability Studies (HAS), the Journal of Advanced Academics (JAA; this journal was called Journal of Secondary Gifted Education until late 2006), the Journal for the Education of the Gifted (JEG), and Roeper Review (RR). All articles published between 2006 and 2010 in these journals were included in the study. The articles were divided among the four authors and each article was read independently by two of the authors of this study. The two coders then met and discussed the results. When differences occurred, the original article was always consulted and agreement reached. The first author is a quantitative psychologist and the other three authors mastered the most common univariate and multivariate statistical methods and passed a rigorous course on structural equation modeling (SEM).
Results
Table 1 shows that in total we reviewed 697 articles. Of these, we determined that 234 (33.6%) were quantitative research reports—a proportion very similar to what Parker et al. (2010) found in their review of gifted education articles from 2001 to 2006. Ninety-nine articles (14.2%) were qualitative research reports and 36 (5.2%) were reports from mixed methods research projects. The remaining 328 articles (47.1%) were not research reports at all (i.e., editorials, theoretical pieces, and other nonresearch articles). Table 1 also shows the number and percentage of each article type that was published in each journal and each year. Only quantitative and mixed methods articles were analyzed further. The results of the analyses are displayed in Tables 2 and 3.
Table 1. Number and Types of Articles Published in Gifted Education Journals, Organized by Journal and Year, 2006-2010
Note. GCQ = Gifted Child Quarterly; HAS = High Ability Studies; JAA = Journal of Advanced Academics; JEG = Journal for the Education of the Gifted; RR = Roeper Review.
Table 2. Statistical Methods Used in Gifted Education Journals, 2006-2010, Organized by Journal
Note. GCQ = Gifted Child Quarterly; HAS = High Ability Studies; JAA = Journal of Advanced Academics; JEG = Journal for the Education of the Gifted; RR = Roeper Review; CFA = confirmatory factor analysis; PCA = principal components analysis; HLM = hierarchical linear modeling; IRT = item response theory; ICC = intraclass correlation coefficient; ROC = receiver operating characteristic curve; BIC = Bayesian information criterion; EFA = exploratory factor analysis; ANOVA = analysis of variance; ANCOVA = analysis of covariance; MANOVA = multivariate analysis of variance; MANCOVA = multivariate analysis of covariance. The following miscellaneous statistical methods were used only once: sensitivity/specificity values (GCQ), likelihood ratio (GCQ), correct classification rates (GCQ), Tobit regression (GCQ), simulated data generation (GCQ), unspecified nonparametric correlation (HAS), difference in correlation (HAS), self-organizing map (HAS), Huynh–Feldt correction (HAS), Puri and Sen’s L statistic (HAS), IRT-based statistics (HAS), latency and amplitude (HAS), Monte Carlo bootstrapping (HAS), Mardia’s coefficient (HAS), probability ratio (HAS), Fisher’s r-to-z transformation (JAA), reliability change index (JAA), test of heteroscedasticity (JAA), test of autocorrelation (JAA), EM single imputation (JAA), Shapiro–Wilk test (JAA), test of homogeneity of slopes (JEG), Q factor analysis (RR), Hotelling’s multivariate T (RR), false discovery method (RR), Grizzle–Starmer–Koch method (RR), and latent class analysis (RR).
Table 3. Statistical Methods Used in Gifted Education Journals, 2006-2010, Organized by Year
Note. CFA = confirmatory factor analysis; PCA = principal components analysis; HLM = hierarchical linear modeling; IRT = item response theory; ICC = intraclass correlation coefficient; ROC = receiver operating characteristic curve; BIC = Bayesian information criterion; EFA = exploratory factor analysis; ANOVA = analysis of variance; ANCOVA = analysis of covariance; MANOVA = multivariate analysis of variance; MANCOVA = multivariate analysis of covariance. The following miscellaneous statistical methods were used only once: sensitivity/specificity values (2008), likelihood ratio (2008), correct classification rates (2008), Tobit regression (2010), simulated data generation (2010), unspecified nonparametric correlation (2009), difference in correlation (2006), self-organizing map (2006), Huynh–Feldt correction (2007), Puri and Sen’s L statistic (2007), IRT-based statistics (2010), latency and amplitude (2008), Monte Carlo bootstrapping (2009), Mardia’s coefficient (2009), probability ratio (2010), Fisher’s r-to-z transformation (2007), reliability change index (2008), test of heteroscedasticity (2008), test of autocorrelation (2008), EM single imputation (2008), Shapiro–Wilk test (2010), test of homogeneity of slopes (2010), Q factor analysis (2010), Hotelling’s multivariate T (2006), false discovery method (2006), Grizzle–Starmer–Koch method (2007), and latent class analysis (2010).
Descriptive Statistics
The vast majority of authors of quantitative and mixed methods articles reported descriptive statistics for their data. Means were reported for 85.9% of articles and standard deviations for 77.8% of articles. We never found an instance where a standard deviation was reported without an accompanying mean. However, means were often reported by themselves without accompanying standard deviations. This is largely recognized as poor practice (e.g., Thompson, 2006) and makes articles harder to include in meta-analyses (Cooper, 2010).
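To illustrate why a mean reported without a standard deviation is hard to use in a meta-analysis, consider a standardized mean difference such as Cohen's d, which cannot be computed from means alone. The sketch below uses the common pooled-standard-deviation form of d; the summary statistics are hypothetical.

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Hypothetical summary statistics for two groups.
d = cohens_d(mean1=105.0, sd1=10.0, n1=30, mean2=100.0, sd2=10.0, n2=30)
print(round(d, 2))
```

Without sd1 and sd2, a meta-analyst would have to estimate the effect size from other reported statistics or exclude the study.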
t Tests and -OVA Methods
The simplest parametric inferential statistic is the t test, which examines whether the means of two groups are statistically different. We found that t tests were quite common in the gifted education literature and occurred in a total of 30.0% of articles. The next simplest parametric inferential statistic is ANOVA, which compares multiple group means to determine if one or more are statistically significantly different from the other(s). ANOVA was slightly more popular than t tests among gifted education researchers, with 30.7% of articles using ANOVA. Analysis of covariance (ANCOVA), which is the same as an ANOVA except with a covariate included in the statistical model, was found in a total of 11 articles (3.9%) and was therefore not very common.
The multivariate extensions of ANOVA and ANCOVA, called multivariate analysis of variance (MANOVA) and multivariate analysis of covariance (MANCOVA), were less common than their univariate counterparts. MANOVA was used in 23.0% of articles; MANCOVA was found in only two articles (0.7%).
Statistically significant -OVA results are often followed up by post hoc tests to determine exactly where the difference(s) between groups lie (Bray & Maxwell, 1985; Stevens, 2002). For ANOVA and ANCOVA, the most common post hoc tests were the Tukey test (11 articles), post hoc t tests (9 articles), and the Scheffé test (3 articles). Dunnett’s t and the Student–Newman–Keuls post hoc tests were each found in a single article. Authors of six articles indicated that they ran post hoc tests after their ANOVA or ANCOVA, yet did not specify which post hoc test(s) were performed.
By far the most common post hoc statistical test for MANOVA or MANCOVA was a series of ANOVAs, which occurred in 31 articles—almost half of all articles that used MANOVA or MANCOVA. The other specified post hoc tests after MANOVA or MANCOVA were Tukey’s test (6 articles), discriminant analysis (5 articles), ANCOVA (2 articles), t tests (2 articles), the Games–Howell test (1 article), and the Roy–Bargman test (1 article). Authors of six articles that conducted multivariate null hypothesis statistical significance testing did not specify which post hoc test(s) they conducted.
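The close relationship between the t test and ANOVA described in this section can be shown with a small worked example: with exactly two groups, the squared pooled-variance t statistic equals the one-way ANOVA F statistic. The sketch below computes both by hand with hypothetical scores; it is an illustration, not a replacement for a statistical package.

```python
from statistics import mean

def sum_sq(values, center):
    """Sum of squared deviations of values from center."""
    return sum((v - center) ** 2 for v in values)

def pooled_t(group_a, group_b):
    """Independent-samples t statistic with pooled variance."""
    na, nb = len(group_a), len(group_b)
    ma, mb = mean(group_a), mean(group_b)
    pooled_var = (sum_sq(group_a, ma) + sum_sq(group_b, mb)) / (na + nb - 2)
    return (ma - mb) / (pooled_var * (1 / na + 1 / nb)) ** 0.5

def one_way_f(*groups):
    """One-way ANOVA F statistic (between-groups MS over within-groups MS)."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum_sq(g, mean(g)) for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical scores for two groups.
a = [12.0, 15.0, 11.0, 14.0, 13.0]
b = [16.0, 18.0, 15.0, 17.0, 19.0]
t = pooled_t(a, b)
f = one_way_f(a, b)
print(t ** 2, f)  # the two values agree up to floating-point error
```

With three or more groups only one_way_f applies, which is why ANOVA is treated as the next step up from the t test.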
Correlational and Regression Procedures
The most basic correlational procedure is Pearson’s r, which examines the statistical relationship between two interval or ratio scaled variables. This technique was the most common statistical method we found in the gifted literature, after means and standard deviations—making an appearance in 47.8% of all quantitative or mixed methods articles. Other correlational techniques were much less common; Cramér’s V was found in 11 articles (4.1%), Spearman’s ρ and the point–biserial correlation were each found in five articles (1.9%), and the phi correlation was found in three articles (1.1%). Kendall’s tau and the tetrachoric correlation were each found in a single article (0.4%).
Multiple regression—an extension of Pearson’s r applied to situations with multiple independent variables—was a commonly found regression procedure and appeared in 40 articles (14.9%). Similar to the situation with correlational procedures, other regression procedures were comparatively rare: logistic regression was found in seven articles (2.6%), canonical correlation appeared in four articles (1.5%), stepwise regression was found in six articles (2.2%), ordinal regression was found in two articles (0.7%), and multinomial logistic regression was found in a single article (0.4%).
Complex Statistical Methods
With the increased access to personal computers in the past few decades, the methodological literature has exploded with new techniques, reevaluations of old techniques, and new procedures to make existing statistical techniques easier to implement. This increased access to statistical methods has had an impact on gifted education research.
Hierarchical linear modeling
Hierarchical linear modeling (HLM) is a technique that compensates for situations where data are clustered into groups, such as in a clustered sampling plan (see Ferron et al., 2008; Hox, 2002; McCoach, 2010a, 2010b; McCoach & Adelson, 2010; Raudenbush & Bryk, 2002; Warne et al., in press). HLM is not commonly found in the gifted education literature; the method was only used in eight articles (3.0%)—all of which were published in GCQ or JAA.
Data reduction methods
Data reduction methods are designed to create new synthetic variables that summarize a set of observed variables in a more parsimonious manner (Thompson, 2004). There are two main types of data reduction, exploratory factor analysis (EFA) and principal components analysis (PCA), which largely produce similar results (O’Connor, 2000). We found that the gifted education research community uses the two methods about equally often: EFA was used in 18 articles (6.7%), and PCA was used in 17 articles (6.3%). Authors of four articles used some type of data reduction method but did not make clear which one.
Whether one chooses to perform an EFA or a PCA, a variety of subjective decisions come into play (Costello & Osborne, 2005; Thompson, 2004). One of the initial judgments is to choose a rotation method to produce interpretable results. We found that varimax rotation was the most common method used in gifted education and was found in 20 articles. The next most common methods specified were oblimin (6 articles), promax (5 articles), direct oblimin (2 articles), and quartimax (1 article). In eight articles the author(s) did not specify their rotation method.
Quantitative methodologists generally agree that the most important decision that someone makes when conducting an EFA or PCA is the number of factors to retain (Costello & Osborne, 2005; Larsen & Warne, 2010; Zwick & Velicer, 1982, 1986). Given the importance of factor retention methods, we were disappointed to find that in 18 articles gifted education researchers did not specify how they determined the number of factors to retain in their EFA or PCA. This count surpassed even that of the most commonly specified method for determining the number of factors, the Guttman rule (also called the K1 rule or eigenvalues-greater-than-one rule), which was used in 16 articles. The other factor retention methods used were the scree test (15 articles), the variance accounted for by the number of factors (8 articles), the Kaiser–Meyer–Olkin test of sampling adequacy (5 articles), a priori theory (4 articles), parallel analysis (4 articles), Bartlett’s test of sphericity (3 articles), and visual inspection of the item loadings (1 article). It is important to note that some authors used more than one factor retention method when making their decision.
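Two of the retention criteria named above, the Guttman (K1) rule and the variance accounted for by the retained factors, are simple enough to sketch directly. The example assumes the eigenvalues of a correlation matrix have already been obtained from statistical software; the values shown are hypothetical.

```python
def k1_rule(eigenvalues):
    """Number of factors retained under the eigenvalues-greater-than-one rule."""
    return sum(1 for ev in eigenvalues if ev > 1.0)

def cumulative_variance(eigenvalues):
    """Cumulative proportion of variance accounted for by successive factors."""
    total = sum(eigenvalues)
    running, proportions = 0.0, []
    for ev in sorted(eigenvalues, reverse=True):
        running += ev
        proportions.append(running / total)
    return proportions

# Hypothetical eigenvalues for six observed variables (they sum to 6,
# as eigenvalues of a 6 x 6 correlation matrix must).
eigenvalues = [2.9, 1.4, 0.7, 0.5, 0.3, 0.2]
print(k1_rule(eigenvalues))
print([round(p, 2) for p in cumulative_variance(eigenvalues)])
```

Reporting which of these criteria was applied, and the eigenvalues themselves, lets readers judge the retention decision for themselves.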
A result of EFA or PCA is the factor loading matrix and—in the case of nonorthogonal rotation methods—a structure matrix. We found that 23 articles that used a data reduction method reported a loading matrix; only 5 structure matrices were reported.
Path analysis
We found that path analysis was a rarely used statistical method in the recent gifted education literature. Only four articles (1.5%) from the past 5 years were found to have used path analysis.
Latent variable methods
Latent variable methods are statistical methods that use observed data to hypothesize relationships among variables—some of which are not directly observable. For the purposes of this review, we defined three types of latent variable models: confirmatory factor analysis (CFA) and measurement models, structural models, and tests of invariance. CFA may be one of the most common advanced statistical methods used in gifted education, and CFA/measurement models appeared in 24 articles (8.9%). Structural models were less common, being found in only 5 articles (1.9%) from the past 5 years. Tests of invariance—which often use SEM and CFA—were conducted in 5 articles (1.9%).
Like data reduction methods, CFA, SEM, and tests of invariance require researchers to report a substantial amount of detail in order for their procedures to be properly evaluated. One reporting requirement is for researchers to report a covariance matrix of all observed variables in a model or a correlation matrix with accompanying means and standard deviations (Kline, 2005). Authors of only nine articles met this reporting requirement.
Leading authorities on SEM (e.g., Kaplan, 2000; Kline, 2005) recommend that users report the estimation method used by their statistical software package when working with latent variable models. We found that the most common estimation method was maximum likelihood (ML; 9 articles), followed by weighted least squares mean/variance adjusted (2 articles), and robust ML (1 article). In the majority of articles that used a latent variable model (17 out of 29), the estimation method was not reported.
One critical assumption of latent variable methods when using ML estimation is multivariate normality (Kaplan, 2000; Kline, 2005). Therefore, it is expected that researchers who use ML estimation for their latent variable methods either check their data for violations of the normality assumption or use methods that compensate for a lack of normality. Authors of only eight articles using latent variable models reported these procedures.
Of the 29 articles that used latent variable models, results (i.e., factor loadings and other estimated parameters) were only reported for 18 articles. Of these, 17 were standardized results and 1 had unstandardized results.
Researchers who use latent variable models also have a wide variety of fit statistics with which they can evaluate how closely their observed data fit their latent variable models (see Sun, 2005, for an excellent introduction to fit statistics). The most common fit statistic reported was the comparative fit index (CFI), which was found in 24 articles. The χ² and root mean square error of approximation (RMSEA) were tied as the second most commonly reported fit statistic, each appearing in 23 articles. Other reported fit statistics were the standardized root mean square residual (SRMR; 12 articles), goodness-of-fit index (GFI; 11 articles), nonnormed fit index (NNFI; 7 articles), expected cross-validation index (ECVI; 5 articles), Tucker–Lewis index (TLI; 4 articles), adjusted goodness-of-fit index (AGFI; 3 articles), normed fit index (NFI; 2 articles), root mean square residual (RMR; 2 articles), the Satorra–Bentler adjusted χ² (2 articles), weighted root mean square residual (WRMR; 2 articles), relative fit index (RFI; 1 article), robust CFI (1 article), and the robust RMSEA (1 article).
Finally, some authors who use latent variable models are interested in modifying their models to better fit their data. Although there are many methods for doing this, the most empirical approach is to use modification indices to guide the decision of where to add paths on the model (Kline, 2005). Modification indices were used in four articles (1.5%) in the 5 years of gifted education research that we investigated.
Nonparametric Statistics
Most of the statistical methods that we have mentioned so far are parametric methods based on assumptions of normality and associated probability distributions. Nonparametric statistics, on the other hand, do not require these assumptions and are appropriate for a wider variety of research situations, despite not being as common as traditional parametric statistics (Agresti, 2007; Heiman, 2012).
The most common nonparametric null hypothesis statistical significance testing method was the Mann–Whitney U test, which was found in 7 articles (2.6%). Other nonparametric tests used in the gifted education research were Fisher’s exact test (4 articles; 1.5%), the Wilcoxon signed ranks test (4 articles; 1.5%), cluster analysis (4 articles; 1.5%), the Kolmogorov–Smirnov test (3 articles; 1.1%), and the Kruskal–Wallis, log likelihood, and Kendall’s W tests, each of which appeared in a single article (0.4%).
Reliability
Recent authoritative publications have dictated that researchers should report the reliability statistics from the data at hand (AERA, 2006; Wilkinson & the Task Force on Statistical Inference, 1999). We found that authors of 144 articles (53.3%) unambiguously reported reliability data from their own sample. By far the most common type of reliability evidence was internal consistency (Cronbach’s alpha or KR20 statistics), which was reported in 112 articles (41.5% of all quantitative or mixed methods articles). The widespread use of Cronbach’s alpha in gifted education research mirrors its prevalence in the general psychological literature (Hogan, Benjamin, & Brezinski, 2000).
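Because Cronbach's alpha depends on the item and total-score variances in the sample at hand, it is straightforward to compute from one's own data rather than borrowed from a test manual. The following is a minimal sketch using the standard variance-based formula; the item responses are hypothetical.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """Cronbach's alpha; item_scores is a list of items, each a list of respondent scores."""
    k = len(item_scores)
    # Total score for each respondent (sum across items).
    totals = [sum(scores) for scores in zip(*item_scores)]
    item_var_sum = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_var_sum / variance(totals))

# Hypothetical responses: 3 items answered by 5 respondents.
items = [
    [2, 4, 3, 5, 1],
    [3, 4, 3, 5, 2],
    [2, 5, 4, 4, 1],
]
alpha = cronbach_alpha(items)
print(round(alpha, 3))
```

A different sample answering the same three items would generally yield a different alpha, which is the core objection to reliability induction.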
Interrater reliability measures were the next most common type of reliability data reported. Correlations between ratings were the most common (17 articles; 6.3%) type of interrater reliability data, with Cohen’s (1968b) kappa (13 articles; 4.8%), percentage of agreement (8 articles; 3.0%), and unknown interrater reliability measures (2 articles; 0.7%) also appearing in the literature.
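Two of the interrater measures listed above, percentage of agreement and kappa, differ in that kappa corrects for the agreement expected by chance. The sketch below computes the unweighted form of kappa for two raters; it is an assumption on our part that this form is adequate for illustration, and the ratings are hypothetical.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Proportion of cases on which two raters assign the same category."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Unweighted kappa: agreement beyond chance, scaled by its maximum."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the product of each rater's marginal proportions.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical category assignments by two raters for six cases.
rater_a = ["quant", "quant", "qual", "mixed", "quant", "qual"]
rater_b = ["quant", "qual", "qual", "mixed", "quant", "qual"]
kappa = cohens_kappa(rater_a, rater_b)
print(round(percent_agreement(rater_a, rater_b), 3), round(kappa, 3))
```

Kappa is lower than raw agreement here because some matches would be expected even if the raters assigned categories at random.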
Other types of reliability data were less commonly reported. Authors of 11 articles (4.1%) reported test–retest reliability coefficients. Split-half correlations were reported for four articles (1.5%), with two of those correlations being corrected by the Spearman–Brown prophecy formula (Brown, 1910; Spearman, 1910). The standard error of measurement was reported in three articles (1.1%). Item response theory–based statistics of reliability were reported in two articles (0.7%), and parallel forms reliability and the conditional standard error of measurement were reported in one article (0.4%) each.
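The Spearman–Brown correction mentioned above adjusts a split-half correlation upward to estimate the reliability of the full-length test. A minimal sketch of the general formula follows; the half-test correlation is a hypothetical value.

```python
def spearman_brown(r, length_factor=2.0):
    """Predicted reliability when a test is lengthened by length_factor."""
    return length_factor * r / (1 + (length_factor - 1) * r)

half_test_r = 0.60  # hypothetical correlation between two test halves
corrected = spearman_brown(half_test_r)
print(round(corrected, 2))
```

With the default length_factor of 2, the function gives the familiar split-half correction 2r / (1 + r).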
Miscellaneous Statistical Procedures
There were a variety of other statistical procedures that did not easily fit into other categories used thus far in this article. The most common miscellaneous statistics reported were intraclass correlations (13 articles; 4.8%), Box’s M (8 articles; 3.0%), Levene’s test of homogeneity of variances (6 articles; 2.2%), replication procedures (4 articles; 1.5%), Bayesian statistics (4 articles; 1.5%), and power analysis (3 articles; 1.1%). The receiver operating characteristic (ROC) curve, Mahalanobis distance, design effect, multicollinearity diagnostics, and Bayesian information criterion (BIC) were each reported in two articles (0.7%). Statistical procedures that were used in only one article are listed in the footnotes of Tables 2 and 3. It is important to note that some of these miscellaneous statistical procedures are used in conjunction with some of the other methods described elsewhere in this article. For example, Levene’s test is often used to examine the homogeneity of variances assumption in an ANOVA or a t test.
p Values
Another miscellaneous reporting standard we examined was the frequency of exact p values being reported. We counted how many articles reported at least one exact p value and how many articles reported at least one inexact p value. Based on guidelines from APA (2010), “exact p values” were defined as p values that were equal to a specific number, and not just a range (e.g., p < .01 was defined as an inexact p). The exceptions to this guideline were those permitted in the APA manual, namely, if p is less than .001, or if in a table it would be inefficient and cumbersome to include exact p values for all reported statistics (such as in a correlation table). These situations were counted as “exact p values.” We found that overall 78.9% of articles reported at least one exact p value, whereas 61.1% of articles reported at least one inexact p value. As these percentages indicate, many authors reported a mix of exact and inexact p values.
We also counted the number of articles stating that p was equal to or less than zero. Reported values of p = .000 or p < .000 are impossible. Our review of gifted education articles indicated that 33 articles (12.2%) reported a p value of zero or less. Notably, between 2006 and 2010, authors of articles in the Journal of Advanced Academics never reported an impossible p value.
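A value printed as .000 is typically a small positive p value that has been rounded to three decimal places. A minimal sketch of a formatting helper that avoids the problem is given below; the function name `format_p` and the choice of three decimal places are assumptions made for illustration.

```python
def format_p(p: float) -> str:
    """Format a p value without ever printing an impossible value of zero."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p value must lie between 0 and 1")
    if p < 0.001:
        # Very small values are reported as a bound rather than rounded to .000.
        return "p < .001"
    # Three decimals, with the leading zero dropped in APA style.
    return "p = " + f"{p:.3f}".lstrip("0")
```

For example, `format_p(0.00004)` yields `p < .001` rather than `p = .000`.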
Effect Sizes
One of the major developments in the past 25 years in educational and psychological research is the increased reporting of effect sizes. The shift in the thinking of psychological researchers about effect sizes can be seen in the changes in APA’s instructions to authors on the topic. In 1994, the fourth edition of APA’s style manual gave an “encouragement” (1994, p. 18) to authors to report effect sizes. Five years later, the members of the APA Task Force on Statistical Inference said that authors should “always present effect sizes for primary outcomes” (Wilkinson & the Task Force on Statistical Inference, 1999, p. 599). By 2001, the fifth edition of the APA manual stated, “For the reader to fully understand the importance of your findings, it is almost always necessary to include some index of effect size” (p. 25, italics added). The current edition of the APA manual (2010) included very similar language to that found in the 2001 manual. AERA (2006) has recently published recommendations to use effect sizes in quantitative research, although the instructions from AERA are not as strongly worded as those from APA.
Table 4 shows that the most commonly reported effect size in the gifted education literature we reviewed was η2, which was found in 86 articles (31.9%). Two other effect sizes were also widely reported: Cohen’s d (75 articles; 27.8%) and r2/R2 (60 articles; 22.2%). No other effect size was reported in more than 10 articles between 2006 and 2010.
Table 4. Effect Sizes Reported in Gifted Education Journals, 2006-2010, Organized by Journal and Year
Note. GCQ = Gifted Child Quarterly; HAS = High Ability Studies; JAA = Journal of Advanced Academics; JEG = Journal for the Education of the Gifted; RR = Roeper Review. The following effect sizes were used only once: nonparametric D (HAS, 2009), Δ (HAS, 2006), ξ2 (RR, 2006), τ (RR, 2006), and the percentage of subjects classified correctly (RR, 2006).
Includes pseudo-R2 values calculated from both logistic regression and hierarchical linear modeling methods.
Overall, 68.1% of all quantitative and mixed methods published articles reported at least one effect size. This is a remarkable increase compared with Matthews et al.’s (2008) review in which they found that effect size reporting did not exceed 60% in any year between 1996 and 2005 for any gifted education journal. At least half of all quantitative and mixed methods articles in all the journals reported effect sizes, and for JAA the proportion exceeded three quarters (76.6%). Moreover, our estimates are rather conservative because we calculated this percentage based on all quantitative and mixed methods articles, whereas Matthews et al. eliminated articles that they decided did not need effect sizes reported. Had we done the same, these percentages would surely have increased.
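For readers less familiar with the three most commonly reported effect sizes (η2, Cohen’s d, and r2), the following sketch shows how each can be computed from raw scores. It uses one common formulation of each statistic (Cohen’s d with a pooled standard deviation, η2 as the ratio of between-groups to total sum of squares); the function names are hypothetical, and other variants exist.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def eta_squared(groups):
    """Proportion of total variability accounted for by group membership."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((score - grand_mean) ** 2 for score in all_scores)
    return ss_between / ss_total


def r_squared(x, y):
    """Squared Pearson correlation between two paired variables."""
    mean_x, mean_y = mean(x), mean(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy ** 2 / (s_xx * s_yy)
```

With the made-up scores `[2, 4, 6]` and `[1, 3, 5]`, `cohens_d` returns 0.5, that is, a difference of half a pooled standard deviation. Whether such a difference matters depends on the research context.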
Discussion
In this article, we examined the statistical methods used in gifted education journals. As Tables 2 and 3 show, classical parametric statistical methods predominate in quantitative and mixed methods articles published in gifted education journals. Overall, this trend is consistent with other recent reviews of statistical methods used in education and psychology journals (e.g., Bangert & Baumberger, 2005; Baumberger & Bangert, 1996; Kieffer et al., 2001). Compared with Skidmore and Thompson’s (2010) aggregate examination of the use of statistical methods in the 1990s, gifted education researchers seem to rely more on t tests and Pearson’s r correlation than researchers in other fields of education. Compared with authors of articles in psychology journals, gifted education authors use multiple regression less often, and ANOVA and t tests more often.
Differences Among Journals
The choice of statistical method seems to be mostly independent of where the article is published. There are a few exceptions to this general trend, though. JAA seems to be the venue of choice for authors who use more advanced statistical methods. For example, multiple regression was used in more than one quarter of JAA articles, whereas HAS, the journal that published articles using multiple regression at the second highest rate, did so in just 17.7% of articles. JAA also published all the articles in gifted education that used logistic regression between 2006 and 2010. However, GCQ also published a large share of articles that used advanced statistical methods, such as HLM and path analysis. It is also important to note that GCQ and RR published noticeably more articles that used a χ2 statistical significance test than the other journals.
Trends in Statistical Methods
The only major trend that we observed and found noteworthy was a decline in the use of t tests in recent years. In 2006 and 2007, t tests appeared in 34.6% and 37.7% of articles, respectively. In 2008, 2009, and 2010, though, t tests were used in 28.6%, 24.6%, and 25.0% of quantitative and mixed methods articles. It would be interesting to follow up on this study in the future and determine whether this trend continues.
We believe that there were two problems with identifying trends in the use of statistical methods. First, because we examined only 5 years of published articles, it is possible that not enough time has passed for identifiable trends to emerge. Second, very few articles used advanced and nonparametric methods, so even minor fluctuations produced apparently large changes in their frequency of use. For example, HLM was used in a total of eight articles between 2006 and 2010; none of those articles were published in 2007 and half of those articles were published in 2010. It is not yet clear whether HLM was more frequently used in later years because it is actually becoming more common, or whether the relatively rare use of HLM means that from time to time a year will occur when a slightly higher number of HLM articles are published.
Other Findings
As we conducted this study, we noticed several tendencies among the researchers who produced quantitative articles in gifted education. One of the most noticeable was how they report effect sizes. Although we are impressed that more gifted education researchers are reporting effect sizes, we question whether researchers understand the effect sizes they are reporting. Dozens of authors, for example, merely cited Cohen’s (1988) benchmarks and labeled their effect sizes as “large,” “medium,” or “small” and provided no further interpretation. However, Cohen himself stated several times in his landmark 1988 book (e.g., pp. 12, 25, 79, 113, 147, 184, 224, 285, 413, 478, 532) that his benchmarks should only be used when there was little or no previous research on the topic at hand and should not apply to all social science research situations. Moreover, the “large” effects can be trivial and “small” effects can be critically important (Gentry & Peters, 2009; Thompson, 2006). Durlak (2009) elaborated on these points, saying:
Now that thousands of studies and meta-analyses have been conducted in the social sciences, Cohen’s (1988) general conventions do not automatically apply. Moreover, assuming that “large” effects are always more important than “small” or “medium” ones is unjustified. It is not only the magnitude of effect that is important, but also its practical or clinical value that must be considered. (pp. 922-923)
In light of these ideas, we urge researchers in gifted education to stop using Cohen’s benchmarks, unless there is no previous research on their topic, and to instead provide interpretations and judgments of the meaning of their effect sizes. This will bring researchers’ practices in line with modern quantitative psychologists’ recommendations for interpreting effect sizes (e.g., Cooper, 2010; Durlak, 2009; Gentry & Peters, 2009; Thompson, 2006).
We are also concerned with the post hoc tests that gifted education researchers use after determining that their ANOVA or MANOVA tests are statistically significant. Despite how common the -OVA methods are in gifted education research, we have serious doubts about whether many researchers understand how to interpret such statistical tests, especially MANOVA and MANCOVA. General practice is to follow a statistically significant ANOVA or ANCOVA with a specifically designed post hoc test, such as Tukey’s or Scheffé’s test (Heiman, 2012; Thompson, 2006). Yet in nine articles authors followed their statistically significant ANOVA or ANCOVA with t tests, often with an accompanying Bonferroni correction. We believe that conducting a series of post hoc t tests defeats the purpose of the original ANOVA. Similarly, following a MANOVA or MANCOVA with a post hoc ANOVA (the most common post hoc technique in the articles we examined) instead of the recommended discriminant analysis (e.g., Bray & Maxwell, 1985; Thompson, Diamond, McWilliam, Snyder, & Snyder, 2005) often sidesteps the issue of whether the groups are equal on combinations of dependent variables, which is the main purpose of conducting a MANOVA in the first place.
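The arithmetic behind the Bonferroni correction mentioned above can be sketched as follows. The function names are hypothetical, and the familywise error formula assumes independent tests, which is a simplification because pairwise comparisons among the same groups share data.

```python
from math import comb


def pairwise_comparisons(k_groups: int) -> int:
    """Number of distinct pairs of groups that could be compared."""
    return comb(k_groups, 2)


def familywise_error(alpha: float, n_tests: int) -> float:
    """Chance of at least one Type I error across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test alpha after dividing the nominal alpha by the number of tests."""
    return alpha / n_tests
```

With four groups there are six possible pairwise t tests. At a nominal alpha of .05 and under the independence assumption, the familywise error rate is about .26, and the Bonferroni-adjusted per-test alpha is about .0083.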
We also believe that some gifted education researchers have misconceptions about p values obtained from null hypothesis statistical significance tests. As stated above, p values equal to 0 or less were reported in 12.2% of articles. However, this is an impossible statistical result because p is a probability value, which must always be positive (hence, p being less than 0 is impossible). Moreover, a p value of zero would imply that the sample does not belong to the population to which the researcher wishes to generalize, and therefore, the statistical test should not be conducted at all. We think that there is no excuse for such a basic error in reporting quantitative results. On the other hand, we applaud the editors of JAA for their vigilance in preventing a single impossible p value from being reported.
The results of our analysis of the decisions made by authors who used data reduction methods also give us pause. The review indicated that the varimax rotation method was used most frequently in gifted education articles and that the most common factor retention method named was the Guttman (1954) rule, under which all factors with an eigenvalue greater than one are retained. Although varimax rotation has its place, it requires factors to be completely uncorrelated, an unrealistic assumption in many situations in education and psychology. Also, the Guttman rule has long been established as one of the least accurate guides for determining how many factors to retain in an EFA or PCA (e.g., Hakstian, Rogers, & Cattell, 1982; Velicer, Eaton, & Fava, 2000; Zwick & Velicer, 1982, 1986). We believe that the fact that both these methods are the default settings in SPSS is not a coincidence and may indicate that some researchers who use data reduction methods like EFA and PCA do not understand how to make the subjective decisions required to properly conduct these statistical analyses.
Finally, we were disappointed in the appearance of six articles with stepwise regression in the gifted education literature. Stepwise regression is a notoriously poor statistical method with pervasive, intrinsic problems (e.g., Huberty, 1989; Thompson, 1995, 2001, 2006). In fact, stepwise regression is so problematic that there is even one published article titled, “Why Won’t Stepwise Methods Die?” (Thompson, 1989). Stepwise methods are problematic for three main reasons: they use the wrong degrees of freedom when calculating the observed F statistic (and therefore artificially drive down p values), capitalize on sampling error, and do not produce the best R2 possible (Thompson, 1995). We strongly urge researchers in gifted education to completely abandon stepwise regression.
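The degrees-of-freedom problem can be illustrated with a short sketch. It assumes that the appropriate correction is to charge the model for every candidate predictor that was examined, rather than only for the predictors that were finally entered; this is one reading of the argument in Thompson (1995), not a procedure taken from that source. The function name and all numbers are hypothetical.

```python
def f_observed(r_squared: float, n: int, df_model: int) -> float:
    """Observed F for a regression model with the given model df."""
    df_error = n - df_model - 1
    if df_error <= 0:
        raise ValueError("sample size is too small for this many predictors")
    return (r_squared / df_model) / ((1 - r_squared) / df_error)


# Hypothetical study: 50 cases, 15 candidate predictors, 3 entered, R2 = .20.
f_entered = f_observed(0.20, n=50, df_model=3)      # df charged for 3 predictors
f_candidates = f_observed(0.20, n=50, df_model=15)  # df charged for all 15
```

In this made-up case the observed F falls from about 3.83 to about 0.57 when all candidate predictors are counted, which shows how the smaller degrees of freedom can make a result look more statistically significant than it is.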
Implications
We believe that this summary of statistical methods used in gifted education articles has several important implications for the field. First, this article can give guidance to those who train the next generation of gifted education researchers and practitioners in choosing which statistical methods their graduate students should master to understand the field’s literature. After conducting this research study, we believe that a professional in the field should be able to master classical parametric methods (i.e., t tests, ANOVA, correlation, regression) and basic multivariate methods—especially MANOVA, which was found in almost one quarter of articles. Based on our experience and education, this sequence of statistics would likely take three semesters, or about a year and a half. However, it has been found that in psychology, only doctoral students in quantitative psychology are typically required to take this many statistics courses (Aiken, West, & Millsap, 2008). Therefore, the typical graduate of a doctoral program in psychology is not able to comprehend or evaluate large portions of the quantitative research that is being published in leading gifted education journals. Although we were not able to find similar information for education or gifted education programs, we doubt that three semesters of statistics is a common requirement for education doctoral students.
Another implication that arises from this article concerns the editorial policies of the journals we examined. To our knowledge, only JAA requires authors to report effect sizes where appropriate; it is also likely an editorial policy at JAA that p values never be reported as 0 or as being less than 0. We suggest that the editors of the other journals that we reviewed adopt these policies and that all journal editors in gifted education examine their statistical editorial policies to encourage better statistical reporting practices. We also suggest that journal editors be savvy in a wide variety of statistics so that they can fulfill their role as gatekeepers of the published literature and determine whether statistical methods are being used and reported correctly.
We also believe that this article raises further questions about where research in gifted education is headed. Although there are no clear trends in the use rate of multivariate statistics, we believe that the notable percentages of articles using advanced statistical methods like SEM, HLM, and others indicate that at least a portion of the research community is embracing these methods. We believe that this is a positive development because there are some research questions that can only be answered through the use of complex methods. Although there may be a certain elegance in a correlation or a t test of two group means, reality is messy and complex, and understanding it sometimes requires statistical methods that are similarly complex. We hope that more researchers will embrace complex statistical methods so that the psychological and social phenomena that accompany giftedness may be better understood in the future.
Limitations
As with all research, there are some limitations to our statistical review of the literature. First, despite our efforts to give our coding system as much integrity and quality as possible, it was inevitable that the real world of published literature would not conform seamlessly to our coding system. This unavoidably led to some subjective judgment in the classification of articles and in interpreting the authors’ descriptions of their methodological and statistical procedures. For example, we were often forced to make a subjective judgment about whether the authors were reporting the reliability information for their own data. Although we consulted test manuals and previous research that used the same psychometric instrument, it was sometimes necessary to decide whether it was likely or probable that the authors were reporting their own reliability data. We believe that having two different people read each article and requiring agreement on all aspects of the coding justifies our subjective judgments.
Second, if other researchers were to conduct a similar review of statistical methods, they would likely produce somewhat different results. Indeed, Skidmore and Thompson (2010) found that occasionally in other reviews of statistical methods the same volumes of journals would be analyzed by different authors who produced different results. Nevertheless, we believe that having two coders improved our results and that any other analyses of the same volumes of the journals would produce broadly similar findings.
An anonymous reviewer brought to our attention a final limitation of this study, which stems from the fact that statistical methods are merely tools in the research process. With rare exceptions (e.g., stepwise regression), we do not believe that any statistical methods are good or bad. Rather, statistical methods should be used if they are appropriate for the research question at hand and if the data meet the required assumptions. This investigation of statistical methods was inherently acontextual and we made few judgments about whether these methods were being used appropriately. As stated earlier, complex statistical methods are sometimes necessary to answer complex research questions; similarly, sometimes even the simplest statistical methods may be the best option for testing theories or hypotheses.
Conclusion
Overall, this review of the literature indicates that gifted education researchers share some similarities in their use of statistical methods with researchers in other areas of psychology and education. Broadly speaking, gifted education researchers favor classical parametric methods over other statistical tools for analyzing their data. However, a notably large portion of articles used multivariate methods, especially MANOVA. We hope that our findings and our discussion of the results will spur a methodological conversation among quantitative researchers that will help researchers conduct better research and gain greater insight into giftedness.
Footnotes
The second, third, and fourth authors all contributed to the study equally; they are listed alphabetically.
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article:
Partial funding for this research was provided by the Lohman/Heep Fellowship at Texas A&M University.
