Abstract
Eta-squared (η2) and partial eta-squared (ηp2) are effect sizes that express the amount of variance accounted for by one or more independent variables. These indices are generally used in conjunction with ANOVA, the most commonly used statistical test in second language (L2) research (Plonsky, 2013). Consequently, it is critical that these effect sizes are applied and interpreted appropriately. The present study examined the use of these two effect sizes in L2 research. We begin by outlining the statistical and conceptual foundation of and distinction between η2 and ηp2. We then review the use of these indices in a sample of published L2 research (K = 156). Among other results, we show that ηp2 values are frequently being mislabeled as η2. We interpret and discuss potential causes and consequences related to the confusion surrounding these related but distinct indices. Within the context of reform efforts in quantitative L2 research, the current study seeks to respond to the recent, pointed calls for improving study quality (Plonsky, 2014) and statistical literacy (Loewen et al., 2014) in the field.
I Introduction
It has been almost three decades since Cohen (1988) wisely noted that ‘a moment’s thought suggests that it [effect size] is, after all, what science is all about’ (p. 532). With this position in mind, some have gone as far as to argue that failing to appropriately report estimates of effect sizes amounts to ‘a kind of withholding of evidence’ (Grissom and Kim, 2012: 9). In the case of second language (L2) research, however, effect sizes are still a relatively novel concept. Historically, the field has relied very heavily on statistical significance and null hypothesis significance testing (p values; see Norris, 2015; Plonsky, 2015). It is only in the last decade or so that we have seen a shift in favor of effect sizes and practical significance, which can be attributed both to influential advocates (e.g. Norris and Ortega, 2000; Plonsky and Oswald, 2014) and journal editors. We know of at least eight L2 journals that now require effect sizes to be included in reports of quantitative research: Foreign Language Annals, Language Learning, Language Learning and Technology, Language Testing, Modern Language Journal, Second Language Research, Studies in Second Language Acquisition, and TESOL Quarterly.
Two of the most commonly employed effect sizes are eta-squared (η2) and partial eta-squared (ηp2), which are used in conjunction with ANOVA and its variants. We have chosen, therefore, to examine these two effect sizes in terms of how they are reported and interpreted in L2 research. We are concerned that, in an era of point-and-click analyses (see discussion in Mizumoto and Plonsky, 2016), choices regarding effect sizes and other statistical results may be made based on program defaults rather than on an accurate understanding of the data. This is particularly likely to occur in the case of effect sizes, which, despite their presence in published L2 research, are not generally well understood. An overreliance on statistical packages, together with the relative lack of detailed knowledge about effect sizes, carries the risk of erroneous reporting, mislabeling, and faulty interpretations.
The present study builds on the momentum surrounding methodological reform in applied linguistics, including concerns expressed in recent years over, for example, study quality (Plonsky, 2013, 2014) and statistical literacy (Loewen et al., 2014). Such discourse responds to repeated calls for examining how ‘APPROPRIATELY’ (e.g. Lazaraton, 2009: 415; emphasis in original) different statistical concepts are being employed.
II Eta squared and partial eta squared as applied in ANOVA models
Although L2 researchers often report effect sizes such as eta-squared (η2), such values are rarely accompanied by much in the way of interpretation (Plonsky and Oswald, 2014). One reason for this is that there appears to be a good deal of confusion surrounding the terminology of ‘proportion of variance’ (Grissom and Kim, 2012: 181) effect sizes. Therefore, for the purposes of clarity, some conceptual explanation of what these indices express is warranted.
Consider, for example, a study wherein the researcher was interested in analysing the effect of an experimental treatment across conditions (e.g. condition 1, condition 2, control). Conceptually, the focus of analysis is on group differences as regards the dependent variable (usually a measure of L2 knowledge or learning). Statistically speaking, scores on the dependent variable also contain the amount and source of variance caused by treatment effects (Thompson, 2006).
Proportion of variance effect sizes in the η2 family partition the amount of total variation in the dependent variable (DV; e.g. knowledge as measured) to determine how much of the variation is separately accounted for or explained by each independent variable (i.e. explained sum of squares or SOS). Also taken into account by the η2 family is how much of the DV variation is left unexplained (i.e. unexplained or error SOS). Thus, total variation in the DV can be described in terms of explained and unexplained variance.
We suspect that, despite the now-widespread reporting of eta-squared (η2), many L2 researchers may not be aware of the differences among its variants, most notably partial eta-squared (ηp2). Further complicating this matter is the mislabeling of η2 and ηp2 by certain early versions of SPSS, the most frequently used statistical software package in L2 research (Loewen et al., 2014). The likelihood of this error in the context of L2 research is supported by evidence presented in other fields. Levine and Hullett (2002) and Pierce et al. (2004) found widespread misreporting and misinterpretation of η2 and ηp2 in published studies in communication and psychology, respectively. Both studies also cite the mislabeling of ηp2 values as η2 in early versions of SPSS.
In the two sections that follow, we provide a brief overview of these two important effect size indices. We also illustrate the points being made with heuristic examples.
1 Classical eta-squared
Imagine an intervention study in which four treatment conditions are compared on a single dependent variable. To examine the relationship of interest here, we would likely use a one-way ANOVA. The effect size in this case, η2 (also called the squared correlation ratio), is computed using Kerlinger’s (1964) classical formula (p. 203) as:
$$\eta^2 = \frac{SOS_{A}}{SOS_{TOTAL}} \tag{1}$$
Note that in one-way designs, there is only one independent source (SOSA; treatment) of variance to explain some portion of the total variation in the dependent variable (SOSTOTAL; L2 knowledge). The numerator of the effect size estimator then represents variability that is attributable to the only independent variable we have (e.g. treatment condition). Therefore, an η2 of, say, 0.35 (or 35%), indicates that we can account for 35% of the total variation in L2 knowledge as measured. The rest of the total SOS remains unexplained (i.e. SOSError = 65%), and may be due to individual differences, measurement error, or any number of other factors.
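To make this partition concrete, the following minimal Python sketch computes classical η2 for a one-way design directly from raw scores. The scores and the function name `eta_squared_one_way` are hypothetical and were invented purely for illustration; they do not come from any study discussed here.

```python
from statistics import mean

def eta_squared_one_way(groups):
    """Classical eta-squared for a one-way design: SOS_A / SOS_TOTAL."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)
    # Explained (between-groups) sum of squares, SOS_A
    sos_a = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Total sum of squares, SOS_TOTAL
    sos_total = sum((score - grand_mean) ** 2 for score in all_scores)
    return sos_a / sos_total

# Hypothetical scores for four treatment conditions (illustration only)
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10], [5, 6, 7]]
print(round(eta_squared_one_way(groups), 3))
```

Whatever share of the total is not returned by this function corresponds to the unexplained (error) sum of squares.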
2 Partial eta-squared
One-way designs can certainly be found in L2 research. However, designs with multiple independent variables are likely much more common due to the multivariate nature of L2 learning, knowledge, use, and so forth (Brown, 2015). In such cases, the conceptual approach embodied by η2 can be extended to apply to multi-way or factorial ANOVA. However, we now may have multiple sources of independent effects leading to a distinction that must be drawn between the classical η2 and partial η2 (e.g. Bakeman, 2005; Richardson, 2011).
Building on the example from above, imagine a 3 × 4 design in which proficiency level (with 3 levels) and treatment condition (with 4 levels) are jointly examined to explain the variation in learners’ (N = 120) scores on a subsequent grammar test (i.e. dependent variable). Table 1 presents the hypothetical results for this two-way design.
Table 1. Hypothetical results of a fixed-effects 3 × 4 ANOVA (n = 120).
Note. Eta-squared values and their corresponding partial eta-squared values appear in bold. Inflation% = (ηp2 – η2) / η2 × 100, and this shows how different ηp2 and η2 can be in this two-way design.
If we want to quantify any of the independent variables’ contributions to the variation observed in post-test scores, we can do so by invoking the classical η2 in each case. But a different form of η2 may be computed as well. Cohen (1965) implicitly introduced a new variant of η2 (now often denoted by ηp2) in multi-way designs which was similar to the classical η2 formula with ‘other nonerror sources of variance being partialled out [from the denominator]’ (p. 105). Later, Cohen (1973) emphasized that this new variant is distinct from the classical η2 and may be called ‘partial η2’ (p. 108; italics in original). Thus, in multi-way designs, the term partial refers to removing all other possible sources of effect in the design except the one of interest in the denominator of equation (1) and the error/unexplained variance.
In our two-way design, which includes two main effects and one interaction effect, partial eta-squared (ηp2) for treatment condition (A) can be computed as:
$$\eta_p^2 = \frac{SOS_{A}}{SOS_{A} + SOS_{Error}} \tag{2}$$
Thus, ηp2 = 80 / (80 + 50) = .62 [90% CI: .494, .659]. Likewise, ηp2 for the effect of proficiency level (SOSB) can be computed in a similar fashion with other independent sources (i.e. treatment, and the treatment × proficiency interaction) removed from the denominator:
$$\eta_p^2 = \frac{SOS_{B}}{SOS_{B} + SOS_{Error}} \tag{3}$$
Therefore, for proficiency, ηp2 = 70 / (70 + 50) = .58 [90% CI: .458, .632]. And for the interaction effect (SOSA*B), we will have:
$$\eta_p^2 = \frac{SOS_{A*B}}{SOS_{A*B} + SOS_{Error}} \tag{4}$$
Thus, regarding the interaction effect, ηp2 = 5 / (5 + 50) = .09 [90% CI: .000, .133]. Note that because in one-way designs there is only one source of effect, no difference in the denominator of the classical and partial eta-squared formulas exists. In other words, because there are no other effects to be partialled out, eta-squared and partial eta squared are identical in one-way designs. However, as shown in Table 1, for our two-way design, ηp2 values are invariably larger – often much larger – than their η2 counterparts. This occurs because the partial η2 formula is partialling out the other nonerror terms (i.e. proficiency: SOSB and proficiency × treatment: SOSA*B) from the denominator for each effect, thus augmenting the outcome (see Grissom and Kim, 2012; Pedhazur, 1997). It is therefore critical that care be taken to report and interpret these indices appropriately.
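The contrast between the two indices can be sketched in a few lines of Python using the sums of squares from the worked example (80, 70, 5, and 50). One assumption is made here and should be read as such: because the design is described as balanced and fixed-effects, the total sum of squares is taken to be the sum of the three effect terms and the error term. The function names are hypothetical.

```python
# Sums of squares from the worked example in the text
sos = {"treatment": 80.0, "proficiency": 70.0, "interaction": 5.0}
sos_error = 50.0
# Assumption: in a balanced fixed-effects design the total is the sum of all terms
sos_total = sum(sos.values()) + sos_error

def eta_sq(effect):
    """Classical eta-squared: every effect shares the same denominator."""
    return sos[effect] / sos_total

def partial_eta_sq(effect):
    """Partial eta-squared: other nonerror terms are removed from the denominator."""
    return sos[effect] / (sos[effect] + sos_error)

for effect in sos:
    e, p = eta_sq(effect), partial_eta_sq(effect)
    inflation = (p - e) / e * 100  # Inflation% as defined in the note to Table 1
    print(f"{effect:12s} eta2={e:.2f}  partial_eta2={p:.2f}  inflation={inflation:.0f}%")
```

Under this assumption the classical values share one denominator and can be compared across effects, whereas each partial value has its own denominator.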
III Assumptions and rationale of the present study
Having laid out the conceptual and statistical reasoning behind η2 and ηp2, the present study seeks to examine the use and interpretation of these two indices. The study is motivated by several factors that, in coordination, may create conditions that are counterproductive for the field’s progress. First, although effect sizes are regularly reported, they are not often interpreted and even less often are they interpreted meaningfully (see Larson-Hall and Plonsky, 2015; Plonsky and Oswald, 2014). Second, ANOVA designs are exceedingly common and therefore highly influential in L2 research. The family of effect sizes for this set of techniques is particularly prone to error, however, due to very similar and often ambiguous or even misleading labels, as described in Section II. This problem, observed in other social sciences, is only compounded by a lack of general statistical literacy in the field (Loewen et al., 2014). With these issues in mind, we anticipate that erroneous reporting of these frequently used effect sizes is likely to occur in L2 research. Therefore, in this study we examine the use of η2 and ηp2 as a means to improve future research practices in the field. Specifically, the present study sought to answer the following question: To what extent does published L2 research demonstrate erroneous reporting of ηp2 as representing η2?
IV Method
1 Journal selection and search criteria
In order to collect a representative sample of L2 research, we first consulted previous surveys of L2 research practices (e.g. Egbert, 2007; Gass, 2009; Lazaraton, 2005; Plonsky, 2013) as well as L2 research methods textbooks providing various L2 journals’ descriptions (e.g. Perry, 2011: Appendix C) and other documents discussing L2 journals (VanPatten and Williams, 2002). There is, of course, no consensus as to which journals are most prominent or influential in the field. In the end, we decided to survey the following five journals: Applied Linguistics, Language Learning, Language Teaching Research, Modern Language Journal, and System. This sample is by no means exhaustive, but we would argue that it does provide a generally representative view of quantitative L2 research.
In order to gain a current view of this domain, we limited our search to studies published from 2005 to 2015. In line with previous reviews (e.g. Gass, 2009), we excluded from consideration forums, short reports, commentaries, review articles, and book reviews. We then examined all studies that included variants of multi-way ANOVA (repeated measures, factorial, ANCOVA; henceforth, multi-way ANOVA studies). The total sample consisted of 156 studies. We focused on multi-way designs because, as discussed in Section II, η2 and ηp2 lead to different results in these studies. Thus, in these designs, mistakenly reporting ηp2 as η2 presents a distorted view of the results. Figure 1 shows the distribution of the sampled studies across the period 2005 through 2015.

Figure 1. Distribution of multi-way ANOVA studies over time.
2 Procedures and analyses
In order to address our research question, following best practices in synthetic research (see Plonsky and Oswald, 2015), each study in the sample was systematically coded for the design type (repeated measures, factorial, ANCOVA), model (fixed-, random-, mixed-effects), and sampling unit distribution (balanced, unbalanced) applied. We also extracted from each study F values, degrees of freedom, and descriptive statistics (Mean, SD). We then conducted secondary analyses by using any or a combination of the following three methods, as appropriate.
First, in line with previous studies that have examined the reporting and interpreting of η2 effect sizes (Levine and Hullett, 2002; Pierce et al., 2004), we computed the sum of the η2 values for every multi-way design in papers that reported them (i.e. ∑η2max limit check). When the sum for a multi-way design exceeded 1 (or equivalently 100%), the values were assumed to be representing instances of ηp2 labeled erroneously as η2. This method was applied to all 156 multi-way studies we collected. Using this technique, we found 17 studies with this type of erroneous reporting.
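The logic of this first check can be sketched as follows. The function name `exceeds_unity` and the small rounding tolerance are assumptions added for illustration and are not part of the procedure described above; the example values are the rounded indices from the hypothetical two-way design in Section II.

```python
def exceeds_unity(values_labeled_eta_sq, tolerance=0.005):
    """Return True when values labeled eta-squared sum to more than 1.

    Classical eta-squared values from one design share a denominator, so
    their sum cannot exceed 1. The tolerance (an assumption) allows for
    rounding in published reports.
    """
    return sum(values_labeled_eta_sq) > 1 + tolerance

# Rounded partial eta-squared values from the hypothetical 3 x 4 design
print(exceeds_unity([0.62, 0.58, 0.09]))
# Rounded classical eta-squared values for the same design
print(exceeds_unity([0.39, 0.34, 0.02]))
```

The check is sufficient but not necessary: a sum below 1 does not rule out mislabeling, which is why the second and third methods were also applied.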
Second, we applied Cohen’s (1973) partial eta-squared meta-analytic equation, which is computed as:
$$\eta_p^2 = \frac{F \times df_{effect}}{F \times df_{effect} + df_{error}} \tag{5}$$
Equation 5 was used to evaluate whether the values reported and labeled as η2 were in reality ηp2. Being ‘purely algebraic [i.e. insensitive to the design type and model]’ (Cohen, 1973: 107), this equation was applied to all design types (e.g. repeated measures, ANCOVA) and models of multi-way analysis (i.e. fixed-, mixed-, and random-effects). If the results of the manual calculations matched (within rounding error) those in the primary published report, we concluded that ηp2 values were mistakenly presented as η2. Also, when possible (i.e. when the design was fixed-effects with all relevant error and effect terms reported), we used Haase’s (1983) meta-analytic equation, which for effect A in a two-way design is computed as:
$$\eta^2_{A} = \frac{F_{A} \times df_{A}}{F_{A} \times df_{A} + F_{B} \times df_{B} + F_{A*B} \times df_{A*B} + df_{error}} \tag{6}$$
Equation 6 was used to correctly compute η2 values in fixed-effects multi-way designs. For the second method, when no match was found between our calculation of η2 or ηp2 and those reported in the original paper, the analysis in question was excluded from our study. The second method was also uniformly applied to all 156 multi-way studies and resulted in the identification of an additional 12 studies with erroneous reporting of η2, which also confirmed and extended the results of method 1.
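The second method can be illustrated with a short Python sketch that recovers both indices from F values and degrees of freedom alone. Several elements are assumptions made for illustration: the function names are hypothetical, the two-decimal matching rule is one possible reading of ‘within rounding error’, and the F values and degrees of freedom were derived by us from the sums of squares of the hypothetical 3 × 4 design in Section II (N = 120, 12 cells, hence 108 error degrees of freedom) rather than taken from any published table.

```python
def partial_eta_sq_from_f(f, df_effect, df_error):
    """Partial eta-squared recovered from F and degrees of freedom."""
    return (f * df_effect) / (f * df_effect + df_error)

def eta_sq_from_f(target, effects, df_error):
    """Classical eta-squared from F and df; fixed-effects designs only.

    `effects` maps each effect name to (F, df) and must list every
    effect in the design.
    """
    explained = {name: f * df for name, (f, df) in effects.items()}
    return explained[target] / (sum(explained.values()) + df_error)

def matches_reported(computed, reported, decimals=2):
    """Assumed matching rule: agreement after rounding to `decimals` places."""
    return round(computed, decimals) == round(reported, decimals)

def likely_mislabeled(reported, target, effects, df_error):
    """True when a value labeled eta-squared matches only the partial index."""
    f, df_effect = effects[target]
    partial = partial_eta_sq_from_f(f, df_effect, df_error)
    classical = eta_sq_from_f(target, effects, df_error)
    return matches_reported(partial, reported) and not matches_reported(classical, reported)

# Derived (not published) F values and df for the hypothetical 3 x 4 design
effects = {"treatment": (57.6, 3), "proficiency": (75.6, 2), "interaction": (1.8, 6)}
print(likely_mislabeled(0.62, "treatment", effects, df_error=108))
```

A reported value that matches neither computed index would, as described above, lead to the analysis being excluded rather than classified.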
A third method for identifying erroneous reporting was applied when full summary tables (i.e. with all sum of squares, dfs, F values made available) were reported. Using these data, we separately computed the η2 and ηp2 values to compare them with the values appearing in the original published studies. The third method led to the identification of 5 additional studies which had inaccurately presented ηp2 effect size as representing η2.
V Results
The present study was intended to determine the extent of erroneous reporting of ηp2 as representing η2 in quantitative L2 research published between 2005 and 2015. Previous studies in other fields (Levine and Hullett, 2002; Pierce et al., 2004) were only able to show that ηp2 values were mistakenly reported as η2 if the sum of η2 values in a multi-way design exceeded 1 or 100% (method 1). As noted in Section IV, we have sought to gain a more comprehensive view of this practice by employing additional equations (method 2) and in some cases directly computing ηp2 and η2 effect sizes from summary tables (method 3).
Table 2 presents the sum of η2 values (i.e. ∑η2; method 1) for the studies in our sample along with relevant data retrieved from these studies, and other features specific to the methods used for our secondary calculations. All 34 studies in Table 2, which we have anonymized, have incorrectly reported ηp2 as representing η2 values.
Table 2. Summary of 34 studies erroneously reporting ηp2 as representing η2.
Notes. ‘n.r.’ = not reported; ‘n.a.’ = not applicable; a Not applicable: either a summary table was presented (method 3) or the outcome of equation 5 matched that in the original report (method 2). b RM = Repeated measures. c One-way ANCOVA’s summary table terms are algebraically similar to those of two-way ANOVA. d GLM = General linear model.
As can be seen in Table 2, mistakenly reporting partial eta squared as representing eta squared is not uncommon in published quantitative L2 research. More precisely, this error occurred in 34 of the 156 studies in our sample, or 22%. Figure 2 shows the breakdown of the misreported studies in the 156 multi-way ANOVA studies published between 2005 and 2015. In Figure 2, the proportion of studies that misreported eta squared relative to the total multi-way ANOVAs in each year is represented by the cross-hatched columns. One important observation is that inaccurately presenting partial eta squared as representing eta squared is still present in recent L2 research. This might be due to the fact that multi-way ANOVA and its variants are frequently and increasingly employed to answer different substantive questions in L2 research (Plonsky, 2014).

Figure 2. Multi-way ANOVA studies that presented partial eta squared as representing eta squared (cross-hatched bars).
VI Discussion
The confusion between η2 and ηp2, found in the present study to be fairly widespread in quantitative L2 research, can lead to at least four problems that affect interpretations of findings to varying degrees. Some actual examples here may convey the general tenor of these problems.
First, while reporting ηp2 values in place of η2 values does not change the rank ordering of effects within a single study (see Table 1 for example), ηp2 and η2 values use different denominators in their formulas. Put succinctly, the base of each ηp2 value differs in nature from others in the same design because ηp2 values do not share a common base (i.e. denominator). Therefore, cross-effect comparison of ηp2 values is not meaningful (see Olejnik and Algina, 2000: 268; Pedhazur, 1997: 507). Interpretations are especially problematic, however, when ηp2 values (often expressed in percentages), either correctly labeled as ηp2 or erroneously presented as η2, are used to indicate that they have explained a certain amount of total variation in the dependent variable as exemplified in Table 3.
Table 3. Misinterpreting ηp2 as proportion of total variance.
Second, η2 values are often upwardly biased (an issue not discussed here, but see Grissom and Kim, 2012), and particularly so when the effects are based on small samples, which is often the case in L2 research (Plonsky, 2013). Therefore, erroneously reporting ηp2 as η2 in a multi-way study can inflate an already-biased η2 effect size even further. Pedhazur (1997) warned that ‘[b]ecause partial η2 tends to be larger than η2, I am afraid that novices will be [more] inclined to use it’ (p. 509). Thus, it is critical not to look at the effect sizes in a single study ‘from a bigger-is-better standpoint’ (Bakeman, 2005: 380). We were able to estimate the inflation due to mistaken reporting in some of the studies by using summary table information or by applying equation 5 for ηp2 and equation 6, when applicable, for η2. The inflation percentage may vary from one study to the next. Table 4 presents some examples along with the amount of inflation observed.
Table 4. Inflation resulting from mistakenly presenting ηp2 for η2.
Notes. a Inflation% = (ηp2 – η2) / ηp2 × 100; b As reported by the original authors to 3 decimal places.
Third, for one-way ANOVA designs, reporting ηp2 instead of η2 seems to be following a logic of redundancy. And perhaps it further ‘indicates that researchers are either very passionate about unnecessary subscript letters [i.e. p for partial], or rely too much on the effect sizes as they are provided by statistical software packages’ (Lakens, 2013: 1). Curiously, we also found several instances of reporting ηp2 values in one-way ANOVA. As discussed, the case of presenting ηp2 in one-way ANOVA represents at least the nominal confusion of regarding ηp2 as η2; in one-way designs there are no other effects to be partialled out.
Fourth, Cohen’s (1988) benchmarks for interpreting effect sizes are arbitrary and should not be applied in L2 research or elsewhere (Cohen, 1988). Even so, Cohen’s (1988: 283) frequently used proportion of variance effect size cut-off points (i.e. small = .0099; medium = .0588; large = .1379) may only relate to ‘partial eta squared’ values and not to those of ‘eta squared’ in multi-way designs (see Richardson, 2011). Thus, employing Cohen’s benchmarks (error 1) and erroneously applying them to eta squared (error 2) creates a ‘double-error’ situation. For example, if η2 is erroneously chosen to be benchmarked against Cohen’s (1988) cut-offs, one may interpret the magnitude of a given effect as ‘small’. However, for the same effect, if ‘ηp2’ is compared against Cohen’s benchmarks, it may be interpreted as ‘large’.
It is useful at this point to recall that effect sizes are descriptive statistics that leave decisions about the importance of an observed effect to the community of researchers in any specific domain given (1) their understanding of the phenomenon they study, (2) prior studies in the same domain, (3) the predictions of theory, (4) practical implications, if any, (5) the design and instrumentation from which the effect was derived, and so forth. Looking ahead, we recommend that Cohen’s conventions be dropped in favor of researchers’ direct and explicit comparison of the effects in related literature as well as these and other considerations (see Plonsky and Oswald, 2014; Thompson, 2006).
The findings of this study, which reveal somewhat widespread misuse of a common statistic, prompt us to consider why this problem exists (and persists). One explanation might be the lack of appropriate reference material. Judging from the 14 texts on L2 research methods at our immediate disposal, the materials available to L2 researchers do not appear to address adequately the distinction between η2 and ηp2. For example, Larson-Hall (2012), a brief and generally quite useful chapter-length overview of statistics used in L2 research, briefly commented that ‘Effect sizes for ANOVA results are also of the same type as the correlation but use the Greek letter eta (η) and are called eta-squared or partial eta-squared’ (p. 249; italics added). However, no clear distinction is made between η2 and ηp2. In Phakiti (2014), another generally strong reference, no clear distinction between the use of eta-squared and partial eta-squared is made (see p. 205 and pp. 283–300).
Other L2 research methods textbooks that we reviewed likewise lacked sufficient discussion of the difference between the η2 and ηp2 effect size measures. Dörnyei (2007), for example, first provided a brief account of eta squared followed by presenting the formula for computing it. However, the only passing reference to ηp2 was made later in the context of ANCOVA: ‘The good news about SPSS output is that next to the significance value we find the “partial eta squared” index, which is an effect size’ (p. 223; italics added). A discussion on η2 and ηp2, however, did appear in Larson-Hall (2016) where a number of the same considerations addressed here as regards these two variants were usefully and clearly explained (see p. 149).
We suggest that future texts that discuss ANOVA explain all the terms that appear in a full table of summary results (see Thompson, 2006). It would be particularly useful if such a rubric detailed how all terms in the ANOVA results, including but not limited to η2 and ηp2, are (1) computed and (2) related to each other. When a reader is able to ascertain the relationship between all the terms in an ANOVA summary table, the distinction between η2 and ηp2 becomes more meaningful. As a final note, we would add that in many studies using ANOVA and its variants, researchers will often want to go beyond the initial analysis and perform pair-wise comparisons of groups’ mean scores. In such cases, it is not sufficient to report only the effect sizes for the ANOVA main and interaction effects; rather, an eta-squared (equivalent to the square of the point-biserial correlation; rpb) or a standardized mean difference effect size such as Cohen’s d for each pair-wise comparison of interest should be reported and interpreted as well.
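As a rough illustration of the pair-wise effect sizes mentioned above, the following Python sketch computes Cohen’s d (using a pooled standard deviation) and η2 for a single two-group comparison. The scores and function names are hypothetical and serve only to show the arithmetic; for two groups, the η2 computed this way equals the square of the point-biserial correlation between group membership and score.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_1, group_2):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(group_1), len(group_2)
    pooled_var = ((n1 - 1) * variance(group_1) + (n2 - 1) * variance(group_2)) / (n1 + n2 - 2)
    return (mean(group_1) - mean(group_2)) / sqrt(pooled_var)

def eta_squared_pair(group_1, group_2):
    """Eta-squared for one pair-wise comparison: SOS_between / SOS_total."""
    scores = group_1 + group_2
    grand_mean = mean(scores)
    sos_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in (group_1, group_2))
    sos_total = sum((x - grand_mean) ** 2 for x in scores)
    return sos_between / sos_total

# Hypothetical post-test scores for two groups (illustration only)
treated = [8, 9, 10]
control = [4, 5, 6]
print(round(cohens_d(treated, control), 2), round(eta_squared_pair(treated, control), 3))
```

Reporting either index per comparison, alongside the omnibus effect sizes, keeps the pair-wise results interpretable on their own terms.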
VII Conclusions
‘[A]ny effect size that is chosen from possible alternatives should be technically [and nominally] appropriate’ (Grissom and Kim, 2012: 9). The evidence we provide in this article contains numerous and recent examples of often large ηp2 effect sizes in multi-way designs being misinterpreted and mislabeled as η2.
The distinction we draw in this article between η2 and ηp2 is in no way semantic or statistical nit-picking. These effect sizes are increasingly reported throughout quantitative L2 research. They also have immense potential to inform our understanding of L2 learning and use. Clarity about these indices, their reporting, and interpretation is, in fact, critical to arriving at appropriate conclusions regarding L2 theory and practice and, at the same time, to preventing misinterpretations that compromise work in this field.
We remind readers that, like many measures of effect size and in contrast to the dichotomous result embodied by p values, an η2 value provides a continuously-expressed result within a single multi-way ANOVA. Thus, reporting ηp2 values alone which (1) lack comparability advantages within a study and (2) are often larger than η2 values (see Section VI) may lead to erroneous interpretation. Added to the above problems is that ηp2 values depend on the model of analysis (i.e. fixed, random, and mixed). That is, for a given study, if we run the data analysis under fixed-, random-, or mixed-effects models, values of ηp2 for some treatment effects can change. Presenting the reasoning behind this dynamic requires knowledge of ‘Mean Squares Expectation Rubric’, which falls outside the scope of the present article (but see Thompson, 2006).
Based on these considerations, we encourage L2 researchers to compute, report and interpret, by default, (classical) η2 for all ANOVA-based analyses. This approach will provide an estimate of variance accounted for that is insensitive to the model of analysis as well as comparable within a single multi-way study. However, it is also useful for researchers to report ηp2 along with η2 to avoid the possibility of erroneous reporting and interpretation. In addition, in multi-way designs, reporting ηp2 facilitates the calculation of power for an effect, so that the size of that effect can serve as the basis for planning the sample size of relevant future research. Thus, eta- and partial eta-squared serve different purposes that legitimize the presence of both estimates in published multi-way ANOVA studies. Finally, it is critical to note that reporting confidence intervals for ηp2 values is both highly recommended and possible via several statistical packages. For example, an easy-to-use option is the MBESS package in the R software program (Kelley, 2015). Unfortunately, confidence intervals for eta-squared are more complex (often roughly estimated) than those for partial eta-squared, and not currently widely available.
In closing, the results of this study do not present an ideal state of statistical proficiency in L2 research. Nevertheless, we are hopeful that the field’s momentum toward methodological reform – a movement to which the present study seeks to contribute – will continue to improve L2 research and reporting practices thereby leading to a clearer understanding of language learning and use.
Acknowledgements
We are deeply indebted to Professor Bruce Thompson for his insightful comments on the earlier drafts of this article.
Declaration of Conflicting Interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
