Calculating and Reporting Estimates of Effect Size in Counseling Outcome Research

Abstract

The reporting of effect sizes (ESs) and confidence intervals (CIs) of ESs has become recommended practice in the social sciences; however, authors frequently omit these values from manuscripts submitted for publication. Consequently, the meaningfulness and clinical relevance of their findings go unaddressed. As a result, a growing number of scholarly journals now require researchers to incorporate findings of clinical significance in their reporting of results. In this article, we review the most common conventions used for estimating and reporting ESs and CIs of ESs and illustrate how researchers can compute and interpret these measures of practical significance.

Keywords

effect size, confidence intervals, practical significance, outcome research

Null hypothesis significance testing (NHST) has consistently been the most widely accepted and frequently used approach to statistical inference in quantitative research since the 1940s (Hager, 2013; Lindquist, 1940; Nickerson, 2000). In NHST, the probability (p value) of obtaining a test statistic value (e.g., t, F, χ²) at least as extreme as the one observed is calculated and compared to an α level (usually α = .05) that is selected arbitrarily and a priori. When the computed p value is less than or equal to the chosen α level, the null hypothesis is rejected and the observed pattern of data is said to be sufficiently unlikely, given the null being true (Levine, Weber, Hullett, Park, & Lindsey, 2008). Although this approach is immensely popular, it is often criticized as being a trivial exercise utilizing a flawed and erroneous method of determining a statistical event (Cohen, 1990, 1994; Lockett, McWilliams, & Van Fleet, 2014; Norris, 2015; Thompson, 1996, 2007). These critics point to the inability of NHST to provide researchers with the information they actually want to know as one of its biggest faults. Instead of indicating the probability that a null hypothesis is true, NHST only tells us the probability of obtaining data as extreme as or more extreme than those observed if the null hypothesis were true. As a result, NHST results often are misunderstood, with users equating smaller computed p values with a treatment effect of large magnitude and assuming that statistical significance is synonymous with practical significance.

Despite the criticism, the use of NHST continues to dominate the research literature across multiple disciplines (Howard, Maxwell, & Fleming, 2000; Nix & Barnette, 1998; Norris, 2015). The popularity of NHST lies in its simplicity. At its base level, statistical significance testing (p values) conveys to researchers whether or not sample results deviate significantly from null representations (i.e., hypothesis) of the population (Thompson, 2002), resulting in a rudimentary decision of whether to accept or reject an outcome (Page, 2014). From this perspective, analyses resulting in statistically significant outcomes are said to occur outside the realm of chance (Creswell, 2014). According to Thompson (2002), taking this approach is somewhat superficial and uninformative, despite providing researchers with leverage to remain within the realm of null hypothesis testing. For instance, results from a statistical test may indicate a noteworthy difference yet fail to describe whether the results are of importance or even pertain to the population of interest (Thompson, 2002, 2006). Based on these shortcomings, the continual reliance upon NHST may actually thwart researchers’ efforts in pursuit of knowledge, thereby eliminating the need to engage in innovative thinking (Trafimow & Marks, 2015). Motivated by the many concerns raised with using NHST and the interpretation of its results, researchers increasingly are encouraged to report alternative indicators of statistical significance, such as effect size (ES) metrics and confidence intervals (CIs) in their research articles (American Psychological Association [APA], 2010; Thompson, 2007). In fact, a number of trade publications (Journal of Counseling and Development, Journal of Mental Health Counseling, and Professional School Counseling) have adjusted their submission guidelines and are now requiring authors to report these new statistics. 
In addition, one journal, Basic and Applied Social Psychology, recently began banning the reporting of NHST altogether (Trafimow & Marks, 2015).

Conceptually speaking, practical significance testing communicates the extent to which results make a difference, inferring magnitude and directionality of the effect (i.e., strength of the findings; Kuhberger, Fritz, Lermer, & Scherndl, 2015). For instance, results may indicate that a specific counseling practice, such as solution-focused brief therapy, is effective for decreasing symptoms of depression among adults when compared to adolescents. From a statistical perspective, the only information available is the knowledge of an existence of a difference between adults and adolescents; however, practical information regarding the degree of the difference as well as the group benefiting most (i.e., adults or adolescents) is nonexistent. Thus, solely relying upon statistical significance in a practitioner-driven field seems antithetical, warranting counselors to inquire as to the meaningfulness and veracity of such results. Furthermore, intentional counselors driven by evidence-based practices should remain knowledgeable of the differences between statistical and practical significance due to confusion and misnomers surrounding these concepts that seem to remain prevalent in the literature (Wester, Borders, Boul, & Horton, 2013; Wilkinson, 2014).

As Thompson (1999) concisely summarized, statistical significance estimations fail to represent the importance of a result as well as the replicability and magnitude of its effect. In contrast, practical significance measures offer researchers consistent bits of information that can be analyzed across populations (e.g., in meta-analyses). In the era of accountability, failing to report practical measures of effect is seemingly analogous to mistakenly forgetting to solicit clients’ consent prior to experimentation 5% of the time (Thompson, 1999). From this perspective, CIs and ESs offer more clinical value to counselors based on the research they conduct (Page, 2014). The acknowledged value of adding these measures notwithstanding, a significant number of authors still do not report CIs or ES metrics in published research (Finch & Cumming, 2009; Wester et al., 2013). Furthermore, ESs computed in social sciences research are often quite small, complicating their interpretation for researchers (Ferguson, 2009). To facilitate greater acceptance of CI and ES reporting and to improve its quality, this article provides readers a brief overview of best practices for selecting, computing, and reporting on measures of effect and confidence levels of ESs (see Figure 1). Our aim is to help counseling researchers, especially those conducting outcomes-based research, more competently contribute to the professional literature and produce research findings of increased meaning.

Figure 1.

Four-step process for interpreting d-family effect sizes and estimating practical significance.

ES Measures

ES is a general term referring to a class of metrics depicting the magnitude of change among scores associated with an intervention and the degree to which detected results differ from the assumptions of a null hypothesis (Ellis, 2010; Thompson, 2007). Estimations of ES standardize findings across measures, variables, and populations (Lipsey & Wilson, 2001) in a form that is readily understandable by counselors who may not have advanced statistical training (Trusty, Thompson, & Petrocelli, 2004). In this way, the magnitude of a treatment effect can be compared regardless of which assessment a researcher is using, as long as the assessments measure the same construct. Reporting estimations of ES not only promotes comparisons and aggregation of results across studies but also assists counselors in converting their outcomes from an efficacy format to one that can provide evidentiary support for treatment effectiveness within a given population.

The myriad of ESs can be grouped into two classes: those that represent differences within and between groups (the d family) and those that depict association between variables (the r family). Although both are useful for counselors to consider when completing evaluation activities, covering the application of both families may be beyond the scope of this article. Therefore, our focus is on the former class of metrics that depict magnitude of change among gain scores and can be used within single group and between groups research. The d family of ESs can be implemented with continuous or dichotomous variables. However, not all ES metrics are equal, and decisions regarding which one to use are based on the characteristics and type of dependent variable under inspection. We will focus on ES metrics associated with continuous variables by identifying three approaches to estimating ES within the d family, presenting formulae for computation and strategies for interpretation.

ESs With Continuous Variables

Three commonly implemented metrics of ES with continuous variables are Cohen’s d, Glass’s Δ, and Hedge’s g (Fan & Konold, 2010; McMillan & Foley, 2011). Each of these metrics provides counselors with information about magnitude of treatment effect in relation to units of standard deviation (SD; Ellis, 2010). For example, an ES estimate of 1 indicates that scores differ by one SD between measurement intervals (single-group studies) or between groups (between-groups studies), whereas an estimate of .25 indicates a difference equivalent to one fourth of an SD. We recommend that counselors begin computing any of the ESs discussed herein by first identifying the relevant means, SDs, and sample sizes.

Cohen’s d

Several authors characterize Cohen’s d as a biased estimation of ES because it does not account for the influence of sample size on the outcomes of a study (Erford, Savin-Murphy, & Butler, 2010; Fritz, Morris, & Richler, 2012). This metric is indicated when counselors are interested in estimating pre–post contrasts with a single group of participants, when sample size exceeds 50, or when comparing gain scores between groups in low-stakes scenarios in which accounting for sample size is not a pressing concern. Limitations associated with Cohen’s d include inflated estimates of effect among smaller samples, inaccuracy when the homogeneity of variance assumption between SDs is not met, sensitivity of the yielded value to the denominator within the equation, and limited availability in many statistics packages (Fritz et al., 2012). Despite these limitations, d has remained a standard for estimating ES due to the relative ease of computing it with programs such as Microsoft Excel and R and the abundance of online calculators.

Glass’s Δ

Glass’s Δ is a less frequently reported but important metric of ES for situations in which assumptions of homogeneity of variances are not met, the SD of the treatment sample is not known, or a researcher expects an intervention to notably affect the SD of a treated group (Ellis, 2010; Lakens, 2013). Glass’s Δ is also a biased ES and is based on the assumption that when variances are unequally distributed or information about treatment group SD is unavailable, the SD of the nontreated group is most representative of the distribution of a phenomenon of interest within a population (Ellis, 2010). The strength of using this approach to estimating ES is contingent upon the sample size of the control group because a small sample may not be indicative of the general population. Computation of the ES is similar to that of Cohen’s d; however, Glass’s Δ uses the SD of the control group as the denominator rather than the pooled SD. Although many statistics packages do not readily compute this value and open-source online resources are scarce, this metric is easily calculated by hand or by using data management software such as Microsoft Excel.

Hedge’s g

Hedge’s g was developed as an unbiased alternative to Cohen’s d that accounts for the influence that sample size and sampling error have on the outcome associated with an intervention (Ellis, 2010; Erford et al., 2010). The sensitivity to sample size and sampling error is achieved by weighting the SD of each group within the ES formula. This metric is recommended when sample sizes are less than 50, when sample sizes are unequal, or when aggregating ESs across studies (Ellis, 2010; Lakens, 2013). The correction applied to g is relatively small when sample size is large; however, when sample size is small, the corrected value estimating treatment effect can be quite disparate (Fritz et al., 2012). Like the other d family metrics, g is fairly easy to compute by hand and is readily obtained from database management programs, specialized software, and free online resources.

Estimating the Precision of Computed ESs

Scholars have advocated for the inclusion of ES measures in the reporting of published research for decades (APA, 2010; Cumming, Fidler, Kalinowski, & Lai, 2012; Hammond, 1996; Thompson, 2006; Trusty et al., 2004). Although ESs allow researchers to more effectively communicate the practical significance of their findings, the reporting of simple ESs alone may not accurately convey the strength of the relationship between the variables studied. ESs, like other statistics, are subject to sampling uncertainties and measurement error. The presence of error can have a significant impact on the interpretation of a reported ES measure (Burchinal, 2008). As a result, researchers are encouraged to report ES measures along with some measure of their sampling uncertainty (Hedges, 2008). A common approach to addressing the issue of estimate precision is to construct CIs around the computed ES statistic.

CIs of ESs

Because simple measures of ES often fail to account for chance or random error, CIs can be computed to measure the degree of uncertainty included in these computed point estimates (e.g., ESs). A CI is a range of plausible values in which an unknown true population parameter is likely to be contained a specified percentage of the time (Gravetter & Wallnau, 2012). This range of values is calculated around a point estimate and includes values both above and below the point estimate. How often a CI includes the true unknown population parameter, in this case the true ES, is known as the confidence of the CI. In social science research, a 95% CI is most commonly used. The decision to use a 95% CI originates from the traditional convention of selecting a p value of .05 to denote the threshold for statistical significance (McCormack, Vandermeer, & Allan, 2013). Related to ES measures, establishing confidence at the 95% level indicates that the researcher expects the true ES to fall within the established CI in 95% of all possible replications of the study. In the remaining 5%, the true ES would be expected to fall outside the established CI. Although the 95% CI is used most commonly, the decision to do so is left to the purview of the researcher. In some cases, researchers could either choose a CI with a greater (99%) or lesser (90%) degree of confidence depending on the research questions being addressed and the nature of their research design.

In addition to confidence, another important characteristic of a CI is its width (Gravetter & Wallnau, 2012). The width of a CI is measured by taking the difference between the upper and lower limits of the interval and relating it to the precision of the estimate. Wider intervals correspond to greater confidence. The wider the CI, the more confident a researcher can be that it likely captures the true unknown ES. However, increases in confidence result in lower levels of precision. To demonstrate this point, we examine how increasing confidence impacts interval width. In the illustrative example provided later in this article, a group of adolescent girls assigned to either a relational conflict resolution (RCS) group or programming as usual (PAU) group were assessed to determine the impact of treatment on engagement in violence. To assess the magnitude of the treatment effect observed, a Hedge’s g value of −.58 was computed. The 95% CI constructed around this ES metric would be [−1.39, 0.13], resulting in an interval width of 1.52 units. When a 99% CI is used, the interval estimate grows to [−1.51, 0.35] with an interval width of 1.86 units. Conversely, lowering confidence levels will result in a higher degree of precision. For the same data, the 90% CI would be represented by an interval estimate of [−1.03, 0.15]. The corresponding interval width of 1.18 units is smaller than the widths for either the 95% or 99% CIs, reflecting less confidence but a higher degree of precision.

When selecting a CI, researchers should aim for precision over confidence. Although greater confidence equates to a larger interval estimate, these intervals are imprecise and give very little insight into the true population parameter (Liu, Loudermilk, & Simpson, 2014). The precision of an interval estimate is affected by two sample statistics: SD and sample size. The smaller the SD, the more similar individual scores are among the sample group. This decrease in variability results in greater estimate precision. Since researchers cannot manipulate the SD of a sample of scores, the more practical approach to improving estimate precision is through sample size manipulation. By increasing the size of their sample, researchers improve precision. Since larger samples have a greater propensity to approximate the unknown population and result in more precise estimates, researchers should strongly consider collecting additional data before making any definitive conclusions based on their results (Akobeng, 2008; Cohen, 1994).
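
To make the link between sample size and precision concrete, the following Python sketch applies the SE formula that this article presents in Tables 1–3 to a hypothetical ES of −0.80 observed with equal group sizes. The function names, the group sizes, and the fixed z value of 1.96 are illustrative assumptions and are not part of the original analysis.

```python
import math


def se_of_es(es, n1, n2):
    """SE of a d-family ES, using the formula given in Tables 1-3."""
    return math.sqrt((n1 + n2) / (n1 * n2) + es ** 2 / (2 * (n1 + n2)))


def ci_width(es, n1, n2, z=1.96):
    """Width of the z-based CI: upper limit minus lower limit."""
    return 2 * z * se_of_es(es, n1, n2)


# Hypothetical scenario: the same ES observed with ever larger groups.
for n_per_group in (15, 30, 60, 120):
    width = ci_width(-0.80, n_per_group, n_per_group)
    print(f"n per group = {n_per_group:>3}: 95% CI width = {width:.2f}")
```

Under this formula with equal group sizes, the SE is proportional to 1/√n, so quadrupling the sample size halves the interval width.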

Constructing ES CIs

CIs are calculated based on the standard error (SE) of a measurement. Once the SE is calculated, the CI can be determined by multiplying the SE by a constant that reflects the level of confidence desired (e.g., 95%) based on the normal distribution. The type of CI constructed depends on the information available to the researcher. When the researcher has prior knowledge of the population SD (σ), a z-score approximation is used in the formula for constructing a CI. In this case, the formula for computing a CI around an ES metric would be:

$CI = ES \pm z^{*} SE_{ES}$.

In this formula, ES represents the computed ES, $z^{*}$ represents the z-score corresponding to the desired confidence level, and $SE_{ES}$ represents the SE for the computed ES value. While researchers have the option of selecting any size confidence level, the z-scores corresponding to the most commonly selected CIs are 1.65 (90% confidence), 1.96 (95% confidence), and 2.58 (99% confidence). Examples of the CI computations using this formula for the various ES metrics are included in Tables 1–3.
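
As a minimal sketch of this formula, the Python code below derives the critical z value from the standard normal distribution and builds the interval around an ES. The function names are hypothetical, and the ES and SE passed in at the end are simply the rounded Cohen’s d values from the article’s example.

```python
from statistics import NormalDist


def z_critical(confidence):
    """Two-tailed critical z for a confidence level such as 0.95."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def es_confidence_interval(es, se, confidence=0.95):
    """Return (lower, upper) limits of CI = ES +/- z* x SE."""
    margin = z_critical(confidence) * se
    return es - margin, es + margin


# Critical values for the three most commonly selected confidence levels.
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z* = {z_critical(level):.3f}")

# Rounded Cohen's d (-0.83) and SE (0.36) from the article's example.
lower, upper = es_confidence_interval(-0.83, 0.36, confidence=0.95)
print(f"95% CI = [{lower:.2f}, {upper:.2f}]")
```

To three decimals the critical values are 1.645, 1.960, and 2.576; the two-decimal values quoted in the text are conventional roundings of these.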

Table 1.

Cohen’s d Effect Size (ES) and Confidence Interval (CI) Formulas.

ES Metric	Associated Formulas	Procedure	Application to Data Set
Cohen’s d	$d = \frac{M_{1} - M_{2}}{S D_{pooled}}$	1. Identify the posttest means for the treatment (M ₁) and comparison (M ₂) groups	M ₁ = 1.29 M ₂ = 1.49
	$S D_{pooled} = \sqrt{\frac{{(S D_{1})}^{2} + {(S D_{2})}^{2}}{2}}$	2. Compute pooled standard deviation (SD) for the treatment (SD ₁) and comparison (SD ₂) groups	$S D_{pooled} = \sqrt{\frac{{(.22)}^{2} + {(.25)}^{2}}{2}}$ = .24
		3. Divide the difference between the treatment group mean and the comparison group mean by the pooled SD	$d = \frac{1.29 - 1.49}{0.24} = - 0.83$
		4. Determine the amount of confidence you wish to have in your estimate and apply appropriate z-score value	z = 1.65 (90%) z = 1.96 (95%) z = 2.58 (99%)
	$S E_{ES} = \sqrt{\frac{n_{1} + n_{2}}{n_{1} n_{2}} + \frac{{ES}^{2}}{2 (n_{1} + n_{2})}}$	5. Compute SE for ES found in Step 3	$S E_{ES} = \sqrt{\frac{18 + 15}{(18) (15)} + \frac{{(- 0.83)}^{2}}{2 (18 + 15)}} = 0.36$
	$CI = ES \pm z^{*} S E_{ES}$	6. Compute desired CI using the values identified in Steps 4 and 5 above	$CI = - 0.83 \pm 1.96 (0.36) =$ [−1.54, −0.12]

Table 2.

Glass’s Δ Effect Size (ES) and Confidence Interval (CI) Formulas.

ES Metric	Associated Formulas	Procedure	Application to Data Set
Glass’s Δ	$Δ = \frac{M_{1} - M_{2}}{S D_{control}}$	1. Identify the posttest means for the treatment (M ₁) and comparison (M ₂) groups	M ₁ = 1.29 M ₂ = 1.49
		2. Divide the difference between the treatment group mean and the comparison group mean by the SD of the comparison group	$Δ = \frac{1.29 - 1.49}{0.25}$ = −0.8
		3. Determine the amount of confidence you wish to have in your estimate and apply appropriate z-score value	z = 1.65 (90%) z = 1.96 (95%) z = 2.58 (99%)
	$S E_{ES} = \sqrt{\frac{n_{1} + n_{2}}{n_{1} n_{2}} + \frac{{ES}^{2}}{2 (n_{1} + n_{2})}}$	4. Compute SE for ES found in Step 2	$S E_{ES} = \sqrt{\frac{18 + 15}{(18) (15)} + \frac{{(- 0.8)}^{2}}{2 (18 + 15)}} = 0.36$
	$CI = ES \pm z^{*} S E_{ES}$	5. Compute desired CI using the values identified in Steps 3 and 4 above	$CI = - 0.8 \pm 1.96 (0.36) =$ [−1.51, −0.09]

Table 3.

Hedge’s g Effect Size (ES) and Confidence Interval (CI) Formulas.

ES Metric	Associated Formulas	Procedure	Application to Data Set
Hedge’s g		1. Compute Cohen’s d metric	d = −.83
	$g = d (1 - \frac{3}{4 N - 9})$	2. Multiply the d value by the quantity 1 minus 3 divided by (4N − 9), where N is the total sample size	$g = - 0.83 (1 - \frac{3}{4 (33) - 9}) = - 0.81$
		3. Determine the amount of confidence you wish to have in your estimate and apply appropriate z-score value	z = 1.65 (90%) z = 1.96 (95%) z = 2.58 (99%)
	$S E_{ES} = \sqrt{\frac{n_{1} + n_{2}}{n_{1} n_{2}} + \frac{{ES}^{2}}{2 (n_{1} + n_{2})}}$	4. Compute SE for ES found in Step 2	$S E_{ES} = \sqrt{\frac{18 + 15}{(18) (15)} + \frac{{(- 0.81)}^{2}}{2 (18 + 15)}} = 0.36$
	$CI = ES \pm z^{*} S E_{ES}$	5. Compute desired CI using the values identified in Steps 3 and 4 above	$CI = - 0.81 \pm 1.96 (0.36) =$ [−1.52, −0.10]

Interpreting ES CIs

Whereas hypothesis testing is concerned only with statistical significance, CIs can be used to convey both statistical and clinical information. In this section, we discuss two distinct ways to properly interpret CIs. A third interpretation of CIs relates to their ability to indicate statistical significance. However, this interpretation is beyond the scope of this article and its focus on interpreting CIs for ESs.

CI as an interval of plausible values

One advantage of using CIs is that they provide information about the range of effect of a treatment or intervention that counselors might find useful in making clinical decisions related to client care and treatment (Stratford, 2010). Using the 95% CI constructed around the Hedge’s g value earlier, we would say that we are 95% confident that the true effect of participating in the RCS group lies between −1.39 and 0.13 SD units.

CI as a part of an infinite sequence

A second interpretation of CIs refers to their ability to convey the uncertainty of information obtained from a single sample. Theoretically, the CI we compute around a point estimate is only one of an infinite number of possible CIs that could have been computed. Using a 95% CI means that 95% of all potential CIs that could have been computed would include the true population parameter and be classified as a true statement. In the remaining 5% of potential CIs, the true population parameter would be missed, resulting in a false statement.

Interpreting ESs and Estimating Practical Significance

When interpreting yielded ESs and estimating the practical significance of an intervention, we suggest using the following four-step process.

Step 1: Reference Conventions

First, consider an ES value in reference to conventions suggested by previous researchers. For example, within d-family ESs, these may include conventions for primary studies suggested by Cohen (1988) for small (ES ≤ .20), medium (ES = .50), or large (ES ≥ .80) or Lipsey and Wilson (2001) for meta-analyses for describing magnitudes as small (ES ≥ .30), medium (ES ≥ .50), and large (ES ≥ .67). These conventions are a useful starting place for counselors but should not be treated as an end in themselves because crude explanations such as “indicative of a medium effect” do not convey meaningful information and can obscure the practical significance of these values. For example, a small effect (ES = .20) may not be enough of an observed change to warrant overhauling an afterschool supplemental education program, but a difference of 20% of an SD would certainly have practical significance in the context of recidivism among first-time offenders.

Step 2: Convert to SD Units

Next, we recommend reporting a yielded ES value in terms of SD units. For example, a d-family ES of .50 represents a magnitude of change that is 50% of 1 SD unit of change. In the case of between-group evaluations, this would indicate that one group outperformed the other by about half of an SD. Similarly, within a single-group, preexperimental design, this would indicate that the group reported change along a construct over time to a degree of about half of an SD. Because ES is a measure-free metric, this information encourages counselors to consider what half an SD may look like with their population and among the assessments that they use.
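
One way to picture what a fraction of an SD looks like on a particular assessment is to multiply the ES by that assessment’s SD, which returns the difference in raw score units. The short sketch below does this for a hypothetical instrument whose scores have an SD of 10 points; the function name and the instrument SD are assumptions made only for illustration.

```python
def es_in_raw_units(es, instrument_sd):
    """Translate a d-family ES into the raw score units of an instrument."""
    return es * instrument_sd


# Hypothetical instrument whose scores have an SD of 10 points:
# an ES of .50 (half an SD) corresponds to a 5-point difference.
print(es_in_raw_units(0.50, 10.0))
```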

Step 3: Estimate Precision of ES

Third, we recommend providing a statement about the degree of precision associated with a yielded ES that is based on the CI surrounding the value. While there are no steadfast conventions for interpreting what CI width is best, CI bands that are glaringly wide may be of concern as they represent an absence of precision associated with our ES. Additionally, it is important to remember that when a CI includes the value 0, our ES should be regarded as an untrustworthy estimate of treatment effect that is likely influenced by sampling error.
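
The two checks described in this step, interval width and whether the interval contains 0, can be sketched in a few lines of Python. The helper name is hypothetical; the limits passed to it are the Hedge’s g limits from the illustrative example (−1.52 and −0.10) and the limits quoted in the earlier discussion of interval width (−1.39 and 0.13).

```python
def describe_ci(lower, upper):
    """Return the interval width and whether the interval contains 0."""
    width = round(upper - lower, 2)
    includes_zero = lower <= 0 <= upper
    return width, includes_zero


for lower, upper in [(-1.52, -0.10), (-1.39, 0.13)]:
    width, includes_zero = describe_ci(lower, upper)
    print(f"[{lower}, {upper}]: width = {width}, includes 0 = {includes_zero}")
```

The first interval lies entirely below 0, whereas the second contains 0 and would therefore call for a more cautious statement about the treatment effect.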

Step 4: Situate Values Into Context

Once an ES has been reported by convention, converted to SD units, and described in terms of precision, it is important to situate the findings within their clinical or educational context and describe the potential practical impact that results may represent. This requires a systemic understanding of the value of an intervention associated with the degree of change that is detected. After all, what is reported as a small ES may contribute to large changes over time, and even large effects may apply to so few people that they do little to influence the overall functioning of a community or system.

Illustrative Example

A fabricated data set was developed to provide a reference for calculating each of the three approaches to estimating ES in the context of counseling research. Consider the example of a 6-week RCS group for adolescent girls in an urban setting intended to decrease the likelihood of engaging in violence. Within this sample, the RCS treatment group of 18 girls and PAU group of 15 girls had mean pretest ratings on the Likelihood of Violence and Delinquency Scale (LVDS; Flewelling, Paschall, & Ringwalt, 1993) of 1.43 (SD = .48) and 1.42 (SD = .43), respectively. Over the next 6 weeks, groups were exposed to either the treatment or control condition. One week following the completion of the intervention, ratings on the LVDS yielded mean group scores of 1.29 (SD = .22) and 1.49 (SD = .25) for the treatment and PAU groups, respectively. Using NHST, a researcher would be able to report whether a statistically significant difference exists between the groups. However, by reporting ES, the researcher now has the ability to more clearly interpret the substantive significance of the findings and understand them in the context of the problem being examined.

Calculations of ESs and Their Corresponding CIs

Utilizing the descriptive statistics contained in the fabricated data set, we now demonstrate the steps involved in computing the various ES measures described in this article and their corresponding CIs. Each of these steps is further depicted in Tables 1–3.

Cohen’s d

The following steps and calculations are described in Table 1. To compute Cohen’s d, we need the mean posttest LVDS rating scores for each group (Step 1) and the pooled SD for the combined groups. The posttest mean scores are given to us in the illustrative example. For the treatment and PAU groups, the mean posttest ratings were 1.29 and 1.49, respectively. To compute the pooled SD (Step 2), we square the individual SD values for each group, add these two values together, divide the sum by two, and then take the square root of the resulting value. Using the illustrative example data and these steps, we would obtain a value of √{[(0.22)² + (0.25)²]/2} = √(0.11/2) = √(0.06) = 0.24 for our pooled SD. To compute Cohen’s d, we take the difference between the two groups’ mean posttest rating scores and divide by the pooled SD (Step 3). This gives us a Cohen’s d ES value of (1.29 – 1.49)/0.24 = −0.83.

To compute the CI for this ES value, we first need to determine how much confidence we want to have in our CI (Step 4). For this example, we will use a 95% CI. A 95% CI corresponds to a 5% error rate (α = .05), and the z-score corresponding to a two-tailed test with 5% error is 1.96. Next, we need to compute the SE for the ES value we computed above (Step 5). The formula for computing SE of ES tells us to add the sample sizes for each group (n ₁ and n ₂) and divide by the product of these two sample size values. To this value, we add ES squared divided by two times the sum of the two sample size values. Lastly, we take the square root of that sum. Using our existing data and these steps, we have √{[(18 + 15)/((18)(15))] + [(−0.83)²/(2(18 + 15))]} = √(0.12 + 0.01) = √(0.13) = 0.36. To find the 95% CI (Step 6), we multiply the SE value (0.36) by the z-score value (1.96) and then both add this product to and subtract it from our computed ES measure (−0.83). Performing these functions gives us (−0.83 ± 0.71) for a range of −1.54 to −0.12.
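
The Cohen’s d steps above can be sketched in Python as follows. The function names are hypothetical, and each intermediate value is rounded to two decimals on purpose so that the output reproduces the hand calculation; this rounding is an assumption about how the tabled values were obtained.

```python
import math


def pooled_sd(sd1, sd2):
    """Root mean square of the two group SDs (Table 1, Step 2)."""
    return math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)


def se_of_es(es, n1, n2):
    """SE of the ES (Table 1, Step 5)."""
    return math.sqrt((n1 + n2) / (n1 * n2) + es ** 2 / (2 * (n1 + n2)))


def cohens_d_with_ci(m1, m2, sd1, sd2, n1, n2, z=1.96):
    """Return (d, se, lower, upper), rounding each intermediate value
    to two decimals to mirror the hand calculation."""
    sd_p = round(pooled_sd(sd1, sd2), 2)
    d = round((m1 - m2) / sd_p, 2)
    se = round(se_of_es(d, n1, n2), 2)
    margin = z * se
    return d, se, round(d - margin, 2), round(d + margin, 2)


# Posttest values from the illustrative example (treatment first, PAU second).
d, se, lower, upper = cohens_d_with_ci(1.29, 1.49, 0.22, 0.25, 18, 15)
print(f"d = {d}, SE = {se}, 95% CI = [{lower}, {upper}]")
```

Carrying the unrounded pooled SD (about 0.2355) through the calculation instead gives d ≈ −0.85, so small departures from the tabled values reflect rounding at intermediate steps rather than a different method.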

Glass’s Δ

The following steps and calculations are described in Table 2. To compute Glass’s Δ, we again need the mean posttest LVDS rating scores for each group (Step 1). Recall that for the treatment and PAU groups, the mean posttest ratings were 1.29 and 1.49, respectively. Now, instead of using the pooled SD, we will be using only the SD for the PAU control group (Step 2). Dividing the difference between the two posttest mean rating scores by the SD for the PAU group gives us [(1.29 − 1.49)/(0.25)] = −0.8 as Glass’s Δ ES value.

As with Cohen’s d, we need to determine an appropriate level of confidence (Step 3). Assuming the same 5% error rate, we will continue using the 95% CI. To compute the CI for this ES measure, we apply the same steps we used to compute the CI for the Cohen’s d value, inserting the new ES value (−0.8) into the equation (Step 4). Doing so gives us the following: √{[(18 + 15)/((18)(15))] + [(−0.8)²/(2(18 + 15))]} = √(0.12 + 0.01) = √(0.13) = 0.36. The 95% CI is then found by multiplying this SE value (0.36) by the z-score value (1.96) and then adding this product to and subtracting it from −0.80, our computed ES measure (Step 5). Performing these functions gives us (−0.8 ± 0.71) for a range of −1.51 to −0.09.
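
A corresponding sketch for Glass’s Δ appears below. It follows the article in reusing the same SE formula that was applied to Cohen’s d; the function names are hypothetical, and values are rounded to two decimals to match the hand calculation.

```python
import math


def glass_delta(m_treatment, m_control, sd_control):
    """Mean difference scaled by the control group SD only."""
    return (m_treatment - m_control) / sd_control


def se_of_es(es, n1, n2):
    """SE of the ES (Table 2, Step 4)."""
    return math.sqrt((n1 + n2) / (n1 * n2) + es ** 2 / (2 * (n1 + n2)))


# Posttest means, the PAU (control) group SD, and the two group sizes.
delta = round(glass_delta(1.29, 1.49, 0.25), 2)
se = round(se_of_es(delta, 18, 15), 2)
lower = round(delta - 1.96 * se, 2)
upper = round(delta + 1.96 * se, 2)
print(f"delta = {delta}, SE = {se}, 95% CI = [{lower}, {upper}]")
```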

Hedge’s g

The following steps and calculations are described in Table 3. Hedge’s g is calculated by taking the Cohen’s d value (−0.83) computed earlier (Step 1) and multiplying it by the quantity one minus three divided by the difference between four times the total number of study participants (N) and nine (Step 2). Note that the values 3, 4, and 9 used in this equation are constants and will remain the same for all Hedge’s g calculations you may make. Applying these steps, we have a formula that looks like this: −0.83 (1 – 3/[4(33) − 9]) = −0.81.

Again, we will use a 95% CI because we are assuming 5% error in our study (Step 3). The CI is calculated by applying the same steps used to compute the CI for the Cohen’s d value (Step 4). Inserting the newly computed ES measure (−0.81) into our now standard equation gives us √{[(18 + 15)/((18)(15))] + [(−0.81)²/(2(18 + 15))]} = √(0.12 + 0.01) = √(0.13) = 0.36. Lastly, we find the 95% CI (Step 5) by multiplying this SE value (0.36) by the z-score value (1.96) and then subtracting this product from and adding it to our computed ES measure (−0.81). The result is a 95% CI range of −1.52 to −0.10.
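The small-sample correction and the CI that follows from it can be sketched in Python as below. The helper name `hedges_g` is hypothetical; the correction factor 1 − 3/(4N − 9) and the input values come from the worked example, and g is rounded to two decimals before the CI step to mirror the hand calculation.

```python
import math


def hedges_g(d, n_total):
    """Apply the correction factor 1 - 3 / (4N - 9) to Cohen's d."""
    return d * (1 - 3 / (4 * n_total - 9))


d = -0.83          # Cohen's d computed earlier
n1, n2 = 18, 15    # group sample sizes
z = 1.96           # z-score for a 95% CI

# Round to two decimals, as in the worked example
g = round(hedges_g(d, n1 + n2), 2)

se = math.sqrt((n1 + n2) / (n1 * n2) + g ** 2 / (2 * (n1 + n2)))
lower, upper = g - z * se, g + z * se

print(f"g = {g:.2f}, SE = {se:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
# prints: g = -0.81, SE = 0.36, 95% CI = [-1.52, -0.10]
```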

Interpreting Results and Estimating Practical Significance

To demonstrate the application of the four-step method for interpreting ESs and estimating practical significance described earlier, we will apply the method to the Hedges’ g results associated with the fabricated data set previously described. First, inspection of group differences across scores for the LVDS subscale suggested that completing the conflict resolution program (g = −0.81) was associated with a large ES in reference to Cohen’s (1988) conventions for interpretation. This finding suggests that participants who completed the conflict resolution program reported LVDS scores about 81% of one SD lower than those who completed PAU. However, inspection of the CI associated with the detected point estimate indicates a modest degree of accuracy, wherein the magnitude of treatment effect may be between −1.52 and −0.10. This detected point estimate is greater than those reported in previous studies of conflict resolution programming delivered with urban adolescents (Balkin, Miller, Ricard, Garcia, & Lancaster, 2011; Lancaster, Lenz, Meadows, & Brown, 2013). Given that participants in both groups reported similar frequency and duration of LVDS symptoms at intake, these findings indicate that conflict resolution programming was associated with a degree of change that was above average when compared to the typical adolescent within our sample who is receiving PAU. It is reasonable to believe that the interpersonal processes and skills modules inherent within the conflict resolution program may have contributed to meaningful decreases in the likelihood of engaging in violence among participants.
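One part of this interpretation, comparing the point estimate and its CI limits against Cohen’s (1988) benchmarks, can be illustrated with a short Python sketch. The cutoffs of 0.2, 0.5, and 0.8 are Cohen’s conventional values for small, medium, and large effects; the function name `cohen_label` and the label "below small" are assumptions made for illustration only.

```python
def cohen_label(es):
    """Label the absolute size of a d-family ES using Cohen's (1988) benchmarks."""
    size = abs(es)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"  # illustrative label, not one of Cohen's terms


g = -0.81                        # point estimate from the worked example
ci_lower, ci_upper = -1.52, -0.10  # 95% CI limits from the worked example

point_label = cohen_label(g)
bound_labels = (cohen_label(ci_lower), cohen_label(ci_upper))
# The interval excludes zero when both limits fall on the same side of zero
excludes_zero = ci_lower > 0 or ci_upper < 0

print(point_label, bound_labels, excludes_zero)
# prints: large ('large', 'below small') True
```

The spread of labels across the two CI limits reflects the modest precision noted above: the interval excludes zero, but it runs from a large effect to one smaller than Cohen’s small benchmark.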

Conclusion

In this article, we introduced readers to the various d-family ES measures commonly used to represent differences within and between groups, illustrated the computation of ES metrics and their corresponding CIs, and explained how they should be interpreted and addressed in scholarly writing. Using these conventions, counseling researchers conducting outcomes-based research can better speak to the practical significance of their findings and help bridge the gap between counseling research and counseling practice. Although several researchers (Ellis, 2010; Lakens, 2013; Valentine, Aloe, & Lau, 2015) have illustrated the importance of using ES metrics to estimate the practical differences associated with treatment effects, applications of these procedures to counseling outcomes are unique given high-stake implications for the use of evidence-based practices within the current sociopolitical context surrounding public health. For this reason, counselors are obligated by the American Counseling Association Code of Ethics (2014) to report research findings accurately without misrepresenting their data. We submit that reporting and accurately interpreting ES and the related CIs are two important processes in promoting transparency for consumers of research and comparisons of findings across studies by primary researchers. This article depicted one family of ESs (d family), but certainly this imperative exists across others not mentioned.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Akobeng, A. K. (2008). Confidence intervals and p-values in clinical decision making. Acta Paediatrica, 97, 1004–1007. doi:10.1111/j.1651-2227.2008.00836.x

American Counseling Association. (2014). ACA code of ethics. Retrieved from http://www.counseling.org/docs/ethics/2014-aca-code-of-ethics.pdf?sfvrsn=4

American Psychological Association. (2010). Publication manual of the APA (6th ed.). Washington, DC: Author.

Balkin, R. S., Miller, Ricard, Garcia, & Lancaster (2011). Assessing factors in adolescent adjustment as precursors to recidivism in court-referred youth. Measurement and Evaluation in Counseling and Development, 44, 52–59. doi:10.1177/0748175610391611

Burchinal, M. R. (2008). How measurement error affects the interpretation and understanding of effect sizes. Child Development Perspectives, 2, 178–180.

Cohen (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.

Cohen (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312. doi:10.1037/0003-066x.45.12.1304

Cohen (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.

Creswell, J. W. (2014). Research design: Qualitative, quantitative, and mixed methods approaches. Thousand Oaks, CA: Sage.

Cumming, Fidler, Kalinowski, & Lai (2012). The statistical recommendations of the American Psychological Association Publication Manual: Effect sizes, confidence intervals, and meta-analysis. Australian Journal of Psychology, 64, 138–146. doi:10.1111/j.1742-9536.2011.00037.x

Ellis, P. D. (2010). The essential guide to effect sizes: Statistical power, meta-analysis, and the interpretation of research results. Cambridge, England: Cambridge University Press.

Erford, B. T., Savin-Murphy, J. A., & Butler (2010). Conducting a meta-analysis of counseling outcome research: Twelve steps and practical procedures. Counseling Outcome Research and Evaluation, 1, 19–43. doi:10.1177/2150137809356682

Fan, & Konold, T. R. (2010). Statistical significance versus effect size. In Peterson, Baker, & McGaw (Eds.), International encyclopedia of education (Vol. 7, pp. 444–450). Oxford, England: Elsevier.

Ferguson, C. J. (2009). An effect size primer: A guide for clinicians and researchers. Professional Psychology: Research and Practice, 40, 532–538. doi:10.1037/a0015808

Finch & Cumming (2009). Putting research in context: Understanding confidence intervals from one or more studies. Journal of Pediatric Psychology, 34, 903–916. doi:10.1093/jpepsy/jsn118

Flewelling, R. L., Paschall, M. J., & Ringwalt, C. L. (1993). SAGE baseline survey. Research Triangle Park, NC: Research Triangle Institute.

Fritz, C. O., Morris, P. E., & Richler (2012). Effect size estimates: Current use, calculations, and interpretation. Journal of Experimental Psychology: General, 141, 2–18. doi:10.1037/a0024338

Gravetter, F. J., & Wallnau, L. B. (2012). Statistics for the behavioral sciences (9th ed.). Belmont, CA: Wadsworth.

Hager (2013). The statistical theories of Fisher and of Neyman and Pearson: A methodological perspective. Theory & Psychology, 23, 251–270. doi:10.1177/0959354312465483

Hammond (1996). The objections to null hypothesis testing as a means of analyzing psychological data. Australian Journal of Psychology, 48, 104–106. doi:10.1080/00049539608259513

Hedges, L. V. (2008). What are effect sizes and why do we need them? Child Development Perspectives, 2, 167–171.

Howard, G. S., Maxwell, S. E., & Fleming, K. J. (2000). The proof of the pudding: An illustration of the relative strengths of null hypothesis, meta-analysis, and Bayesian analysis. Psychological Methods, 5, 315–332.

Kuhberger, Fritz, Lermer, & Scherndl (2015). The significance fallacy in inferential statistics. BMC Research Notes, 8, 1–9. doi:10.1186/s13104-015-1020-4

Lakens (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 1–12. doi:10.3389/fpsyg.2013.00863

Lancaster, Lenz, A. S., Meadows, & Brown, K. C. (2013). Evaluation of a conflict resolution program for urban adolescent girls. Journal for Specialists in Group Work, 38, 225–240. doi:10.1080/01933922.2013.804897

Levine, T. R., Weber, Hullett, C. R., Park, H. S., & Lindsey (2008). A critical assessment of null hypothesis significance testing in quantitative communication research. Human Communication Research, 34, 171–187.

Lindquist, E. F. (1940). Statistical analysis in educational research. Boston, MA: Houghton Mifflin.

Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.

Liu, X. S., Loudermilk, & Simpson (2014). Introduction to sample size choice for confidence intervals based on t statistics. Measurement in Physical Education and Exercise Science, 18, 91–100. doi:10.1080/1091367X.2013.864657

Lockett, McWilliams, & Van Fleet, D. D. (2014). Reordering our priorities by putting phenomena before design: Escaping the straitjacket of null hypothesis significance testing. British Journal of Management, 25, 863–873. doi:10.1111/1467-8551.12063

McCormack, Vandermeer, & Allan, G. M. (2013). How confidence intervals become confusion intervals. BMC Medical Research Methodology, 13, 134. doi:10.1186/1471-2288-13-134

McMillan, J. H., & Foley (2011). Reporting and discussing effect size: Still the road less traveled? Practical Assessment, Research & Evaluation, 16. Retrieved from http://pareonline.net/getvn.asp?v=16&n=14

Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241–301. doi:10.1037/1082-989X.5.2.241

Nix, T. W., & Barnette, J. J. (1998). The data analysis dilemma: Ban or abandon. A review of null hypothesis significance testing. Research in the Schools, 5, 3–14.

Norris, J. M. (2015). Statistical significance testing in second language research: Basic problems and suggestions for reform. Language Learning, 65, 97–126. doi:10.1111/lang.12114

Page (2014). Beyond statistical significance: Clinical interpretation of rehabilitation research literature. International Journal of Sports Physical Therapy, 9, 726–736.

Stratford (2010). The added value of confidence intervals. Physical Therapy, 90, 333–335. doi:10.2522/ptj.2010.90.3.333

Thompson (1996). AERA editorial policies regarding statistical significance testing: Three suggested reforms. Educational Researcher, 25, 26–30. doi:10.2307/1176337

Thompson (1999). Why “encouraging” effect size reporting is not working: The etiology of researcher resistance to changing practices. Journal of Psychology, 133, 133–140. doi:10.1080/00223989909599728

Thompson (2002). “Statistical,” “Practical,” and “Clinical”: How many kinds of significance do counselors need to consider? Journal of Counseling & Development, 80, 64–71. doi:10.1002/j.1556-6678.2002.tb00167.x

Thompson (2006). Role of effect sizes in contemporary research in counseling. Counseling and Values, 50, 176–186. doi:10.1002/j.2161-007X.2006.tb00054.x

Thompson (2007). Effect sizes, confidence intervals, and confidence intervals of effect sizes. Psychology in the Schools, 44, 423–432. doi:10.1002/pits.20234

Trafimow & Marks (2015). Editorial. Basic and Applied Social Psychology, 37, 1–2. doi:10.1080/01973533.2015.1012991

Trusty, Thompson, & Petrocelli, J. V. (2004). Practical guide to reporting effect size in quantitative research in the Journal of Counseling and Development. Journal of Counseling and Development, 82, 107–110.

Valentine, J. C., Aloe, A. M., & Lau, T. S. (2015). Life after NHST: How to describe your data without “p-ing” everywhere. Basic and Applied Social Psychology, 37, 260–273. doi:10.1080/01973533.2015.1060240

Wester, K. L., Borders, L. D., Boul, & Horton (2013). Research quality: Critique of quantitative articles in the Journal of Counseling and Development. Journal of Counseling and Development, 91, 280–290. doi:10.1002/j.1556-6676.2013.00096.x

Wilkinson (2014). Distinguishing between statistical significance and practical/clinical meaningfulness using statistical inference. Sports Medicine, 44, 295–301. doi:10.1007/s40279-013-0125-y