Comparing Alternative Corrections for Bias in the Bias-Corrected Bootstrap Test of Mediation

Abstract

Although the bias-corrected (BC) bootstrap is an often-recommended method for testing mediation due to its higher statistical power relative to other tests, it has also been found to have elevated Type I error rates with small sample sizes. When participant recruitment is limited, obtaining a larger sample is not always feasible. Thus, this study examines whether alternative corrections for bias in the BC bootstrap test of mediation can, for small sample sizes, achieve equal levels of statistical power without the associated increase in Type I error. A simulation study was conducted to compare Efron and Tibshirani’s original correction for bias, z₀, to six alternative corrections for bias: (a) mean, (b–e) Winsorized mean with 10%, 20%, 30%, and 40% trimming in each tail, and (f) medcouple (robust skewness measure). Most variation in Type I error (given a medium effect size of one regression slope and zero for the other slope) and power (small effect size in both regression slopes) was found with small sample sizes. Recommendations for applied researchers are made based on the results. An empirical example using data from the ATLAS drug prevention intervention study is presented to illustrate these results. Limitations and future directions are discussed.

Keywords

bias-corrected bootstrap, tests of mediation, corrections for bias, indirect effects, Winsorized means

In evaluating programs for health care, training services, or teaching methods, the goal is to determine whether the program has desired effects on outcomes of interest to warrant continued resource allotment, implementation, or funding. Many studies, however, have found that causal relationships are not always direct; the implemented program is not directly causing the outcome variables to change. Instead, causal relationships may involve one or more intermediate variables, known as mediators. For example, Maruska et al. (2016) found that a smoking prevention program (independent variable) indirectly affected adolescent smoking onset (outcome variable) by decreasing perceived peer smoking prevalence and increasing confidence to say “no” to cigarettes (mediators). If mediators are present, statistical analyses can be conducted both to examine whether a program affects the outcome variables and to identify potential causal mechanisms contributing to the effectiveness of the program. By identifying how the smoking prevention program was affecting smoking onset, Maruska et al. (2016) could recommend building substance-specific skills and cognitions into future programs.

Testing indirect effects (i.e., the effect of one variable on another through intermediate variables) for statistical significance is known as statistical mediation analysis. Mediation analysis can be complicated because the sampling distribution of indirect effects is often not normally distributed, so the use of traditional normal-theory statistical tests may result in incorrect inferences (Lomnicki, 1967; Springer & Thompson, 1966). While some work has focused on tests for mediation that make distributional assumptions other than normality (e.g., Cheung, 2007; MacKinnon, Fritz, et al., 2007; MacKinnon et al., 2004), a more popular solution is to use nonparametric resampling methods, such as the bootstrap (Efron & Tibshirani, 1993; MacKinnon et al., 2004), that do not make a distributional assumption at all.

Many researchers have recommended using the bootstrap for statistical mediation analysis (e.g., Bollen & Stine, 1990; Hayes & Scharkow, 2013; Preacher & Hayes, 2004; Shrout & Bolger, 2002). MacKinnon et al. (2004) compared the performance of multiple variations of the bootstrap for testing indirect effects including the percentile, bias-corrected (BC), and accelerated BC bootstrap tests. They found that the BC bootstrap had the highest statistical power, but also elevated Type I error rates in certain conditions. In a follow-up study, Fritz et al. (2012) found the Type I error rates for the BC bootstrap were most inflated when the sample size was small. This is especially problematic given that the BC bootstrap is often used in studies with low statistical power due to small sample sizes (e.g., Fuller-Rowell et al., 2017; Lundgren et al., 2008; McManus et al., 2012; Sella et al., 2016; Tallman et al., 2007), the exact situation where the Type I error rates for the BC bootstrap are the worst. In considering sample size limitations, as may be encountered in health research, it is important to determine if the BC bootstrap can be modified in order to maintain the increased statistical power without the inflated Type I error rate, keeping it instead at the targeted .05 level. The purpose of this study is to identify and compare alternative corrections for bias in the BC bootstrap test of mediation.

The Single-Mediator Model

Statistical mediation analysis is used to examine the indirect effect within a causal sequence, that is, the effect of one variable on an outcome variable through an intermediate variable. The simple mediation model relating an independent variable X, a mediator variable M, and an outcome variable Y can be represented by regression Equations (1), (2), and (3):

$Y_{i} = β_{01} + c X_{i} + e_{1i}$ (1)

$M_{i} = β_{02} + a X_{i} + e_{2i}$ (2)

$Y_{i} = β_{03} + c' X_{i} + b M_{i} + e_{3i},$ (3)

where $β_{01}$, $β_{02}$, and $β_{03}$ are intercepts, $c$ is the total effect of X on Y, $a$ quantifies the effect of X on M, $b$ quantifies the effect of M on Y adjusted for the effect of X, $c'$ is the direct effect of X on Y accounting for the effect of M on Y, and $e_{1i}$, $e_{2i}$, and $e_{3i}$ are the unexplained or error variability. The $e_{i}$ error terms are assumed to be uncorrelated. In a sample estimate, the notation for $a$, $b$, $c$, and $c'$ would be $\hat{a}$, $\hat{b}$, $\hat{c}$, and $\hat{c}'$, respectively, where the hat denotes estimates of the parameters. Equation (1) represents the total effect model, which involves X causing Y, as illustrated in Figure 1A. Mediation occurs when M comes between X and Y such that X affects M and then M affects Y, as illustrated in Figure 1B and represented by Equations (2) and (3). In the presence of a mediator, the indirect effect of X on Y through M, also called the mediated effect, is equal to the product $ab$.¹ (The terms mediated effect and indirect effect will be used synonymously.) Judd and Kenny (1981) proved that the mediated effect can also be calculated using $ab = c - c'$, without missing data and assuming M and Y are continuous and normally distributed.
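As a minimal illustration of Equations (1) through (3), the following Python sketch estimates the four slopes by ordinary least squares and checks numerically that $\hat{a}\hat{b} = \hat{c} - \hat{c}'$. It is not the study's R syntax; the function names, the simulated data, and the slope values are assumptions made purely for illustration.

```python
import random

def cov(u, v):
    """Sample covariance of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / (len(u) - 1)

def fit_single_mediator(x, m, y):
    """Return OLS slope estimates (a, b, c, c_prime) for Equations (1)-(3)."""
    sxx, smm, sxm = cov(x, x), cov(m, m), cov(x, m)
    sxy, smy = cov(x, y), cov(m, y)
    c = sxy / sxx                       # Equation (1): Y regressed on X
    a = sxm / sxx                       # Equation (2): M regressed on X
    det = sxx * smm - sxm ** 2          # Equation (3): Y regressed on X and M
    c_prime = (smm * sxy - sxm * smy) / det
    b = (sxx * smy - sxm * sxy) / det
    return a, b, c, c_prime

# Illustrative data only: hypothetical population slopes a = b = 0.39, c' = 0.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(200)]
m = [0.39 * xi + rng.gauss(0, 1) for xi in x]
y = [0.39 * mi + rng.gauss(0, 1) for mi in m]

a_hat, b_hat, c_hat, c_prime_hat = fit_single_mediator(x, m, y)
# The product of coefficients and the difference in coefficients agree
# up to floating-point rounding when the same complete cases are used.
print(a_hat * b_hat, c_hat - c_prime_hat)
```

The equality holds algebraically for least-squares estimates computed on the same complete cases, which is why the two printed values match.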

Figure 1.
(A) Path diagram for the total effect model. (B) The single mediator model.

Assumptions

Statistical significance of the mediated effect alone is insufficient to make causal inferences, however, as $\hat{a} \hat{b}$ is only an unbiased estimate of the true causal effect when a number of assumptions are met. These assumptions include that X, M, and Y are measured without error, that each variable is measured at the correct time and temporal precedence is maintained, that the model includes the correct causal sequence, and that no confounders or other influences are omitted from the analyses; in the case where X is randomized, this means no confounders of the M to Y relationship have been omitted (James & Brett, 1984; MacKinnon, Fairchild, & Fritz, 2007). Since causal inference is only possible when all of these assumptions are met, testing these assumptions, such as conducting sensitivity analyses to quantify the potential impact of omitted confounders, is an essential piece of testing for mediation (Cox et al., 2013; Fritz et al., 2016; Tingley et al., 2014; VanderWeele, 2010).

Tests of the Mediated Effect

Judd and Kenny (1981) originally proposed the causal steps approach to detect mediation, where $\hat{c}$, $\hat{a}$, and $\hat{b}$ are tested for significance, in that order, and $\hat{c}'$ must be nonsignificant to establish full mediation. Kenny et al. (1998) further noted that in order to establish mediation, only $\hat{a}$ and $\hat{b}$ need to be significant. This is called the joint significance test (JST). MacKinnon et al. (2002) examined Type I error rates and statistical power of 14 different tests of mediation under three different categories (causal steps, difference in coefficients, and product of coefficients). They found the JST to have the best balance between Type I error rates and statistical power compared to the other causal steps approaches. A drawback to using the JST is that confidence intervals (CIs) cannot be directly calculated, because the test involves neither an estimate of $\hat{a} \hat{b}$ nor any measure of its standard error.

Calculating CIs for a Mediated Effect

CIs, on the other hand, can be constructed and have been recommended to statistically test for mediated effects (e.g., Bollen & Stine, 1990; MacKinnon et al., 2004; Sobel, 1982). One method is to assume the data are distributed normally and use the z-distribution such that CIs are calculated using Equation (4):

$CI: \hat{a} \hat{b} \pm z_{TypeIerror} \times {\hat{σ}}_{\hat{a} \hat{b}}$ (4)

where $\hat{a} \hat{b}$ is the mediated effect, $z_{TypeIerror}$ is the z critical value on a standard normal distribution corresponding to the specified $α$ Type I error rate, and ${\hat{σ}}_{\hat{a} \hat{b}}$ is an estimate of the standard error of the mediated effect, such as Sobel’s (1982) first-order standard error. Using Equation (4), a symmetric confidence limit (i.e., the estimated mediated effect lies in the exact center of the confidence limit) is calculated.
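Equation (4) can be sketched in Python as follows, using the first-order standard error $\sqrt{\hat{a}^{2} \hat{σ}_{\hat{b}}^{2} + \hat{b}^{2} \hat{σ}_{\hat{a}}^{2}}$. The function name and the input values are hypothetical and serve only to show that the resulting interval is centered on $\hat{a}\hat{b}$.

```python
from math import sqrt
from statistics import NormalDist

def sobel_ci(a_hat, b_hat, se_a, se_b, alpha=0.05):
    """Symmetric normal-theory CI for the mediated effect, as in Equation (4)."""
    ab = a_hat * b_hat
    # First-order standard error of the product of two coefficients.
    se_ab = sqrt(a_hat ** 2 * se_b ** 2 + b_hat ** 2 * se_a ** 2)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 when alpha = .05
    return ab - z_crit * se_ab, ab + z_crit * se_ab

# Hypothetical estimates and standard errors, for illustration only.
lower, upper = sobel_ci(a_hat=0.39, b_hat=0.39, se_a=0.10, se_b=0.10)
print(lower, upper)
```

Because the same margin is added and subtracted, the interval cannot reflect any skew in the sampling distribution of $\hat{a}\hat{b}$.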

The sampling distribution of the product of two random variables, such as the mediated effect $\hat{a} \hat{b}$ , has been shown to be skewed under certain conditions, so symmetric confidence limits are often biased (Bollen & Stine, 1990; Kisbu-Sakarya et al., 2014; MacKinnon et al., 2002, 2004; Meeker et al., 1981). Kisbu-Sakarya et al. (2014) further investigated how the skewness of the distribution of the product affects CI coverage and imbalance. Coverage pertains to the number of times the true mediated effect falls within the CIs, while imbalance is the number of times the true mediated effect falls to the left or right of the CIs. As skewness increases, coverage decreases while imbalance increases. MacKinnon, Fritz, et al. (2007) developed a test of mediation involving the calculation of asymmetric CIs with a Fortran program known as PRODCLIN. The program uses standardized values of $\hat{a}, \hat{b}$ , and their standard errors, and numerical integration to calculate asymmetric CIs for different Type I error levels. This method, however, is limited to the single mediator case and involves a specific computer program.

Monte Carlo method

The Monte Carlo (MC) method is another way to calculate CIs for the indirect effect (MacKinnon et al., 2004; Preacher & Selig, 2012). The MC method assumes a joint normal sampling distribution for the a and b parameters. The estimates $\hat{a}$ and $\hat{b}$, along with their respective standard errors ${\hat{σ}}_{\hat{a}}$ and ${\hat{σ}}_{\hat{b}}$, dictate the parameter values of this joint distribution. In order to create CIs using the MC method, 1,000 values of ${\hat{a}}^{*}$ and ${\hat{b}}^{*}$ are randomly drawn from this distribution. The sampled values of ${\hat{a}}^{*}$ and ${\hat{b}}^{*}$ are then multiplied to form ${\hat{a}}^{*} {\hat{b}}^{*}$, and these generated values of ${\hat{a}}^{*} {\hat{b}}^{*}$ form a sampling distribution. The percentiles (determined by the $α$-level) of the sampling distribution make up the CIs of the indirect effect.
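A minimal Python sketch of the MC procedure is given below. It assumes that the two coefficients are drawn independently (zero covariance) and uses one simple order-statistic rule for the percentiles; both choices, along with the function name and input values, are assumptions for illustration rather than the study's implementation.

```python
import random

def monte_carlo_ci(a_hat, b_hat, se_a, se_b, draws=1000, alpha=0.05, seed=0):
    """Percentile CI from simulated products of normally distributed coefficients."""
    rng = random.Random(seed)
    # Assumption: a and b are drawn independently of each other.
    products = sorted(
        rng.gauss(a_hat, se_a) * rng.gauss(b_hat, se_b) for _ in range(draws)
    )
    lo_idx = max(round(alpha / 2 * draws) - 1, 0)
    hi_idx = min(round((1 - alpha / 2) * draws) - 1, draws - 1)
    return products[lo_idx], products[hi_idx]

# Hypothetical estimates and standard errors, for illustration only.
mc_lower, mc_upper = monte_carlo_ci(0.39, 0.39, 0.10, 0.10)
print(mc_lower, mc_upper)
```

Unlike the symmetric interval of Equation (4), the two limits need not be equidistant from $\hat{a}\hat{b}$.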

Resampling

Another way to create asymmetric CIs for the mediated effect is by resampling. Given an original sample of size n, a new sample of size n is drawn from the original sample. Every time a case is drawn, it is returned to the sample so that each case has an equal chance of being selected again. This act of returning each case to the sample is known as sampling with replacement, and it is what allows an empirical sampling distribution to be created. For example, suppose an original dataset consists of [1, 2, 3, 4, 5]. On the first draw, 3 is randomly selected and then returned to the sample so that the probability of drawing 3 again remains 1/5. A sample could then be [3, 3, 4, 5, 2], where each case in the original dataset has a 1/5 probability of being selected for each draw. Resampling to create new samples is repeated a large number of times (usually upward of 1,000 new samples are taken). The statistic of interest is calculated from every sample, and the calculated statistics are then used to create an empirical sampling distribution. For example, given 1,000 samples, the mean of each sample can be calculated, resulting in 1,000 means. These 1,000 means could then be plotted as a sampling distribution.
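The example above can be reproduced with a few lines of Python. This is only a sketch of sampling with replacement; the seed and variable names are arbitrary.

```python
import random

rng = random.Random(42)
data = [1, 2, 3, 4, 5]

boot_means = []
for _ in range(1000):
    # Draw n cases with replacement; each case has probability 1/5 on every draw.
    resample = rng.choices(data, k=len(data))
    boot_means.append(sum(resample) / len(resample))

# The 1,000 means form an empirical sampling distribution of the mean.
print(min(boot_means), max(boot_means))
```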

Percentile bootstrap

A special case of sampling with replacement is the bootstrap, its name derived from the idea of “pulling oneself up by one’s bootstrap” (Efron & Tibshirani, 1993, p. 5). With bootstrapping, generation of a sampling distribution is based entirely on resampling from the original sample. With statistical mediation analysis, the original sample will consist of data for variables X, M, and Y for each participant. When a bootstrap sample is drawn from the original data, the corresponding X, M, and Y values for each participant are together considered one case. Therefore, instead of resampling each separate data value, bootstrapping for mediation analysis can be thought of as resampling participants. The mediated effect $\hat{a} \hat{b}$ is calculated for each bootstrap sample using Equations (2) and (3). The bootstrapped mediated effect estimates will also be denoted ${\hat{a}}^{*} {\hat{b}}^{*}$ for consistency. The ${\hat{a}}^{*} {\hat{b}}^{*}$ values are used to form the bootstrap sampling distribution. This bootstrap sampling distribution approximates the actual sampling distribution for $\hat{a} \hat{b}$, which can be used to generate CIs and test for statistical significance. There are many different variations of bootstrapping. We discuss two in particular that are used to calculate CIs: the percentile bootstrap and the BC bootstrap. The simpler form of bootstrapping discussed here is the percentile bootstrap, where the ${\hat{a}}^{*} {\hat{b}}^{*}$ values from the bootstrap samples are ordered from smallest to largest, and the exact percentiles corresponding to the set $α$-level are used as the upper and lower bounds. A 95% CI is calculated by finding the exact values at the 2.5th and 97.5th percentiles on the bootstrap sampling distribution.
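The percentile bootstrap for the mediated effect can be sketched in Python as follows. Participants (whole X, M, Y cases) are resampled, the product of the two slopes is recomputed for each bootstrap sample, and the 2.5th and 97.5th percentiles are read off the ordered values. The helper names, the simulated data, and the order-statistic rule for the percentiles are illustrative assumptions.

```python
import random

def cov(u, v):
    """Sample covariance of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / (len(u) - 1)

def indirect_effect(x, m, y):
    """Product of the OLS slopes a (Equation 2) and b (Equation 3)."""
    sxx, smm, sxm = cov(x, x), cov(m, m), cov(x, m)
    sxy, smy = cov(x, y), cov(m, y)
    a = sxm / sxx
    b = (sxx * smy - sxm * sxy) / (sxx * smm - sxm ** 2)
    return a * b

def percentile_bootstrap_ci(x, m, y, n_boot=1000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(x)
    reps = []
    for _ in range(n_boot):
        # Resample participants: the X, M, and Y values of a case stay together.
        idx = [rng.randrange(n) for _ in range(n)]
        reps.append(indirect_effect([x[i] for i in idx],
                                    [m[i] for i in idx],
                                    [y[i] for i in idx]))
    reps.sort()
    lower = reps[max(round(alpha / 2 * n_boot) - 1, 0)]
    upper = reps[min(round((1 - alpha / 2) * n_boot) - 1, n_boot - 1)]
    return lower, upper, reps

# Illustrative data only: hypothetical slopes a = b = 0.39 with n = 100.
data_rng = random.Random(7)
x = [data_rng.gauss(0, 1) for _ in range(100)]
m = [0.39 * xi + data_rng.gauss(0, 1) for xi in x]
y = [0.39 * mi + data_rng.gauss(0, 1) for mi in m]

pct_lower, pct_upper, boot_reps = percentile_bootstrap_ci(x, m, y)
print(pct_lower, pct_upper)
```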

BC bootstrap (z₀)

Theoretically, when all assumptions for statistical mediation analysis are met, the indirect effect is an unbiased estimator of the true indirect effect. When discussing bias in this paper, the focus is on the bootstrap sampling distributions instead of the indirect effect estimate. Bootstrap sampling distributions of $\hat{a} \hat{b}$ tend to be skewed and asymmetric, particularly for small sample sizes (Bollen & Stine, 1990; MacKinnon & Dwyer, 1993; MacKinnon, Fritz, et al., 2007; MacKinnon et al., 1995; Mallinckrodt et al., 2006; Shrout & Bolger, 2002; Stone & Sobel, 1990). Calculating symmetric CIs for asymmetric distributions results in biased CIs and low statistical power; the distribution of the product is ignored and not enough true positives are included within the CIs (MacKinnon & Dwyer, 1993; Shrout & Bolger, 2002; Stone & Sobel, 1990). The BC bootstrap corrects for bias, unlike the percentile bootstrap. Since each bootstrap sample is expected to have different frequencies of the original values, measures of central tendency for the bootstrap sampling distribution are not expected to equal the estimated value of $\hat{a} \hat{b}$. Instead, the measure of central tendency for the bootstrap sampling distribution can fall either to the left or the right of the estimated value of $\hat{a} \hat{b}$ to some degree. To illustrate, suppose $\hat{a} \hat{b} = 5$ and the median of the bootstrap sampling distribution is equal to 1; the original estimate $\hat{a} \hat{b}$ falls to the right of the median of the ${\hat{a}}^{*} {\hat{b}}^{*}$ values. Bias is then equal to 4 (the difference between 5 and 1).

The endpoints of the CIs can be recalculated to correct for bias using the bias-correction, ${\hat{z}}_{0}$. Bias here is defined as the proportion of ${\hat{a}}^{*} {\hat{b}}^{*}$ in the bootstrap sampling distribution that falls below the original estimate $\hat{a} \hat{b}$. The proportion is converted to a z-score on the standard normal distribution, such that

${\hat{z}}_{0} = Φ^{-1} \left( \frac{\# \{ {\hat{a}}^{*} {\hat{b}}^{*} < \hat{a} \hat{b} \}}{B} \right)$ (5)

where $Φ^{-1}(\cdot)$ is the inverse function of a standard normal cumulative distribution function, B is the number of bootstrap samples, and $\# \{ {\hat{a}}^{*} {\hat{b}}^{*} < \hat{a} \hat{b} \}$ is the number (#) of bootstrap replications that are less than $\hat{a} \hat{b}$. The following equations from Efron and Tibshirani (1993) are used to define the BC confidence limits:

$BC\ CI: (\hat{a} {\hat{b}}_{lo}, \hat{a} {\hat{b}}_{up}) = ({\hat{a}}^{*} {\hat{b}}^{*(ω_{1})}, {\hat{a}}^{*} {\hat{b}}^{*(ω_{2})})$ (6)

where

$ω_{1} = Φ(2 {\hat{z}}_{0} + z^{(α/2)})$ (7)

$ω_{2} = Φ(2 {\hat{z}}_{0} + z^{(1 - α/2)})$ (8)

given that $Φ(\cdot)$ is the standard normal cumulative distribution function, ${\hat{z}}_{0}$ is the correction for bias for the indirect effect, and $z^{(α/2)}$ and $z^{(1 - α/2)}$ are the $(α/2) \times 100$ and $(1 - α/2) \times 100$ percentile points of a standard normal distribution. Specifically, when $α = .05$ for a two-tailed test, the lower confidence bound is calculated using $2 {\hat{z}}_{0} - 1.96$, and the upper confidence bound is calculated using $2 {\hat{z}}_{0} + 1.96$. In essence, ${\hat{z}}_{0}$ estimates the median bias: it measures how far the median of the bootstrap sampling distribution is from the original $\hat{a} \hat{b}$. Without bias in the bootstrap sampling distribution, the proportion of ${\hat{a}}^{*} {\hat{b}}^{*}$ below $\hat{a} \hat{b}$ is .50, which results in ${\hat{z}}_{0} = 0$. Therefore, if no bias is present, the BC bootstrap and percentile bootstrap CIs will be equivalent.
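Equations (5) through (8) can be sketched in Python as follows. The bootstrap values used here are an artificial, evenly spaced set (1 through 1,000) chosen only so that the behavior is easy to follow; the function name and the order-statistic rule for picking percentiles are assumptions for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def bc_bootstrap_ci(boot_reps, estimate, alpha=0.05):
    """Bias-corrected bootstrap limits following Equations (5)-(8)."""
    reps = sorted(boot_reps)
    n_boot = len(reps)
    prop_below = sum(1 for r in reps if r < estimate) / n_boot
    # Equation (5); inv_cdf raises an error if the proportion is exactly 0 or 1.
    z0 = STD_NORMAL.inv_cdf(prop_below)
    omega1 = STD_NORMAL.cdf(2 * z0 + STD_NORMAL.inv_cdf(alpha / 2))      # Equation (7)
    omega2 = STD_NORMAL.cdf(2 * z0 + STD_NORMAL.inv_cdf(1 - alpha / 2))  # Equation (8)

    def order_stat(p):
        # Value at (approximately) the p-th quantile of the ordered replications.
        return reps[min(max(round(p * n_boot) - 1, 0), n_boot - 1)]

    return z0, (order_stat(omega1), order_stat(omega2))                  # Equation (6)

# Artificial bootstrap distribution, for illustration only.
fake_reps = list(range(1, 1001))

# Estimate at the bootstrap median: z0 = 0, so the limits match the percentile CI.
z0_none, ci_none = bc_bootstrap_ci(fake_reps, estimate=500.5)

# Estimate above the median: z0 > 0 and both limits move to the right.
z0_right, ci_right = bc_bootstrap_ci(fake_reps, estimate=600.5)
print(z0_none, ci_none, z0_right, ci_right)
```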

Issues Surrounding Current BC Bootstrap

Shrout and Bolger (2002) recommended the use of bootstrap methods for assessing mediation. They illustrated that the BC bootstrap has increased power to assess mediation for small sample sizes (N = 80 and N = 46 in their studies). Results of MacKinnon et al. (2004), in which a number of resampling methods for assessing mediation were compared, corroborated Shrout and Bolger’s (2002) findings. Although the BC bootstrap has been found to have relatively higher statistical power (see also Hayes & Scharkow, 2013; Preacher & Hayes, 2008; Williams & MacKinnon, 2008), the Type I error rate has also been found to be elevated in certain conditions. MacKinnon et al. (2004) reported that BC bootstrap Type I error rates for small sample sizes (i.e., N = 25, 50, and 100) were outside Bradley’s (1978) liberal robustness interval ($α \pm 0.5 α$) when the effect size for either a or b was nonzero while the effect size for the other regression coefficient was zero. In parsing the anomalous results, Fritz et al. (2012) found that the elevated Type I error rates reflect an interaction between path size and sample size: Type I error rates are elevated with small sample sizes given a medium or larger effect size of the nonzero path (either a or b). Cheung (2007) found coverage for the BC bootstrap overall to be close to .95, except when the effect size was small and N = 100. Efron and Tibshirani (1993) note that it is not easy to get a “good” estimate of z₀ (p. 327). Based on this, Fritz et al. (2012) suggested the need to find a better estimate of bias.

Current Study

Rather than attempt to find a correction for z₀ (i.e., correct a correction), the current study sought to find alternative measures of bias that could be implemented within the BC bootstrap that maintained the same levels of statistical power as Efron and Tibshirani’s z₀, but were less prone to elevated Type I error rates for small samples. Because ${\hat{z}}_{0}$ is based on the proportion of bootstrap estimates less than $\hat{a} \hat{b}$, and the amount of correction is zero when that proportion is exactly 50%, z₀ essentially corrects for bias by adjusting the limits of the bootstrap CI. When the original $\hat{a} \hat{b}$ is to the left of the median, ${\hat{z}}_{0} < 0$ and the BC bootstrap CI will be shifted to the left, whereas when the original $\hat{a} \hat{b}$ is to the right of the median, ${\hat{z}}_{0} > 0$ and the BC bootstrap CI will be shifted to the right. The degree to which the BC bootstrap CIs are shifted will differ depending on the level of bias. Therefore, in order to find potential alternatives to z₀ that are less affected by sample size, a review of different measures that quantify the magnitude and direction of the asymmetry of a distribution was conducted, and three potential measures that could serve as alternatives to $z_{0}$ were selected.

Alternative corrections for bias

Mean ($z_{mean}$)

Efron and Tibshirani’s (1993) z₀ converts the proportion of ${\hat{a}}^{*} {\hat{b}}^{*}$ less than $\hat{a} \hat{b}$ into a z-score on the standard normal distribution, which provides the corresponding bootstrap sampling distribution percentile as a correction for bias. Since ${\hat{z}}_{0}$ provides an estimate of how far $\hat{a} \hat{b}$ is from the median of ${\hat{a}}^{*} {\hat{b}}^{*}$, and in an asymmetric sampling distribution the mean and median will not be the same, the first correction measures how far the mean of ${\hat{a}}^{*} {\hat{b}}^{*}$ is from the median of ${\hat{a}}^{*} {\hat{b}}^{*}$. That is, the proportion of bootstrap replications that fall below the mean of the bootstrap sampling distribution, instead of below the value $\hat{a} \hat{b}$, is calculated. The mean is included as an alternative measure of bias to test how much better, worse, or similar its performance would be to the original BC bootstrap, since it is more affected by asymmetry than the median.

Winsorized mean

Given the skewed nature of the bootstrap sampling distribution, instead of discarding extreme values that may influence the mean in a certain direction, the Winsorized mean replaces extreme values with a predetermined value that will “trim” a percentage of values from the tails (Lix & Keselman, 1998). For example, given the values 1–10, the value 2 is the 10th percentile and 9 is the 90th percentile value. The 20% Winsorized mean is calculated by

$\frac{2 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 9}{10}$

where the most extreme values 1 and 10 have been replaced by the values 2 and 9, respectively, so that 20% of the data are trimmed. In conditions with non-normality and variance heterogeneity, Wilcox et al. (1998) demonstrated that combining trimmed means with bootstrapping resulted in better Type I error control. Wilcox (1995) suggests using a 20% level of trimming because it results in a reduction of the standard error without a loss in power.
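The worked example can be checked with a short Python sketch. The function below takes the proportion replaced in each tail as its argument, so the example in the text (one value replaced in each tail of ten values, 20% in total) corresponds to `per_tail=0.10`. The function name and the second, skewed data set are illustrative assumptions.

```python
def winsorized_mean(values, per_tail):
    """Mean after replacing the lowest and highest `per_tail` share of values
    with the nearest remaining value in each tail."""
    ordered = sorted(values)
    n = len(ordered)
    k = int(per_tail * n + 1e-9)          # number of values replaced in each tail
    if 2 * k >= n:
        raise ValueError("per_tail must leave at least one value unreplaced")
    low, high = ordered[k], ordered[n - k - 1]
    clipped = [min(max(v, low), high) for v in ordered]
    return sum(clipped) / n

# Values 1-10: 1 is replaced by 2 and 10 by 9, giving 55 / 10.
example = winsorized_mean(range(1, 11), per_tail=0.10)

# A hypothetical skewed data set: the extreme value 100 is pulled in to 9.
skewed = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
plain_mean = sum(skewed) / len(skewed)
robust_mean = winsorized_mean(skewed, per_tail=0.10)
print(example, plain_mean, robust_mean)
```

In the skewed case the ordinary mean is pulled toward the extreme value, while the Winsorized mean stays near the bulk of the data.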

This study will cover a progression of trimming levels in increments of 10%. If 0% trimming occurs, the CI will be equivalent to using $z_{mean}$. On the other hand, trimming 50% in each tail is equivalent to the percentile bootstrap. Therefore, 10% ($z_{win10}$), 20% ($z_{win20}$), 30% ($z_{win30}$), and 40% ($z_{win40}$) levels of trimming will be examined in order to incorporate a broad spectrum of the effect of different amounts of trimming on the estimate of bias. It is expected that the different levels of trimming will result in a progression of more accurate Type I error rates. The interest lies in which percentage of trimming provides the best Type I error rate and power without the coverage rate being affected.

Medcouple ($z_{mc}$)

Medcouple is a robust measure of skewness that “measures the (standardized) difference between the distances of $x_{j}$ and $x_{i}$ to the median” (Brys et al., 2004, p. 998). Although the medcouple is not a direct measure of central tendency, it does use a measure of central tendency (the median) to help describe the skew of the distribution. Since the medcouple can be conceptualized as a type of standardized weighted median, the reasoning is to replace z₀ with $z_{mc}$ to account for the skewness of the distribution. The medcouple is more robust to the effects of outliers than the classic measure of skewness. The median, $m_{n}$, is usually defined as

$m_{n} = \begin{cases} \frac{x_{n/2} + x_{n/2 + 1}}{2} & \text{if } n \text{ is even} \\ x_{(n + 1)/2} & \text{if } n \text{ is odd.} \end{cases}$ (9)

The medcouple, introduced by Brys et al. (2003), is defined as

${MC}_{n} = \underset{x_{i} \leq m_{n} \leq x_{j}}{\operatorname{med}} \, h(x_{i}, x_{j})$ (10)

where for all $x_{i} \neq x_{j}$ , the kernel function h is defined as

$h(x_{i}, x_{j}) = \frac{(x_{j} - m_{n}) - (m_{n} - x_{i})}{x_{j} - x_{i}}.$ (11)

Given two values $x_{j}$ and $x_{i}$, the difference between each of the values and the median is calculated. Then the difference between $(x_{j} - m_{n})$ and $(m_{n} - x_{i})$ is calculated, and the value is divided by $(x_{j} - x_{i})$ to standardize it. Because $x_{i} \leq m_{n} \leq x_{j}$, the numerator can never exceed the denominator $(x_{j} - x_{i})$ in absolute value, so ${MC}_{n}$ will always lie between $-1$ and 1. It is expected that $z_{mc}$ CIs will have Type I error rates similar to the percentile bootstrap but with higher power because it attempts to correct for skewness whereas the percentile bootstrap does not.
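Equations (9) through (11) translate into the following naive Python sketch, which evaluates the kernel for every qualifying pair. It is a direct, quadratic-time reading of the definition rather than an efficient algorithm, and it simply skips pairs with $x_{i} = x_{j}$ (ties at the median), which is a simplifying assumption. The function name and the two small data sets are hypothetical.

```python
from statistics import median

def medcouple_naive(values):
    """Median of the kernel h(x_i, x_j) over all pairs with x_i <= m_n <= x_j."""
    m_n = median(values)                              # Equation (9)
    lower = [v for v in values if v <= m_n]
    upper = [v for v in values if v >= m_n]
    kernels = [
        ((x_j - m_n) - (m_n - x_i)) / (x_j - x_i)     # Equation (11)
        for x_i in lower
        for x_j in upper
        if x_i != x_j                                 # kernel is defined for x_i != x_j
    ]
    return median(kernels)                            # Equation (10)

mc_symmetric = medcouple_naive([1, 2, 3, 4, 5])       # symmetric data
mc_right_skew = medcouple_naive([1, 2, 3, 4, 20])     # long right tail
print(mc_symmetric, mc_right_skew)
```

A symmetric data set yields a medcouple of zero, and a long right tail yields a positive value.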

Alternative measures implementation

In order to compute the BC bootstrap test of mediation using these alternative corrections for bias, each estimate of bias will replace ${\hat{z}}_{0}$ in Equations (7) and (8), and the corresponding $ω_{1}$ and $ω_{2}$ will be used in Equation (6).
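One possible reading of this substitution is sketched below in Python for the mean and Winsorized mean corrections. It assumes that each alternative z is obtained as the standard normal quantile of the proportion of bootstrap replications falling below the chosen center, which is consistent with 50% trimming reducing to the percentile bootstrap. The conversion for the medcouple is not specified in this section and is therefore omitted. All names and the artificial right-skewed bootstrap distribution are illustrative assumptions.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def winsorized_mean(values, per_tail):
    ordered = sorted(values)
    n = len(ordered)
    k = int(per_tail * n + 1e-9)                      # values replaced in each tail
    low, high = ordered[k], ordered[n - k - 1]
    return sum(min(max(v, low), high) for v in ordered) / n

def z_from_center(boot_reps, center):
    """Assumed conversion: normal quantile of the proportion of replications below `center`."""
    prop_below = sum(1 for r in boot_reps if r < center) / len(boot_reps)
    return STD_NORMAL.inv_cdf(prop_below)

def bc_limits(boot_reps, z, alpha=0.05):
    """Equations (6)-(8) with `z` substituted for the original bias correction."""
    reps = sorted(boot_reps)
    n_boot = len(reps)
    omega1 = STD_NORMAL.cdf(2 * z + STD_NORMAL.inv_cdf(alpha / 2))
    omega2 = STD_NORMAL.cdf(2 * z + STD_NORMAL.inv_cdf(1 - alpha / 2))
    pick = lambda p: reps[min(max(round(p * n_boot) - 1, 0), n_boot - 1)]
    return pick(omega1), pick(omega2)

# Artificial right-skewed bootstrap distribution, for illustration only.
skewed_reps = [i ** 2 for i in range(1, 1001)]

z_mean = z_from_center(skewed_reps, sum(skewed_reps) / len(skewed_reps))
z_win10 = z_from_center(skewed_reps, winsorized_mean(skewed_reps, 0.10))
ci_mean = bc_limits(skewed_reps, z_mean)
ci_win10 = bc_limits(skewed_reps, z_win10)
print(z_mean, z_win10, ci_mean, ci_win10)
```

In this artificial example, Winsorizing moves the center toward the median, so the resulting correction is smaller than the one based on the untrimmed mean.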

Materials and Method

Data Generation

To determine whether the proposed alternatives maintained more accurate Type I error rates compared to the BC bootstrap method, a simulation was performed using R (R Core Team, 2019). This study was an extension of Fritz et al. (2012), so the first four factors that were varied in the previous study were also varied in this study to focus on the effects of the bias adjustments and alternatives.

The first factor that was varied was the test of mediation. The original bias-correction, z₀, was compared to $z_{mean}$, $z_{win10}$, $z_{win20}$, $z_{win30}$, $z_{win40}$, and $z_{mc}$. In addition, the JST, percentile bootstrap, and MC tests served as benchmark methods for power and Type I error. The JST was included because Hayes and Scharkow (2013) found that researchers were still using it as a way to detect mediation. The MC method used 1,000 generated values for each replication. The second and third factors were the path effect sizes of a and b. Based on Cohen’s (1988) guidelines for small, medium, and large effect sizes, a and b were each set to 0, 0.14, 0.39, or 0.59 and fully crossed, forming 16 different effect size combinations: seven with a null indirect effect (i.e., a, b, or both were equal to 0) and nine with a non-null indirect effect (i.e., neither a nor b was equal to 0). The path effect size for $c'$ was fixed to zero because the size of $c'$ does not affect the estimate of $\hat{a} \hat{b}$ (Fritz & MacKinnon, 2007). The fourth factor was sample size, selected to represent a range of sample sizes that may be encountered in the health sciences: 50, 100, 250, 500, and 1,000. Fritz et al. (2012) considered an additional sample size of 2,500 but found that as sample sizes approached 2,500, Type I error rates returned to .05 for all tests including the BC bootstrap using z₀. The fifth factor that Fritz et al. (2012) examined, number of bootstrap samples, was not varied in this study due to the finding that the number of bootstrap samples did not affect Type I error rates. Therefore, the number of bootstrap samples was set at 1,000 for the present study.

The R function rnorm() was used to generate n X observations, dependent on the corresponding sample size. Values for M and Y were generated using the set values for a and b (as variations of the second and third factors) through Equations (2) and (3). Residuals were also generated using the rnorm() function for M and Y, and the sd argument incorporated the effect size of a and b as part of its calculations (i.e., the standard deviation for the distribution from which the residuals for M were drawn was calculated using $\sqrt{1 - a^{2}}$, while the standard deviation for the distribution of residuals for Y was calculated using $\sqrt{1 - b^{2}}$). Each original sample was generated according to the parameters set by each of the varied factors described above, and the 95% CI for that sample was calculated. The original sample was bootstrapped 1,000 times, each method was applied, and the CIs were calculated for each bootstrap sample. The process was repeated 1,000 times, generating 1,000 replications. The R data generation and simulation syntax files are provided as online Supplemental Material.
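The data-generating step can be sketched in Python as an analogue of the rnorm() calls described above. This is not the supplemental R syntax; it assumes a standard normal X and zero intercepts, and the function name and seed are hypothetical. With the residual standard deviations given in the text and $c' = 0$, M and Y each have a population variance of 1.

```python
import random
from math import sqrt

def generate_sample(n, a, b, seed=None):
    """Generate one sample of (X, M, Y) under Equations (2) and (3) with c' = 0."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]                       # assumed standard normal X
    m = [a * xi + rng.gauss(0, sqrt(1 - a ** 2)) for xi in x]     # residual SD sqrt(1 - a^2)
    y = [b * mi + rng.gauss(0, sqrt(1 - b ** 2)) for mi in m]     # residual SD sqrt(1 - b^2)
    return x, m, y

def variance(v):
    mean_v = sum(v) / len(v)
    return sum((vi - mean_v) ** 2 for vi in v) / (len(v) - 1)

# A large sample is used here only to check the implied variances of M and Y.
x_big, m_big, y_big = generate_sample(n=20000, a=0.39, b=0.59, seed=3)
print(variance(m_big), variance(y_big))
```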

Outcome Variables

Out of 1,000 replications for each combination of parameter values, the rejection rate was the proportion of replications in which zero was outside the CI. The rejection rate is the Type I error rate when the population effect $ab = 0$ and statistical power when $ab \neq 0$. Coverage was the number of times the true mediated effect fell within the CIs and was coded “0” when ab fell outside the CI and “1” when ab fell inside the CI. Imbalance was measured by the number of times the CI fell to the left or right of ab. A proportion for imbalance was calculated by dividing the number of times CIs fell to the left by the total number of CI misses (i.e., left misses/[right misses + left misses]). A value under .50 indicates that a larger proportion of CIs fell to the right of ab than to the left, while a value over .50 indicates that the CIs fell to the left more often. Syntax used to code and restructure (i.e., from wide to long format) simulation data is provided as online Supplemental Material.
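The three outcome measures can be computed from a set of CIs as in the following Python sketch. The five intervals and the true effect of 0.10 are made up for illustration, and the function name is hypothetical. A CI is counted as a left miss when it lies entirely below ab and as a right miss when it lies entirely above ab.

```python
def summarize_cis(cis, true_ab):
    """Return (rejection rate, coverage, imbalance) for a list of (lower, upper) CIs."""
    n = len(cis)
    rejections = sum(1 for lo, hi in cis if lo > 0 or hi < 0)    # zero outside the CI
    covered = sum(1 for lo, hi in cis if lo <= true_ab <= hi)    # coded 1 when ab is inside
    left_misses = sum(1 for lo, hi in cis if hi < true_ab)       # CI falls to the left of ab
    right_misses = sum(1 for lo, hi in cis if lo > true_ab)      # CI falls to the right of ab
    misses = left_misses + right_misses
    imbalance = left_misses / misses if misses else None         # undefined with no misses
    return rejections / n, covered / n, imbalance

# Made-up intervals, for illustration only.
toy_cis = [(-0.10, 0.20), (0.05, 0.30), (0.20, 0.40), (-0.30, 0.05), (-0.20, 0.02)]
rejection_rate, coverage, imbalance = summarize_cis(toy_cis, true_ab=0.10)
print(rejection_rate, coverage, imbalance)
```

Here two of five intervals exclude zero, two contain the true effect, and two of the three misses fall to the left, so the imbalance proportion is above .50.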

Results

Additional tables and figures presenting the results from the following sections are presented as online Supplemental Material.

Type I Error

Figure 2 displays the Type I error rates of the alternative corrections for bias compared to the percentile bootstrap and z₀. These methods were selected for presentation in this figure in order to highlight the similarities of z₀, $z_{mc}$, and $z_{mean}$ as well as the pattern of the Winsorized mean corrections. The percentile bootstrap method was included in this figure as it also represents a level of trimming with regard to Winsorized means (i.e., 50% trimming). The results for all benchmark methods (JST, MC, and percentile bootstrap) are presented in Supplemental Material Figure 1 along with four corrections for bias methods (namely z₀, $z_{mean}$, $z_{win10}$, and $z_{win30}$). The four corrections for bias methods were selected for Supplemental Material Figure 1 to represent the spectrum of Winsorized mean trimming compared to the benchmark methods. Bradley’s (1978) liberal robustness criterion has a range of $α \pm 0.5 α$, while the stringent criterion has a range of $α \pm 0.1 α$. Therefore, given $α$ = .05, the range for liberal robustness is [.025, .075], and the range for stringent robustness is [.045, .055]. The following results are streamlined to focus on the Type I error rates that fall outside of the liberal robustness criterion. Since the Type I error rate requires either a, b, or both to be equal to zero, the cases where b = 0 are reported in this section. The values are presented in Supplemental Material Table 1. When examining a = 0 and the different sizes of b, overall patterns between methods were similar to when b = 0. The results are presented in Supplemental Material Figure 2 but not discussed in order to avoid redundancy.
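The two robustness ranges follow directly from $α \pm 0.5 α$ and $α \pm 0.1 α$, as the short Python sketch below shows. The function names and the example rates of .06 and .08 are hypothetical.

```python
def bradley_interval(alpha=0.05, tolerance=0.5):
    """Robustness range alpha +/- tolerance * alpha (0.5 = liberal, 0.1 = stringent)."""
    return alpha - tolerance * alpha, alpha + tolerance * alpha

def is_robust(rate, alpha=0.05, tolerance=0.5):
    lower, upper = bradley_interval(alpha, tolerance)
    return lower <= rate <= upper

liberal = bradley_interval(tolerance=0.5)      # approximately (.025, .075)
stringent = bradley_interval(tolerance=0.1)    # approximately (.045, .055)

# A hypothetical Type I error rate of .06 is liberally but not stringently robust.
print(liberal, stringent, is_robust(0.06), is_robust(0.06, tolerance=0.1))
```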

Figure 2.
Type I error rates by effect and sample size between the percentile bootstrap and corrections for bias methods. Note. Type I error rates are displayed as the effect of sample size and the effect size of a. Each of the four panels represents a different effect size of a. In this figure, b = 0 for all conditions. “percentile” = percentile bootstrap method, “z0” = z ₀, “zmean” = $z_{m e a n}$ , “zmc” = $z_{m c}$ , “zwin10” = $z_{w i n 10}$ , “zwin20” = $z_{w i n 20}$ , “zwin30” = $z_{w i n 30}$ , “zwin40” = $z_{w i n 40}$ . Values in the shaded regions fall within Bradley’s (1978) liberal robustness criteria. The gray horizontal line at 0.05 represents the targeted alpha level.

When a = 0 and b = 0, the Type I error rates for all methods fell outside the robustness criteria. Robustness across methods varied by sample size when a = .14. For n = 50, all methods were outside of the robustness range. As the sample size increased to n = 100, the methods JST, MC, percentile bootstrap, $z_{w i n 20}$ , $z_{w i n 30}$ , and $z_{w i n 40}$ all fell outside the robustness range. The opposite was observed for n = 500 where z ₀, $z_{m c}$ , and $z_{m e a n}$ were instead outside of the robustness range. In both cases, $z_{w i n 10}$ remained within the range. As the size of a increased, the robustness criteria were met more consistently by each of the methods. When a = .39 and n = 50, only z ₀ failed to meet the robustness criteria. Given n = 100, however, z ₀, $z_{m c}$ , $z_{m e a n}$ , $z_{w i n 10}$ , and $z_{w i n 20}$ all had Type I error rates outside the robustness range. Finally, when a = .59, Type I error rates all fell outside the robustness range when n = 50, with the exception of JST and MC.

Overall, z ₀, $z_{m c}$ , and $z_{m e a n}$ all had very similar Type I error rates. The size of the Type I error rates, however, fluctuated, as z ₀ had the largest Type I error rate for some parameter combinations, while for others $z_{m e a n}$ was the largest. Together, z ₀, $z_{m c}$ , and $z_{m e a n}$ tended to either all be robust or all be outside the robustness range, depending on the parameter combination. As the level of trimming increased for the Winsorized means, Type I error rates decreased. Type I error rates for the percentile bootstrap were typically smaller than those for $z_{w i n 40}$ . Finally, $z_{m c}$ had higher Type I error rates than the percentile bootstrap method.

Power

Generally, and as expected, power increased for all tested methods as sample size increased. Differences in power were a function of sample and effect size. Power for the alternative corrections for bias compared to the percentile bootstrap and z ₀ is displayed in Supplemental Material Figure 3. Power for all benchmark methods (JST, MC, and percentile bootstrap) along with four corrections for bias methods (namely z ₀, $z_{m e a n}$ , $z_{w i n 10}$ , and $z_{w i n 30}$ ) is displayed in Supplemental Material Figure 4. Collapsing across levels of b did not change the overall patterns observed across methods. Therefore, the results for power collapsed across levels of b are reported in Supplemental Material Table 2. This section is split into two parts. The first part discusses conditions under which an adequate power of .80 occurred. The second part covers conditions under which power did not reach .80.

The parameter combinations under which all methods had power $\geq$ .80 are as follows. The first combination was with n = 1000, regardless of effect size. The second combination was when n = 500 and b = .39 or .59 across all levels of a. Third, there was adequate power when n = 500, a = .39, and b = .14. Fourth, all methods had adequate power when n = 100 or 250 and when a and b were both a medium or large effect size (i.e., a = b = .39; a = b = .59; a = .39, b = .59; and a = .59, b = .39). The fifth combination was when n = 50, a = .39 or .59, and b = .59. The only parameter combination in which only z ₀, $z_{m c}$ , $z_{m e a n}$ , and $z_{w i n 10}$ had adequate power (while the other methods did not) was when n = 500 and a = b = .14. Supplemental Material Figure 5 displays an isolated graph of power for a = b = .14 by sample size—the parameter combination with the most variation in power.

Power for all methods was below the desired .80 for n = 50, 100, or 250 when either a or b = .14, across all levels of the other. The second parameter combination where all methods were underpowered was when b = .39 and n = 50, across all levels of a. The final parameter combination where all methods were underpowered was when a = .59, b = .14, and n = 500.

The z ₀, $z_{m c}$ , and $z_{m e a n}$ methods all had similar power, though the order for most power varied by parameter combinations. For the Winsorized means as corrections for bias, as the level of trimming increased, power decreased. The $z_{m e a n}$ correction consistently had the most power, while $z_{w i n 40}$ had the least. Additionally, power for the percentile bootstrap was typically less than $z_{w i n 40}$ .

Coverage and Imbalance

The nominal coverage level is .95. Bradley’s (1978) robustness criterion for coverage has a range of [.925, .975]. Coverage for JST was not considered as it is not a method for constructing CIs and thus did not fit the coverage definition. Collapsing across levels of b did not change the overall observed patterns of coverage levels across methods (reported in Supplemental Material Table 3). As depicted in Supplemental Material Figure 6, coverage tended to converge toward .95 as sample size and effect size increased. The remainder of this paragraph reports cases where the methods did not meet the robustness criterion for coverage. First, most coverage rates falling outside the range occurred when n = 50. When a = b = .14, the MC, percentile bootstrap, $z_{m c}$ , $z_{m e a n}$ , and $z_{w i n 40}$ methods all failed to meet the robustness criterion. Given a = .14 and b = .39, all methods (with the exception of MC and percentile bootstrap) fell outside the robustness range. For a = .14 and b = .59, only z ₀, $z_{m c}$ , $z_{m e a n}$ , and $z_{w i n 10}$ failed to meet the criterion. The same methods were outside the range when a = .39 and b = .14. When a = b = .14 and n = 100, z ₀, $z_{m c}$ , $z_{m e a n}$ , $z_{w i n 10}$ , and $z_{w i n 20}$ were outside the range.

Coverage levels for z ₀, $z_{m c}$ , and $z_{m e a n}$ were consistently similar. With the Winsorized mean methods, as trimming increased, coverage also increased for n = 50. The same pattern was observed with a = .14 and n = 100. Under these parameter combinations, the percentile bootstrap had the highest coverage compared to the trimmed means. Comparatively, $z_{m c}$ had higher coverage rates than the percentile bootstrap. As sample size increased and coverage converged toward .95, the differences between methods were not as distinguishable.

When the CI does not contain ab, the true mediated effect can either fall to the left or to the right of the CI. Similar to coverage, imbalance for JST was not considered. Imbalance results, collapsed across levels of b, are reported in Supplemental Material Table 4. Values above .50 indicate a larger proportion of CIs fell to the left of ab than the right, while values under .50 indicate more CIs fell to the right. The results for imbalance are displayed in Supplemental Material Figure 7. The methods tend to be imbalanced in the same direction (i.e., most were collectively either above or below .50 for a given parameter combination). The most fluctuation in imbalance was observed when a = b = 0. The fluctuation was due to the low Type I error rates for this effect size combination. Given how proportion for imbalance was calculated, misses only to the left with no misses to the right would equal a proportion of one. On the other hand, misses only to the right with no misses to the left would equal a proportion of zero. Therefore, with a small number of misses to begin with, the probability of the CIs only falling to either side of ab was high. Out of 720 effect size of a $\times$ effect size of b $\times$ sample size $\times$ method combinations, CIs in 420 cases fell to the left more often than to the right, 278 cases fell more often to the right, and 22 cases fell an equal number of times to either side of the true mediated effect.

Empirical Example

To illustrate the performance of the corrections for bias methods compared to the benchmark methods, these methods are applied to data from the Athletes Training and Learning to Avoid Steroids (ATLAS) program (Goldberg et al., 1996). The ATLAS program was designed to prevent high school football players from using anabolic androgenic steroids (AAS) by presenting healthy nutrition behaviors and appropriate strength training as direct alternatives to AAS use. Players received the ATLAS program over the course of the football season. Participants were measured at three time points (at the start of the football season, 3 months later at the end of the football season, and at a 1-year follow-up) on numerous potential mediating variables and three outcomes: intentions to use anabolic steroids, nutrition behaviors, and strength training self-efficacy. Because of the nature of bootstrapping, incomplete cases were deleted, resulting in complete data for 731 students from Cohort 1 being used for the current example. In this example, the collected data are treated as the population for illustration purposes only. As such, the terms “power,” “Type I error rate,” and “performance” are used only for illustration. In practice, true population values would not be known to researchers, so these terms would not be used in association with researchers’ collected sample data. Syntax is provided as Supplemental Material.

Power

MacKinnon et al. (2001) examined 12 possible mediators of the ATLAS program and found that the relation between participation in the ATLAS program (X) and a student’s intentions to use anabolic steroids measured 9 months after finishing the ATLAS program (Y) is mediated by a student’s perceived susceptibility to the adverse effects of steroid use immediately after completing the ATLAS program (M). The estimated values for the entire sample of 731 students are $\hat{a}$ = 0.595, $\hat{b}$ = −0.096, $\hat{a} \hat{b}$ = −0.057. The calculated CIs are reported in the top half of Table 1. None of the CIs contained zero, indicating there was a significant mediator. If we treat the sample of 731 as the population, then there is an effect and any nonsignificant test would be a Type II error.

Table 1.
Confidence Intervals From the Empirical Example.

Method	Population 95% CI	Sample 95% CI
Power
MC	[−0.099, −0.022]	[−0.680, 0.033]
PERC	[−0.105, −0.024]	[−0.684, 0.019]
$z_{0}$	[−0.110, −0.025]	[−0.748, −0.007]
$z_{m c}$	[−0.111, −0.026]	[−0.781, −0.020]
$z_{m e a n}$	[−0.111, −0.026]	[−0.750, −0.010]
$z_{w i n 10}$	[−0.111, −0.025]	[−0.736, −0.003]
$z_{w i n 20}$	[−0.109, −0.024]	[−0.721, 0.002]
$z_{w i n 30}$	[−0.106, −0.024]	[−0.709, 0.005]
$z_{w i n 40}$	[−0.105, −0.024]	[−0.690, 0.013]
Type I Error
MC	[−0.012, 0.011]	[−0.044, 0.252]
PERC	[−0.011, 0.011]	[−0.013, 0.223]
$z_{0}$	[−0.012, 0.010]	[0.000, 0.254]
$z_{m c}$	[−0.012, 0.010]	[0.003, 0.262]
$z_{m e a n}$	[−0.012, 0.010]	[0.003, 0.261]
$z_{w i n 10}$	[−0.012, 0.010]	[−0.002, 0.249]
$z_{w i n 20}$	[−0.012, 0.010]	[−0.004, 0.246]
$z_{w i n 30}$	[−0.012, 0.010]	[−0.007, 0.242]
$z_{w i n 40}$	[−0.011, 0.011]	[−0.010, 0.229]

Note. Confidence intervals produced by each tested method for the population compared to the randomly selected sample; CI = confidence interval, MC = Monte Carlo, PERC = percentile bootstrap method.

To illustrate the difference in performance across methods, a random subsample of 50 students was selected from the original sample. The estimated values for the subsample are $\hat{a}$ = 0.892, $\hat{b}$ = −0.300, $\hat{a} \hat{b}$ = −0.267. The MC, percentile bootstrap, $z_{w i n 20}$ , $z_{w i n 30}$ , and $z_{w i n 40}$ CIs all contain zero, indicating perceived susceptibility was not a mediator in the subsample, which is a Type II error. On the other hand, z ₀, $z_{m c}$ , $z_{m e a n}$ , and $z_{w i n 10}$ correctly indicated perceived susceptibility to be a significant mediator.

Type I Error

In contrast, MacKinnon et al. (2001) found the relation between participation in the ATLAS program (X) and a student’s nutrition behaviors measured 9 months after finishing the ATLAS program (Y) is not mediated by a student’s normative beliefs concerning steroid use immediately after completing the ATLAS program (M). The estimated values for the entire sample of 731 students are $\hat{a}$ = −0.153, $\hat{b}$ = 0.001, $\hat{a} \hat{b}$ = −0.0002. The indirect effect, though not exactly equal to zero, is treated as a null indirect effect to illustrate the different findings between methods. The calculated CIs for these variables are reported in the bottom half of Table 1. All CIs contained zero, indicating a nonsignificant mediator. Again, treating the sample of 731 students as the population, this means there is not an effect so any statistically significant test is a Type I error.

Taking a different random subsample of 50 students resulted in estimated values of $\hat{a}$ = −0.475, $\hat{b}$ = −0.152, $\hat{a} \hat{b}$ = 0.072. The benchmark methods, along with the Winsorized mean methods ( $z_{w i n 10}$ , $z_{w i n 20}$ , $z_{w i n 30}$ , and $z_{w i n 40}$ ), correctly indicated a nonsignificant mediator. The remaining corrections for bias methods (z ₀, $z_{m c}$ , and $z_{m e a n}$ ), however, indicated a significant mediator, which is a Type I error.
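The decisions described in both parts of the example can be read directly off the subsample CIs in Table 1. The short sketch below copies those limits and flags which methods exclude zero; the names `POWER_PANEL`, `NULL_PANEL`, `excludes_zero`, and `significant_methods` are hypothetical. One assumption is needed: the z ₀ lower limit in the Type I error panel is printed as 0.000, presumably a rounded positive value, so a limit exactly at zero is treated here as excluding zero in order to match the decision reported in the text.

```python
from typing import Dict, List, Tuple

CI = Tuple[float, float]

# Subsample (n = 50) 95% CIs copied from Table 1.
POWER_PANEL: Dict[str, CI] = {
    "MC": (-0.680, 0.033), "PERC": (-0.684, 0.019), "z0": (-0.748, -0.007),
    "zmc": (-0.781, -0.020), "zmean": (-0.750, -0.010), "zwin10": (-0.736, -0.003),
    "zwin20": (-0.721, 0.002), "zwin30": (-0.709, 0.005), "zwin40": (-0.690, 0.013),
}
NULL_PANEL: Dict[str, CI] = {
    "MC": (-0.044, 0.252), "PERC": (-0.013, 0.223), "z0": (0.000, 0.254),
    "zmc": (0.003, 0.262), "zmean": (0.003, 0.261), "zwin10": (-0.002, 0.249),
    "zwin20": (-0.004, 0.246), "zwin30": (-0.007, 0.242), "zwin40": (-0.010, 0.229),
}


def excludes_zero(ci: CI) -> bool:
    """True when zero is not strictly inside the interval."""
    lower, upper = ci
    # Assumption: a limit printed as exactly 0.000 counts as excluding zero.
    return not (lower < 0.0 < upper)


def significant_methods(panel: Dict[str, CI]) -> List[str]:
    """Names of the methods whose CI excludes zero, in table order."""
    return [name for name, ci in panel.items() if excludes_zero(ci)]


correct_detections = significant_methods(POWER_PANEL)  # true effect present
false_positives = significant_methods(NULL_PANEL)      # effect treated as null
```

In the power panel the flagged methods are the ones that avoid a Type II error; in the null panel the flagged methods are the ones that commit a Type I error.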

Discussion

The purpose of this study was to address the need to consider alternative corrections for bias that would produce CIs with more accurate Type I error rates without losing power. The original BC bootstrap test of mediation had elevated Type I error rates outside the robustness range in conditions with small sample sizes (i.e., n $\leq$ 100) and medium or large effect sizes of the non-zero path. Considering all effect sizes, the original BC bootstrap also had the highest power compared to the other methods where n $\leq$ 100. Increased power with elevated Type I error rates was observed once again, which corroborates results from previous studies (e.g., Cheung, 2007; Fritz et al., 2012; MacKinnon et al., 2004; Preacher & Selig, 2012). As with previous studies, the percentile bootstrap had the lowest Type I error rates along with the lowest power. Among the BC bootstrap methods, the medcouple and mean produced similar levels of Type I error rates and power compared to the original BC bootstrap. In the case of Winsorized means as corrections for bias, increasing trimming percentages decreased Type I error rates while also decreasing power. As evidenced by the results, Type I error and power are inextricably tied; results for one outcome cannot be analyzed and considered independently from the other. Furthermore, these two outcomes vary by sample and effect size. As effect and sample size increased, Type I error rates and power were more likely to reach ideal levels.

The results of this study can be informed by Kisbu-Sakarya et al. (2014) with regard to the skewness of the bootstrap sampling distribution. They found that an increase in skewness leads to a decrease in coverage. One plausible explanation for decreased Type I error rates with the Winsorized means was that effects of a skewed product distribution were mitigated by redistributing tail weights. Without skew in the product distribution, Winsorized means would be similar to the regular mean as the standard error would be reduced (Wilcox et al., 1998). Relatedly, for the Winsorized means as corrections for bias, power is greater than the percentile bootstrap because the sample size is preserved; instead of eliminating outliers, the outliers are assigned a floor or ceiling value. The measure of central tendency used to adjust the confidence limits thus falls somewhere between the mean as a correction for bias (which is similar to the original bias correction) and the percentile bootstrap method. As a result, the Type I error rates and power also fall between those of the mean correction and percentile bootstrap methods.

Recommendation

The simulation study provides the opportunity to control the true population effect sizes (i.e., the effect sizes for a, b, and ab) to observe the corresponding patterns for Type I error, imbalance, power, and coverage. In reality, it is not so simple for researchers to have estimates of the true population effect sizes. Sample size, however, is a factor researchers can account for when selecting a method for CI construction. Overall, the benchmark methods (i.e., joint significance test, Monte Carlo, and percentile bootstrap) had the most consistently robust Type I error rates. Most methods had robust Type I error rates as well as optimal power and coverage when the sample size was greater than or equal to 250, so concerning larger sample sizes, the method of choice will not be associated with different Type I error rates, power, or coverage. The main differences occur with sample sizes smaller than 250.

The recommendations made in this paper echo those of previous studies. If health professionals and researchers using intervention data have a small sample size and are most concerned about not committing a Type I error, it would be advisable to use a benchmark method. Specifically, the Monte Carlo method has robust Type I error rates and higher coverage rates than the other methods. On the other hand, if researchers are most concerned about not committing a Type II error, the original BC bootstrap has the highest power. Finally, using Winsorized means as alternative corrections for bias offers varying levels of compromise between the elevated Type I error rate of the original BC bootstrap and the low power of the benchmark methods. Depending on health professionals’ and researchers’ cost functions and research goals, different trimming levels may be considered. As the percentage of trimming increases, Type I error rate and power both decrease. Accordingly, Type I error and power for the 10% Winsorized mean are most similar to the original BC bootstrap, while the 40% Winsorized mean is most similar to the Monte Carlo method. Therefore, trimming percentages in between will offer a balance in the Type I versus Type II error tradeoff. Because Winsorized means are incorporated as a correction for bias, creating Winsorized BC bootstrap CIs is a multistep process that requires bootstrapping the data before calculating the Winsorized mean. The way Winsorized means are calculated depends on the software. Example syntax for R and SAS is provided in Supplemental Appendices A and B, respectively. The 30% Winsorized mean is incorporated in the examples, but this level of trimming can be changed to fit the researcher’s needs.
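The multistep process described above (bootstrap first, then compute the Winsorized mean, then adjust the percentile limits) might look roughly like the following. This is a hedged sketch in Python rather than the authors' R or SAS syntax from the Supplemental Appendices, and every function name in it is hypothetical. The limits follow Efron and Tibshirani's bias-corrected formula, with the correction computed as the standard normal quantile of the share of bootstrap estimates falling below a reference value. Using the sample estimate as that reference gives the original z ₀. Substituting the Winsorized mean of the bootstrap estimates is an assumption made here for illustration: it is consistent with 50% trimming reducing to the percentile bootstrap, but the exact definition used in the study is not restated in this section. The simulated data, the tail-count rounding, and the percentile indexing rule are also illustrative choices.

```python
import math
import random
from statistics import NormalDist
from typing import List, Sequence, Tuple

STD_NORMAL = NormalDist()


def winsorized_mean(values: Sequence[float], trim: float) -> float:
    """Mean after pulling the lowest and highest `trim` share in to the nearest kept value."""
    xs = sorted(values)
    n = len(xs)
    k = math.floor(trim * n + 1e-9)  # number of values replaced in each tail
    if 2 * k >= n:
        raise ValueError("trim must leave at least one value untouched")
    core = xs[k:n - k]
    return (k * core[0] + sum(core) + k * core[-1]) / n


def indirect_effect(x: Sequence[float], m: Sequence[float], y: Sequence[float]) -> float:
    """ab from M ~ X (slope a) and Y ~ X + M (partial slope b of M), by least squares."""
    n = len(x)
    mx, mm, my = sum(x) / n, sum(m) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    smm = sum((mi - mm) ** 2 for mi in m)
    sxm = sum((xi - mx) * (mi - mm) for xi, mi in zip(x, m))
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    smy = sum((mi - mm) * (yi - my) for mi, yi in zip(m, y))
    a = sxm / sxx
    b = (sxx * smy - sxm * sxy) / (sxx * smm - sxm ** 2)
    return a * b


def bootstrap_ab(x, m, y, n_boot: int = 1000, seed: int = 1) -> List[float]:
    """Resample cases with replacement and re-estimate ab each time."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(indirect_effect([x[i] for i in idx],
                                         [m[i] for i in idx],
                                         [y[i] for i in idx]))
    return estimates


def bc_interval(boot: Sequence[float], center: float, alpha: float = 0.05) -> Tuple[float, float]:
    """Bias-corrected limits; the correction is based on the share of estimates below `center`."""
    xs = sorted(boot)
    n_boot = len(xs)
    share = sum(1 for v in xs if v < center) / n_boot
    share = min(max(share, 0.5 / n_boot), 1 - 0.5 / n_boot)  # guard: keeps inv_cdf finite
    z = STD_NORMAL.inv_cdf(share)
    limits = []
    for tail in (alpha / 2, 1 - alpha / 2):
        q = STD_NORMAL.cdf(2 * z + STD_NORMAL.inv_cdf(tail))
        limits.append(xs[min(n_boot - 1, max(0, math.ceil(q * n_boot) - 1))])
    return limits[0], limits[1]


# Illustrative simulated data (n = 50, medium a and b); not the ATLAS data.
rng = random.Random(2021)
x = [rng.gauss(0, 1) for _ in range(50)]
m = [0.39 * xi + rng.gauss(0, 1) for xi in x]
y = [0.39 * mi + rng.gauss(0, 1) for mi in m]
boot = bootstrap_ab(x, m, y)
ci_original = bc_interval(boot, indirect_effect(x, m, y))  # Efron and Tibshirani's z0
ci_win30 = bc_interval(boot, winsorized_mean(boot, 0.30))  # assumed Winsorized variant
```

Changing the `0.30` argument selects a different trimming level. Under the assumption above, heavier trimming moves the reference value toward the bootstrap median, so the interval moves toward the percentile bootstrap limits.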

Limitations and Future Directions

As with any study, there were limitations to the design and execution of this study. First, missing data were not considered as a factor. An assumption for this paper was that the data would not have any missingness. In practice, missing data is a common occurrence, so a next step is to explore what percentage of missingness would be acceptable for producing unbiased estimates. Since Fritz et al. (2012) did not find an effect of the number of bootstraps on Type I error, the number of bootstraps used for this paper was 1,000. A future study could consider testing different numbers of bootstraps, however, to examine whether the Winsorized means as corrections for bias would be affected.

The goal of this study was to determine if alternative corrections for bias would maintain increased statistical power of the BC bootstrap with small samples, while also maintaining nominal Type I error rates when calculating CIs for ab. Although this paper did not directly achieve this goal, given the relationship between power and Type I error, ideally both Type I error rate and power would be optimized. In this case, the Winsorized means did help to optimize both. Although the Winsorized means did not have as accurate Type I error rates as the Monte Carlo method or as good statistical power as the original BC bootstrap, the Winsorized means split the difference. Imbalance, however, as observed in previous research (e.g., Bollen & Stine, 1990; Kisbu-Sakarya et al., 2014; MacKinnon et al., 2004) was still present, regardless of which correction for bias was used. This suggests that lower statistical power observed in the BC bootstrap may be related to asymmetry. Trimming for Winsorized means is not required to be symmetric. Therefore, a natural extension for the Winsorized means is to consider asymmetric trimming using the Winsorized means for skewed data instead of trimming from both tails equally, which raises the question of whether Type I error rates and power could be further improved for mediation analysis.

Supplemental Material

Supplemental Material, sj-docx-1-ehp-10.1177_01632787211024356 for Comparing Alternative Corrections for Bias in the Bias-Corrected Bootstrap Test of Mediation by Donna Chen and Matthew S. Fritz in Evaluation & the Health Professions

Supplemental Material, sj-r-1-ehp-10.1177_01632787211024356 for Comparing Alternative Corrections for Bias in the Bias-Corrected Bootstrap Test of Mediation by Donna Chen and Matthew S. Fritz in Evaluation & the Health Professions

Supplemental Material, sj-r-2-ehp-10.1177_01632787211024356 for Comparing Alternative Corrections for Bias in the Bias-Corrected Bootstrap Test of Mediation by Donna Chen and Matthew S. Fritz in Evaluation & the Health Professions

Footnotes

Acknowledgments

The authors would like to thank Jim Bovaird, Lorey Wheeler, and Karen Alexander for providing feedback for early drafts of this manuscript. Additional thanks to Jayden Nord for providing programming guidance.

Data Availability Statement

The syntax and the simulated data for this study are available as online supplemental materials. The empirical data that support the findings of this study are available from Dr. David MacKinnon and Dr. Linn Goldberg, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Drs. Goldberg and MacKinnon.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported in part by a grant from the National Institute on Drug Abuse (DA 009757).

ORCID iD

Donna Chen

Supplemental Material

Supplemental material for this article is available online.

References

Bates

Eddelbuettal

(2013). Fast and elegant numerical linear algebra using the RcppEigen Package. Journal of Statistical Software, 52(5), 1–24. http://www.jstatsoft.org/v52/i05/

Bollen

K. A.

Stine

(1990). Direct and indirect effects: Classical and bootstrap estimates of variability. Sociological Methodology, 20, 115–140. https://doi.org/10.2307/271084

Bradley

J. V.

(1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. https://doi.org/10.1111/j.2044-8317.1978.tb00581.x

Brys

Hubert

Struyf

(2003). A comparison of some new measures of skewness. In Dutter

Filzmoser

Gather

Rousseeuw

P. J.

(Eds.), Developments in robust statistics (pp. 98–113). Physica. https://doi.org/10.1007/978-3-642-57338-5_8

Brys

Hubert

Struyf

(2004). A robust measure of skewness. Journal of Computational and Graphical Statistics, 13(4), 996–1017. https://doi.org/10.1198/106186004X12632

Cheung

M. W. L.

(2007). Comparison of approaches to constructing confidence intervals for mediating effects using structural equation models. Structural Equation Modeling: A Multidisciplinary Journal, 14(2), 227–246. https://doi.org/10.1080/10705510709336745

Cohen

(1988). Statistical power analyses for the behavioral sciences (2nd ed.). Erlbaum.

Cox

M. G.

Kisbu-Sakarya

Miočević

MacKinnon

D. P.

(2013). Sensitivity plots for confounder bias in the single mediator model. Evaluation Review, 37(5), 405–431. https://doi.org/10.1177/0193841X14524576

Efron

Tibshirani

R. J.

(1993). An introduction to the bootstrap. Chapman & Hall.

10.

Fritz

M. S.

Kenny

D. A.

MacKinnon

D. P.

(2016). The combined effects of measurement error and omitting confounders in the single-mediator model. Multivariate Behavioral Research, 51, 681–697. https://doi.org/10.1080/00273171.2016.1224154

11.

Fritz

M. S.

MacKinnon

D. P.

(2007). Required sample size to detect the mediated effect. Psychological Science, 18(3), 233–239. https://doi.org/10.1111/j.1467-9280.2007.01882.x

12.

Fritz

M. S.

Taylor

A. B.

MacKinnon

D. P.

(2012). Explanation of two anomalous results in statistical mediation analysis. Multivariate Behavioral Research, 47(1), 61–87. https://doi.org/10.1080/00273171.2012.640596

13.

Fuller-Rowell

T. E.

Curtis

D. S.

El-Sheikh

Duke

A. M.

Ryff

C. D.

Zgierska

A. E.

(2017). Racial discrimination mediates race differences in sleep problems: A longitudinal analysis. Cultural Diversity and Ethnic Minority Psychology, 23(2), 165–173. https://doi.org/10.1037/cdp0000104

14.

Goldberg

Elliot

Clarke

G. N.

MacKinnon

D. P.

Moe

Zoref

Green

Wolf

S. L.

Greffrath

Miller

D. J.

Lapin

(1996). Effects of a multidimensional anabolic steroid prevention intervention: The Adolescents Training and Learning to Avoid Steroids (ATLAS) program. Journal of the American Medical Association, 276(19), 1555–1562. https://doi.org/10.1001/jama.1996.03540190027025

15.

Hayes

A. F.

Scharkow

(2013). The relative trustworthiness of inferential tests of the indirect effect in statistical mediation analysis: Does method really matter? Psychological Science, 24(10), 1918–1927. https://doi.org/10.1177/0956797613480187

16.

Imai

Keele

Tingley

(2010). A general approach to causal mediation analysis. Psychological Methods, 15, 309–334. https://doi.org/10.1037/a0020761

17.

James

L. R.

Brett

J. M.

(1984). Mediators, moderators, and tests for mediation. Journal of Applied Psychology, 69(2), 307–321. https://doi.org/10.1037/0021-9010.69.2.307

18.

Jorgensen

T. D.

Pornprasertmanit

Schoemann

A. M.

Rosseel

(2019). semTools: Useful tools for structural equation modeling. R package Version 0.5-2. https://CRAN.R-project.org/package=semTools

19.

Judd

C. M.

Kenny

D. A.

(1981). Process analysis: Estimating mediation in treatment evaluations. Evaluation Review, 5(5), 602–619. https://doi.org/10.1177/0193841X8100500502

20.

Kenny

D. A.

Kashy

Bolger

(1998). Data analysis in social psychology. In Gilbert

Fiske

Lindzey

(Eds.), Handbook of social psychology (4th ed., pp. 233–265). McGraw-Hill.

21.

Kisbu-Sakarya

MacKinnon

D. P

Miočević

(2014). The distribution of the product explains normal theory mediation confidence interval estimation. Multivariate Behavioral Research, 49(3), 261–268. https://doi.org/10.1080/00273171.2014.903162

22.

Lix

L. M.

Keselman

H. J.

(1998). To trim or not to trim: Tests of location equality under heteroscedasticity and nonnormality. Educational and Psychological Measurement, 58(3), 409–429. https://doi.org/10.1177/0013164498058003004

23.

Lomnicki

Z. A.

(1967). On the distribution of products of random variables. Journal of the Royal Statistical Society: Series B, 29(3), 513–524. https://doi.org/10.1111/j.2517-6161.1967.tb00713.x

24.

Lundgren

Dahl

Hayes

S. C.

(2008). Evaluation of mediators of change in the treatment of epilepsy with acceptance and commitment therapy. Journal of Behavioral Medicine, 31, 225–235. http://doi.org/10.1007/s10865-008-9151-x

25.

MacKinnon

D. P.

Dwyer

J. H.

(1993). Estimating mediating effects in prevention studies. Evaluation Review, 17, 144–158. https://doi.org/10.1177/0193841X9301700202

26.

MacKinnon

D. P.

Fairchild

A. J.

Fritz

M. S.

(2007). Mediation analysis. Annual Review of Psychology, 58, 593–614. https://doi.org/10.1146/annurev.psych.58.110405.085542

27.

MacKinnon

D. P.

Fritz

M. S.

Williams

Lockwood

C. M.

(2007). Distribution of the product confidence limits for the indirect effect: Program PRODCLIN. Behavior Research Methods, 39(3), 384–389. https://doi.org/10.3758/BF03193007

28.

MacKinnon

D. P.

Goldberg

Clarke

G. N.

Elliot

D. L.

Cheong

Lapin

Moe

E. L.

Krull

J. L.

(2001). Mediating mechanisms in a program to reduce intentions to use anabolic steroids and improve exercise self-efficacy and dietary behavior. Prevention Science, 2(1), 15–28. https://doi.org/10.1023/A:1010082828000

29.

MacKinnon

D. P.

Lockwood

C. M.

Hoffman

J. M.

West

S. G.

Sheets

(2002). A comparison of methods to test mediation and other intervening variable effects. Psychological Methods, 7(1), 83–104. https://doi.org/10.1037//1082-989X.7.1.83

30.

MacKinnon

D. P.

Lockwood

C. M.

Williams

(2004). Confidence limits for the indirect effect: Distribution of the product and resampling methods. Multivariate Behavioral Research, 39(1), 99–128. https://doi.org/10.1207/s15327906mbr3901_4

31.

MacKinnon

D. P.

Valente

M. J.

Gonzalez

(2020). The correspondence between causal and traditional mediation analysis: The link is the mediator by treatment interaction. Prevention Science, 21, 147–157. https://doi.org/10.1007/s11121-019-01076-4

32.

MacKinnon

D. P.

Warsi

Dwyer

J. H.

(1995). A simulation study of mediated effect measures. Multivariate Behavioral Research, 30(1), 41–62. https://doi.org/10.1207/s15327906mbr3001_3

33.

Mallinckrodt

Abraham

W. T.

Wei

Russell

D. W.

(2006). Advances in testing the statistical significance of mediation effects. Journal of Counseling Psychology, 53(3), 372–378. https://doi.org/10.1037/0022-0167.53.3.372

34.

Maruska

Hansen

Hanewinkel

Isensee

(2016). The role of substance-specific skills and cognitions in the effectiveness of a school-based prevention program on smoking incidence. Evaluation & the Health Professions, 39(3), 336–355. https://doi.org/10.1177/0163278715588825

35.

McManus

Surawy

Muse

Vazquez-Montes

Williams

J. M. G.

(2012). A randomized clinical trial of mindfulness-based cognitive therapy versus unrestricted services for health anxiety (hypochondriasis). Journal of Consulting and Clinical Psychology, 80(5), 817–828. https://doi.org/10.1037/a0028782

36. Meeker, W. Q., Jr., Cornwell, L. W., & Aroian, L. A. (1981). Selected tables in mathematical statistics, volume VII: The product of two normally distributed random variables. American Mathematical Society.

37. Pearl (2001). Direct and indirect effects. In Breese & Koller (Eds.), Proceedings of the 17th conference on uncertainty in artificial intelligence (pp. 411–420). Morgan Kaufmann.

38. Preacher, K. J., & Hayes, A. F. (2004). SPSS and SAS procedures for estimating indirect effects in simple mediation models. Behavior Research Methods, Instruments, & Computers, 36(4), 717–731. https://doi.org/10.3758/BF03206553

39. Preacher, K. J., & Hayes, A. F. (2008). Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models. Behavior Research Methods, 40(3), 879–891. https://doi.org/10.3758/BRM.40.3.879

40. Preacher, K. J., & Selig, J. P. (2012). Advantages of Monte Carlo confidence intervals for indirect effects. Communication Methods and Measures, 6(2), 77–89. https://doi.org/10.1080/19312458.2012.679848

41. R Core Team. (2019). R: A language and environment for statistical computing (Version 3.6.2) [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org/

42. Revelle (2018). psych: Procedures for personality and psychological research (Version 1.8.12). Northwestern University. https://CRAN.R-project.org/package=psych

43. Segaert, Hubert, Rousseeuw, & Raymaekers (2019). mrfDepth: Depth measures in multivariate, regression and functional settings. R package version 1.0.11. https://CRAN.R-project.org/package=mrfDepth

44. Sella, Sader, Lolliot, & Kadosh, R. C. (2016). Basic and advanced numerical performances relate to mathematical expertise but are fully mediated by visuospatial skills. Journal of Experimental Psychology: Learning, Memory, and Cognition, 42(9), 1458–1472. https://doi.org/10.1037/xlm0000249

45. Shrout, P. E., & Bolger (2002). Mediation in experimental and nonexperimental studies: New procedures and recommendations. Psychological Methods, 7(4), 422–445. https://doi.org/10.1037/1082-989X.7.4.422

46. Sobel, M. E. (1982). Asymptotic confidence intervals for indirect effects in structural equation models. Sociological Methodology, 13, 290–312. https://doi.org/10.2307/270723

47. Springer, M. D., & Thompson, W. E. (1966). The distribution of products of independent random variables. SIAM Journal on Applied Mathematics, 14(3), 511–526. https://doi.org/10.1137/0114046

48. Stone, C. A., & Sobel, M. E. (1990). The robustness of estimates of total indirect effects in covariance structure models estimated by maximum likelihood. Psychometrika, 55, 337–352. https://doi.org/10.1007/BF02295291

49. Tallman, B. A., Altmaier, & Garcia (2007). Finding benefit from cancer. Journal of Counseling Psychology, 54(4), 481–487. https://doi.org/10.1037/0022-0167.54.4.481

50. Tingley, Yamamoto, Hirose, Keele, & Imai (2014). Mediation: R package for causal mediation analysis. Journal of Statistical Software, 59(5), 1–38. https://doi.org/10.18637/jss.v059.i05

51. VanderWeele, T. J. (2010). Bias formulas for sensitivity analysis for direct and indirect effects. Epidemiology, 21, 540–551. https://doi.org/10.1097/EDE.0b013e3181df191c

52. Wickham (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag.

53. Wickham (2017). tidyverse: Easily install and load the ‘Tidyverse’. R package version 1.2.1. https://CRAN.R-project.org/package=tidyverse

54. Wilcox, R. R. (1995). ANOVA: The practical importance of heteroscedastic methods, using trimmed means versus means, and designing simulation studies. British Journal of Mathematical and Statistical Psychology, 48, 99–114. https://doi.org/10.1111/j.2044-8317.1995.tb01052.x

55. Wilcox, R. R., Keselman, H. J., & Kowalchuk, R. K. (1998). Can tests for treatment group equality be improved?: The bootstrap and trimmed means conjecture. British Journal of Mathematical and Statistical Psychology, 51, 123–134. https://doi.org/10.1111/j.2044-8317.1998.tb00670.x

56. Williams, & MacKinnon, D. P. (2008). Resampling and distribution of the product methods for testing indirect effects in complex models. Structural Equation Modeling: A Multidisciplinary Journal, 15(1), 23–51. https://doi.org/10.1080/10705510701758166

