Abstract
A new two-sample test for comparing variability measures is proposed. To make the test robust and powerful, a new modified structural zero removal method is applied to the Brown–Forsythe transformation. The t-test-based statistic allows results to be expressed as the ratio of mean absolute deviations from the median. An extensive simulation study demonstrates that the proposed test is robust to small or unequal sample sizes across many distributions. Moreover, careful exploratory analysis provides a new method for calculating the implicit association test scores for reaction time data with multiplicative treatment effects. Using this method, a possible difference between the variability of men’s and women’s implicit attitudes toward gay men is analyzed.
1. Introduction
Comparing means to characterize group differences is common, but sometimes variability differences between groups are more meaningful. For instance, age groups differ by variability in sleep spindles and rapid eye movements (Peters, Ray, Fogel, Smith, & Smith, 2014). Likewise, certain mental disorders are indicated by variability differences; larger intravariability in reaction times (RTs) is observed among individuals with schizophrenia and attention deficit hyperactivity disorder (Smyrnis et al., 2009; Tamm et al., 2012; Theleritis, Evdokimidis, & Smyrnis, 2014). Additional applications in archaeology, environmental sciences, economics, medical research, and legal studies are also documented by Gastwirth, Gel, and Miao (2009). The variety of potential applications motivates the main goal of our article, which is to develop a reliable test for checking the equality of variances or other robust measures of variability.
The popular variance ratio test (also known as the F test) is highly sensitive to nonnormal distributions (Box, 1953; Shoemaker, 2003). Therefore, many attempts have been made to develop a hypothesis test for comparing variability measures that is robust to nonnormal distributions, outliers, and small or unequal sample sizes. One of the most popular tests was developed by Levene (1960); it compares variability measures among samples based on the absolute deviation from the sample mean. However, his transformation, referred to as Levene’s transformation (Neuhäuser, 2007), makes the resulting hypothesis test nonrobust when the underlying distribution is skewed.
Brown and Forsythe (1974) suggested a more useful modification that employs the absolute deviation from the sample median, which we refer to as the Brown–Forsythe transformation. Through simulations, they showed that their transformation makes the test more robust to skewed distributions. Their results were confirmed by a large simulation study in Conover, Johnson, and Johnson (1981) and theoretically by Carroll and Schneider (1985).
The Brown–Forsythe transformation, however, makes the test very conservative when the sample sizes are small and odd. This weakness was demonstrated in various simulation studies (Carroll & Schneider, 1985; Conover, Johnson, & Johnson, 1981; Lim & Loh, 1996; Martin, 1976). Conover et al. (1981) and Hines and O’Hara Hines (2000) identified that the weakness is largely due to the presence of a “structural zero” (the transformed observation which is equal to zero, as one observation is exactly equal to the sample median). Removing the structural zero makes the test less conservative and more powerful.
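As a minimal illustration of the two transformations and of how a structural zero arises, consider the following Python sketch. The function names and the reaction time values are hypothetical and are not taken from the article or its software.

```python
from statistics import mean, median

def levene_transform(sample):
    """Absolute deviations from the sample mean (Levene's transformation)."""
    center = mean(sample)
    return [abs(x - center) for x in sample]

def brown_forsythe_transform(sample):
    """Absolute deviations from the sample median (Brown-Forsythe transformation)."""
    center = median(sample)
    return [abs(x - center) for x in sample]

# Hypothetical reaction times (ms) with an odd sample size (n = 5).
rts_odd = [412.0, 455.0, 430.0, 610.0, 398.0]
z_odd = brown_forsythe_transform(rts_odd)
# The observation equal to the sample median (430.0) is mapped to exactly zero.
structural_zeros = [v for v in z_odd if v == 0.0]

# With an even sample size the median is the midpoint of the two middle
# observations, so no transformed value is forced to be zero.
rts_even = rts_odd[:4]
z_even = brown_forsythe_transform(rts_even)
```

With an odd sample size, one transformed value is zero by construction regardless of the data, which is why averaging over all n transformed values understates the spread in small samples.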
Another line of research examines the bias of the sample mean of the Levene- or Brown–Forsythe-transformed normal random variables to the standard deviation as functions of sample sizes. To obtain a more reliable hypothesis test or confidence interval in the unequal sample size cases, several correction factors have been suggested (Bonett & Seier, 2003; Keyes & Levy, 1997; O’Brien, 1978; O’Neill & Mathews, 2000). Moreover, Noguchi and Gel (2010) considered the zero-correction method that combines the structural zero removal method of Hines and O’Hara Hines (2000) and the correction factor suggested by O’Brien (1978) for the Brown–Forsythe transformation. A modified version of the zero-correction method, which we explain in detail in the following section, tends to make the test both robust and powerful under unequal or odd sample sizes and skewed or heavy-tailed distributions.
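The dependence of this bias on the sample size can be checked with a small Monte Carlo experiment. The sketch below is only an illustration under assumed settings (standard normal data, arbitrary seed and replication counts); it does not reproduce any of the cited correction factors.

```python
import math
import random
from statistics import median

def mean_abs_dev_from_median(sample):
    """Sample mean of the Brown-Forsythe-transformed observations."""
    center = median(sample)
    return sum(abs(x - center) for x in sample) / len(sample)

def expected_ratio(n, reps, rng):
    """Monte Carlo estimate of E[mean |X_i - median|] / sigma for N(0, 1) samples."""
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += mean_abs_dev_from_median(sample)
    return total / reps

rng = random.Random(2024)  # arbitrary seed, for reproducibility only
ratio_n5 = expected_ratio(5, 20000, rng)
ratio_n50 = expected_ratio(50, 4000, rng)
# For a normal variable, E|X - mu| = sigma * sqrt(2 / pi), about 0.798 * sigma.
large_sample_limit = math.sqrt(2.0 / math.pi)
```

The estimate for n = 5 falls noticeably below the estimate for n = 50, which in turn is close to the large-sample value. A sample-size-dependent correction factor is meant to remove this kind of discrepancy when the two samples differ in size.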
Considering the independent two-sample case in particular, Marozzi (2011) compared a number of two-sample tests and recommended the Brown–Forsythe test as well as several resampling-based methods including the bootstrap-based test of Hall and Padmanabhan (1997) and Pan (1999). However, Bonett and Seier (2003) claimed that recent modifications of the classical F test suggested by Shoemaker (2003), the bootstrap-based test of Hall and Padmanabhan (1997), and the delta method–based test of Pan (1999) are not robust with respect to violations of distributional assumptions or unequal sample sizes. Specifically, they have pointed out the nonrobustness of the abovementioned tests when the samples are drawn from nonidentical distributions. To address these issues, they have developed their own method in the form of a confidence interval and showed that theirs is much more robust with respect to such violations.
Therefore, we aim to develop a robust independent two-sample test for comparing variability measures, which is at least as competitive as that of Bonett and Seier (2003), using the idea of Pan (1999). To do so, we incorporate the Brown–Forsythe transformation, structural zero removal, a new correction factor, and improved effective degrees of freedom based on Noguchi and Marmolejo-Ramos (2016) in our test. Our novel correction factor, based on the results of O’Neill and Mathews (2000), is more suitable for the Brown–Forsythe transformation than the previously suggested correction factors. Moreover, we modify the zero-correction method of Noguchi and Gel (2010) by taking into account the simulation results of Loh (1987). Then, we show through carefully planned sets of simulations that our test is at least as robust and powerful as that of Bonett and Seier (2003), outperforming it in a number of situations, especially when the sample sizes are small or odd.
To demonstrate an application of the proposed test, we conduct a thorough reanalysis of the data set used in Lemm (2006). The reanalysis offers new insights into characterization of gender differences based on variability in implicit attitudes toward gay men (ATGM). To obtain a reliable conclusion, we also develop a robust method for calculating the implicit association test (IAT) scores for the RT data with multiplicative treatment effects and errors. Using the results from the proposed test based on the robust IAT scores, we show how the differences in variability may imply a gender difference in mean ATGM conditional on the levels of close contact by considering the contact hypothesis of Allport (1954).
This article is organized as follows. In Section 2, we suggest a modified zero-correction method based on a new correction factor and structural zero removal method. In Section 3, we discuss the new test statistic and its Welch–Satterthwaite type small-sample approximation. In Section 4, we discuss the performance of the proposed test in terms of the empirical size and power for various sample size and distribution combinations, which are plausible for comparing RT data. Then, in Section 5, we propose a robust IAT score calculation method, called log ratio analysis, for the multiplicative treatment effect. Based on the RT data used in Lemm (2006), we examine the differences between men and women in terms of mean and variability of the implicit ATGM scores. Lastly, in Section 6, we offer concluding remarks and mention possible future extensions.
2. Modified Zero-Correction Method
We start by explaining a modified zero-correction method, which is essentially a combination of the structural zero removal method of Hines and O’Hara Hines (2000) and an appropriate novel correction factor in the unequal sample size case. With the modified zero-correction method, the proposed test described in Section 3 becomes less conservative and more powerful for any sample size combination.
Let
To understand how
Even though several approximations for
Comparison of the Three Approximations of
Note. a Modified zero-correction in this article; b Method of O’Brien (1978); c Method of Bonett and Seier (2003).
Another improvement that we discuss in this article is a modification of the structural zero removal method briefly suggested in Conover et al. (1981) and discussed in detail by Hines and O’Hara Hines (2000). When ni is odd, there is an observation
Based on the observations above, we suggest an approach of combining the structural zero removal method and the new correction factor.
Modified Zero-Correction Method:
When ni is odd: Remove the structural zero ( Let
When ni is even:
Let
As a remark, the constant multiple
3. Proposed Test Statistic and Its Small-Sample Approximation
The test statistic we consider in this article is based on the t statistic with the delta method applied to
To come up with a hypothesis testing procedure for checking the equality of
where
and that “∻” denotes “approximately distributed as.” As shall be seen in Section 4, Equation 2 is useful to well approximate the distribution of Equation 1 under
For a small-sample approximation, although Pan (2002) suggested a Student’s t distribution with
Based on the idea above, an approximate
where
After exponentiation, the confidence interval in Equation 6 becomes
which is a
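The general idea of this section, a log-ratio statistic whose standard error comes from the delta method and whose interval is exponentiated back to the ratio scale, can be sketched as follows. This is a deliberately simplified illustration and not the proposed test: it omits the structural zero removal and correction factor of Section 2, and it uses a standard normal quantile where the article uses a Welch–Satterthwaite type approximation. The function name and the direction of the ratio (sample 1 over sample 2) are assumptions.

```python
import math
from statistics import NormalDist, mean, median, variance

def log_ratio_interval(x1, x2, level=0.95):
    """Simplified large-sample inference for the ratio of mean absolute
    deviations from the median (no zero removal, no correction factor,
    normal reference distribution)."""
    summaries = []
    for x in (x1, x2):
        center = median(x)
        z = [abs(v - center) for v in x]  # Brown-Forsythe transformation
        summaries.append((mean(z), variance(z), len(z)))
    (m1, v1, n1), (m2, v2, n2) = summaries
    # Delta method: Var(log(mean of z)) is roughly s^2 / (n * mean^2).
    se = math.sqrt(v1 / (n1 * m1 ** 2) + v2 / (n2 * m2 ** 2))
    log_ratio = math.log(m1 / m2)
    q = NormalDist().inv_cdf(0.5 + level / 2.0)
    stat = log_ratio / se
    return {
        "ratio": m1 / m2,
        "lower": math.exp(log_ratio - q * se),  # exponentiate the log-scale limits
        "upper": math.exp(log_ratio + q * se),
        "statistic": stat,
        "p_value": 2.0 * (1.0 - NormalDist().cdf(abs(stat))),
    }

# Hypothetical data: the second sample is the first one scaled by 2.
sample1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
sample2 = [2.0 * v for v in sample1]
result = log_ratio_interval(sample1, sample2)
```

Working on the log scale keeps the interval for the ratio strictly positive and makes it behave symmetrically when the two samples are swapped, since the ratio is then simply inverted.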
4. Simulation Study
We have conducted a simulation study to check the robustness and power of the proposed test. Both the simulated size and power of the tests are calculated by generating 10,000 data sets from various distributions. In particular, the normal, Student’s t, gamma, Weibull, and ex-Gaussian distributions are used. The four nonnormal distributions are chosen with three different input parameters, representing a wide variety of sampling distributions with skewness and kurtosis in the ranges of [0.00, 2.00] and [2.50, 9.00], respectively. The parameters are chosen such that the resulting distributions are suitable for representing real world data, such as RTs, which are typically modeled by one of the abovementioned distributions. Table 2 lists the chosen parameters and resulting distribution details. For the convenience of presenting the results, these 13 distributions are divided into three groups based on similarity (Group 1: symmetric, Group 2: moderately skewed, Group 3: highly skewed and heavy tailed). Because each sample may come from one of the 13 distributions, in total, we have considered
Distribution Details for the Simulation Study
For a comprehensive evaluation, we have selected 16 different combinations of sample sizes
The performance of the proposed test is compared to the test suggested by Bonett and Seier (2003), originally presented in terms of confidence intervals. The Bonett–Seier (B-S) test is a suitable benchmark as it has favorable robustness properties compared to other popular tests for comparing variability measures (Bonett & Seier, 2003). The B-S test statistic is given by
where
For both the size and power calculations, we let
Figures 1 and 2 show that the proposed test maintains the nominal size of the test very well across different distribution and sample size combinations. However, the proposed test seems slightly liberal when there is a large imbalance in the sample sizes. On the other hand, the B-S test tends to be somewhat conservative, especially when at least one of the sample sizes is equal to 5, which corresponds to the cases in Figure 2 where at least one of the sample size groups is 1. Small, odd sample sizes are less problematic for the proposed test because it benefits from the modified zero-correction method.

Sizes of the test by different distribution group combinations.

Sizes of the test by different sample size group combinations.
Figures 3 and 4 summarize the power comparisons between the proposed test and the B-S test. To make sure that the size of the test is very close to the nominal level of

Power comparisons of the tests by different distribution group combinations for n1 ≥ 35 and n2 ≥ 35. Dark gray: The proposed test is at least 1% more powerful than the Bonett–Seier test.

Power comparisons of the tests by different distribution group combinations for n1 ≥ 35 and n2 ≥ 35. Dark gray: The proposed test is at least 1% more powerful than the Bonett–Seier test.
Based on these observations, the proposed test displays robustness to unequal or small sample sizes and a wide range of skewness and kurtosis. Moreover, the proposed test appears to be more robust and powerful than the B-S test, especially when the underlying distributions are symmetric or moderately skewed. Therefore, our test seems applicable in various situations where the differences in variability measures are the parameters of interest to researchers.
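The kind of size calculation described in this section can be sketched as a short Monte Carlo loop. The sketch below is an assumption-laden illustration: the ex-Gaussian parameters, sample sizes, seed, and number of replications are hypothetical, and the test inside the loop is a simplified log-ratio test with a normal reference distribution rather than the proposed test or the B-S test.

```python
import math
import random
from statistics import NormalDist, mean, median, variance

def ex_gaussian(rng, mu, sigma, tau):
    """One ex-Gaussian draw: a normal variate plus an independent exponential
    variate with mean tau."""
    return rng.gauss(mu, sigma) + rng.expovariate(1.0 / tau)

def naive_log_ratio_pvalue(x1, x2):
    """Two-sided p value of a simplified log-ratio test (no zero removal,
    no correction factor, normal reference distribution)."""
    parts = []
    for x in (x1, x2):
        center = median(x)
        z = [abs(v - center) for v in x]
        parts.append((mean(z), variance(z), len(z)))
    (m1, v1, n1), (m2, v2, n2) = parts
    se = math.sqrt(v1 / (n1 * m1 ** 2) + v2 / (n2 * m2 ** 2))
    stat = math.log(m1 / m2) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(stat)))

def simulated_size(n1, n2, reps=2000, alpha=0.05, seed=1):
    """Proportion of rejections when both samples share one distribution."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # Hypothetical RT-like parameters (ms); not those used in the article.
        x1 = [ex_gaussian(rng, 400.0, 40.0, 100.0) for _ in range(n1)]
        x2 = [ex_gaussian(rng, 400.0, 40.0, 100.0) for _ in range(n2)]
        if naive_log_ratio_pvalue(x1, x2) < alpha:
            rejections += 1
    return rejections / reps

size = simulated_size(15, 25)
```

Comparing the returned proportion with the nominal level shows whether a test is conservative or liberal for that configuration. Power is obtained in the same way after giving the two samples different variability.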
5. Real-Life Application
5.1. The IAT
To demonstrate an application of the proposed test, we investigate data from a study by Lemm (2006) examining relationships between implicit and explicit ATGM using the IAT. Implicit associations are representative of automatic cognitive processes taking place outside of one’s awareness. As such, they must be measured indirectly, as opposed to explicit attitudes that are frequently measured through tools like self-report questionnaires. The IAT has become a standard in measuring implicit associations through RTs since its initial introduction by Greenwald, McGhee, and Schwartz (1998). It has been heavily employed in the field of social psychology to understand attitudes and biases.
To measure implicit associations, the IAT requires subjects to categorize sets of stimuli into one of four categories. These stimuli represent dichotomies from two dimensions. One dimension is the target concept, which highlights the phenomenon under investigation. For example, studies examining smoking behavior use stimuli representing exercise and smoking for the target concept (Perugini, 2005), while studies examining racial bias may use stimuli representing people who are Black or White (Wittenbrink, Judd, & Park, 2001). The second dimension represents the evaluative attribute. The dichotomy of pleasant/unpleasant is commonly used for the attribute domain.
The IAT compares RTs between trials with different pairings of these dimensions, where the two pairings are referred to as critical blocks. For example, in Perugini (2005), one critical block paired responses for “smoking” and “pleasant” to the same response key, while “exercise” and “unpleasant” were paired to another key. The second critical block swaps this pairing. Subjects who associate a given pairing more strongly (i.e., a congruent pairing) tend to make faster and more accurate categorizations of stimuli compared to incongruent pairings. Larger differences in critical block RTs indicate a greater difference in congruence between these pairings, resulting in a larger IAT score, whose calculation will be explained later. Interpretation of the IAT scores typically indicates the preference or bias of a subject by noting which target concept has lower RTs when paired with the pleasant attribute.
5.2. Lemm’s RT Data on ATGM
The study by Lemm (2006), which we reexamine in this section, analyzes RT data taken from
5.3. The IAT Score Calculation
To better understand the traditional IAT score calculation of Greenwald et al. (1998) used in Lemm (2006), let
As a remark, Greenwald, Nosek, and Banaji (2003) suggested steps for calculating an improved IAT score, commonly referred to as the D score. The D score is analogous to the popular Cohen’s d effect size measure, which attempts to generate more reliable results by standardizing the mean difference. Although the interested reader is referred to the second and third columns of Table 4 in their paper for the details, notable differences from the traditional calculation include removing RTs below 400 milliseconds and measuring the standardized mean difference between the two types of pairings without log transformation.
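The core of a D-score-type calculation, a mean difference between the two pairings divided by a standard deviation computed from all retained trials, can be sketched as follows. This is a simplified illustration only: it keeps the two features named above (a lower RT cutoff and no log transformation) and leaves out the remaining steps of the published algorithm. The function name, the default cutoff argument, and the data are assumptions.

```python
import math
from statistics import mean, stdev

def simplified_d_score(incongruent_rts, congruent_rts, min_rt=400.0):
    """Standardized mean difference between the two critical blocks.

    Simplification: trials faster than `min_rt` are dropped, and the
    denominator is the standard deviation of all retained trials from
    both blocks taken together.
    """
    inc = [t for t in incongruent_rts if t >= min_rt]
    con = [t for t in congruent_rts if t >= min_rt]
    inclusive_sd = stdev(inc + con)
    return (mean(inc) - mean(con)) / inclusive_sd

# Hypothetical RTs (ms) for one subject; the 350 ms trial is discarded.
congruent = [500.0, 600.0, 700.0, 350.0]
incongruent = [700.0, 800.0, 900.0]
d = simplified_d_score(incongruent, congruent)
```

Because the score is built from a difference of block means, it is most natural when one block differs from the other by an added constant.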
Among the findings based on the traditional IAT score calculation, Lemm (2006) noted that internal motivation to respond without prejudice (Plant & Devine, 1998) and having positive contact with gay men had the most significant relationship to lower IAT scores, indicating less prejudice toward gay men. Although no significant difference in the mean IAT scores between men and women was observed, Lemm (2006) interpreted the result with caution because of the possible influence of having a sample of students from the liberal Pacific Northwest. Nevertheless, it is possible that other gender differences in IAT scores may exist, which may be better described by a more suitable IAT score calculation method. Specifically, we consider taking the possible multiplicative nature of RTs into account.
5.4. The IAT Score Calculation for the Multiplicative RT Pattern
The D score is well established within the research community as a standard IAT score calculation method. In particular, if each subject’s RTs follow an additive pattern (i.e., RTs from one critical block tend to differ from the other by an added constant) and additive error, the D score, which compares the difference between block types, is well justified mathematically. However, this approach may be less appropriate for nonlinear RT patterns. Also, note that the study conducted by Greenwald et al. (2003) to show the superiority of the D score is based on a very large sample size (
To address the assumption regarding the RT response pattern, we may consider the multiplicative RT pattern mentioned in Salthouse and Hedden (2002) instead. Under the multiplicative RT pattern, latencies between treatments differ by a positive constant multiple rather than an additive constant. If we consider the multiple IAT tasks and responses to stimuli from the target concept and evaluative attribute as distinct, then the data set from Lemm’s (2006) study may also reveal a multiplicative response pattern. Indeed, RTs of subjects revealed some features suggestive of this pattern (see Figure 5). That is, delayed responses are typically amplified by some constant factor for incongruent pairings compared to the congruent ones. In addition, for the data showing a multiplicative pattern, it is natural to assume a multiplicative error (cf. Noguchi, Aue, & Burman, 2016). Because the raw data on the left-hand side are contaminated with noise, they are smoothed using the moving average filter and displayed on the right-hand side. To accommodate a multiplicative RT pattern with a multiplicative error, calculation of the IAT scores using the ratio of RTs between blocks instead of taking their differences may be more desirable.
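A moving average filter of the kind used for this smoothing can be sketched as below. The window length is an assumption, since the value used for Figure 5 is not stated here, and the sketch returns only the averages over complete windows.

```python
def moving_average(values, window=5):
    """Simple moving average over complete windows of length `window`.

    The output has len(values) - window + 1 entries; no padding is applied
    at the ends of the series.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    smoothed = [running / window]
    for i in range(window, len(values)):
        # Slide the window by one trial: add the new value, drop the oldest.
        running += values[i] - values[i - window]
        smoothed.append(running / window)
    return smoothed

# Hypothetical noisy RT series (ms).
raw = [520.0, 610.0, 480.0, 950.0, 530.0, 560.0, 500.0]
smooth = moving_average(raw, window=3)
```

Under a multiplicative pattern, the smoothed incongruent series would look like the congruent series stretched by a roughly constant factor rather than shifted by a constant amount.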

Subjects suggesting a multiplicative RT pattern. Gray: Incongruent pairings. Black: Congruent pairings.
To construct an IAT score calculation method for the multiplicative RT pattern, as before, let
To answer the question of whether the log ratio analysis method is reliable compared to the D score method, we have conducted a simulation study assuming a multiplicative error structure without any presence of an effect (for details, see the Supplemental Material available in the online version of this article). The results suggest that the log ratio analysis method appears to be more reliable, ensuring that the simulated effect sizes are kept reasonably close to zero when the effect is absent. On the other hand, the results from the D score method might not be reliable as there seems to be a high probability of producing unacceptably high effect sizes even when the effect is absent.
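As a rough sketch of a score built from log ratios, the following Python function takes the median of the log ratios of RTs between the two blocks for one subject. How trials from the two blocks are matched is an assumption here (they are paired by position), and the function name and data are hypothetical.

```python
import math
from statistics import median

def log_ratio_score(incongruent_rts, congruent_rts):
    """Median of log(incongruent / congruent) over position-matched trials."""
    if len(incongruent_rts) != len(congruent_rts):
        raise ValueError("both blocks must contain the same number of trials")
    return median(
        math.log(inc / con) for inc, con in zip(incongruent_rts, congruent_rts)
    )

# Hypothetical subject: incongruent RTs are 1.5 times the congruent RTs,
# except for one extreme trial of 5000 ms.
congruent = [500.0, 620.0, 480.0, 700.0, 550.0]
incongruent = [750.0, 930.0, 5000.0, 1050.0, 825.0]
score = log_ratio_score(incongruent, congruent)
```

In this example the score equals log(1.5) despite the extreme trial, which illustrates why a median of log ratios suits data with a multiplicative effect, multiplicative error, and occasional extreme observations.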
5.5. Mean and Variability Comparisons Using the IAT Scores
As mentioned before, the goal of this analysis is to investigate a possible gender difference in the variability of the IAT scores. Let
Table 3 summarizes the results of mean and variability comparisons using the two types of the IAT score calculations. Consistent with Lemm’s (2006) original findings, no significant difference in the mean IAT scores is observed between men and women. Of primary interest, however, is the possibly significant difference found in IAT variability between men and women. All three tests indicate some evidence for the difference in the IAT score variability between men and women using D scores. However, as we discussed previously, for data having a number of extreme observations and showing a possibly multiplicative RT pattern, it may make more sense to use the log ratio analysis method. Based on the log ratio analysis method, the statistical significance is reduced compared to the D score method, implying that the statistically significant result for the D score method may be affected by extreme observations. Nevertheless, the p values of .092 and .098 for our test and the B-S test, respectively, may offer some evidence for a possible gender difference explained by the multiplicative effects.
In addition, we have calculated Cohen’s d effect size of the sample means (for the mean comparisons) and sample mean absolute deviations from median (for the variability comparisons). Using the common effect size interpretation of .2 (small), .5 (medium), and .8 (large) for Cohen’s d, we note that the effect is small for the mean comparisons and small to medium for the variability comparisons by looking at the log ratio analysis (see Table 3). That lends support to our finding that the difference in the IAT score variability may be more significant than the difference in the mean IAT scores.
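For reference, Cohen’s d with a pooled standard deviation can be computed as in the sketch below. Applying the same function to the absolute deviations from the median to obtain an effect size for variability is our reading of the description above, not a confirmed detail of the analysis, and all data shown are hypothetical.

```python
import math
from statistics import mean, median, variance

def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

def abs_dev_from_median(sample):
    """Brown-Forsythe-transformed observations."""
    center = median(sample)
    return [abs(x - center) for x in sample]

# Hypothetical IAT scores for two groups.
group_a = [2.0, 4.0, 6.0]
group_b = [1.0, 3.0, 5.0]
d_mean = cohens_d(group_a, group_b)

# Effect size for variability (assumed construction): Cohen's d applied
# to the absolute deviations from each group's median.
wide = [0.0, 2.0, 5.0, 8.0, 10.0]
narrow = [3.0, 4.0, 5.0, 6.0, 7.0]
d_variability = cohens_d(abs_dev_from_median(wide), abs_dev_from_median(narrow))
```

The magnitudes of both quantities can then be read against the conventional benchmarks of .2, .5, and .8.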
Summary of the p Values and Effect Sizes Comparing IAT Scores for Men and Women
5.6. Interpretation of the Results
The results about the possible differences in variability may be interpreted in terms of the contact hypothesis of Allport (1954) and the gender difference in the network of supportive or close friends. The contact hypothesis essentially states that interaction with stigmatized group members can lead to reduced prejudice and stereotyping among dominant group members, under certain circumstances (Lemm, 2006). Also, studies have shown that female heterosexuals are more likely than male heterosexuals to know a gay person (Herek & Capitanio, 1996) and that late adolescent girls have a significantly greater number of supportive or close friends than boys (Colarossi, 2001; Feiring, 1999). The studies by Colarossi (2001) and Feiring (1999) also showed greater variability in the number of supportive or close friends for late adolescent girls than for boys, suggesting a positive mean–variance relationship in the close contact variable. Thus, we postulate that females are more likely to have a greater variability in having close relationships with gay people than males, possibly contributing to the greater variability in the implicit ATGM.
However, it should also be noted that the results in Table 3 showed no statistically significant difference in the mean implicit ATGM between men and women, similar to the previous research on the explicit ATGM by Herek and Capitanio (1996) based on a national sample. For now, let us assume that both males and females with the same level of contact with gay people have the same mean level of implicit ATGM. Then, assuming a positive mean–variance relationship in the close contact variable with gay people, the implicit ATGM should be lower for the female group than the male group given the contact hypothesis, contrary to what the results suggest.
That is, it seems more natural to postulate that males and females may have different mean levels of implicit ATGM, given that they have the same level of contact with gay people. Further investigations on the relationship between the level of contact with gay people and implicit ATGM may shed light on a deeper understanding of the gender difference. Additional considerations of the results can be found in the online Supplemental Material.
6. Conclusion
We have discussed a new test for comparing the variability measures in the independent two-sample case. Specifically, the test is made robust and powerful by considering the modified structural zero removal method with a new correction factor on the Brown–Forsythe transformation. Moreover, noting that the Brown–Forsythe-transformed variables are positive, the test statistic utilizes the ratio of their means by applying the delta method with the log transformation. To ensure that the test performs reliably even when the sample sizes are small, odd, or unequal, a modification of the Welch–Satterthwaite type small-sample approximation is presented, properly accounting for the log transformation in the test statistic. The comprehensive simulation results covering many RT data scenarios suggest that the proposed test seems to be more robust and powerful than the B-S test under various sample size and distribution configurations.
Another contribution we have made in this article is the log ratio analysis method, which could be highly useful for the IAT score calculations when the data show a multiplicative treatment effect and a multiplicative error with a number of extreme observations. Through a careful reexamination of the data set analyzed by Lemm (2006), we have proposed that such RT data may exist, and more reliable IAT scores may be calculated by considering the median of the log ratio of the RTs in the two different blocks for each individual. The statistical analysis using the proposed test indicates a possible difference in the IAT score variability between men and women. Based on that, we postulated that this finding may be related to the difference in mean implicit ATGM between men and women conditional on the same level of contact with gay people. Further investigations on the relationship between the level of contact with gay people and implicit ATGM are worthwhile as they may contribute to a deeper understanding of the gender difference. All the procedures suggested in this article, together with the data, are implemented in an R package “intervcomp,” and examples are provided in the online Supplemental Material.
Although the main purpose of the article was to develop a robust test for comparing variability measures of two independent samples, it is possible to extend the approach to the multisample cases. Specifically, noting that researchers often have a set of specific comparisons they wish to make, a modification of the multiple contrast test suggested in Hothorn, Libiger, and Gerhard (2012) can be considered for future work.
Supplemental Material
Supplementary_Material for A Robust Test for Checking the Homogeneity of Variability Measures and Its Application to the Analysis of Implicit Attitudes by Ryan C. Erps and Kimihiro Noguchi in Journal of Educational and Behavioral Statistics
Footnotes
Acknowledgments
The authors are grateful to two anonymous reviewers and the Editor for their helpful comments that have led to substantial improvement in the quality of the article. The authors are also grateful to Kristi Lemm for providing the RT data and to Fernando Marmolejo-Ramos for helpful comments.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
References
