Mann-Whitney U test and t-test

Editorial. First published online January 2023

Volume 117, Issue 1. https://doi.org/10.1177/0145482X221150592

Abstract

In this issue of the Journal of Visual Impairment & Blindness (JVIB), the article entitled “The effect of a training video with audio description on the breast self-examination of women with visual impairments,” by Çelik and İldan Çalım, notes that some of the authors' primary quantitative measures do not satisfy the requirements for normality; that is, the scores are not distributed along a bell curve (for more background on this topic, review the Statistical Sidebar from the May-June issue of 2018). Because of this issue, the authors chose a nonparametric equivalent to the parametric statistical test they would typically use.

Let us unpack that statement a bit. At its core, a parametric statistical test is one that assumes the dependent variable is normally distributed and bases its calculations on means and standard deviations. There is also a requirement that there should be enough scores in each group being compared so that the means and standard deviations are not swayed too much by outliers in the data. What constitutes “enough scores” is a matter of debate. Some say there should be at least 30 scores in each group; others simply say that more is better. More is indeed better, but you can get away with fewer than 30 scores in a group if the spread of scores in your data is small, you have tight controls on extraneous variables influencing your data, and your study design and analytical approach have additional safeguards against the influence of outliers. But these matters have more to do with study design than statistical analysis.

If your data fail to satisfy the basic requirements of normal distribution, as in the article under discussion in this sidebar, then a parametric statistical test is not appropriate, because parametric tests are based on means and standard deviations. They assume that the central tendency of the normal curve (that is, most scores tend to be in the center of the curve) is in play and driving the mean of the groups being compared. If this assumption is not true, then the theoretical curve behind a group of scores is skewed to one side or the other, and the mean is no longer the best estimate for representing the group of scores. Nonparametric tests were developed to accomplish the same kinds of group comparisons that parametric tests are capable of doing, but without relying on means to do them.
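To make the point about skew concrete, here is a minimal Python sketch. The scores are invented purely for illustration and have nothing to do with the CHBM data; the variable names are likewise assumptions. It shows how a single extreme score pulls the mean away from where most of the scores sit, while the median stays put.

```python
from statistics import mean, median

# Hypothetical scores (not from the article): most are low, one is very high,
# so the distribution is skewed to the right.
scores = [2, 3, 3, 4, 4, 5, 30]

skew_mean = mean(scores)      # pulled upward by the single extreme score
skew_median = median(scores)  # stays with the bulk of the scores

print(f"mean = {skew_mean:.2f}, median = {skew_median}")
```

In this made-up group, the mean ends up larger than six of the seven scores, which is why it is a poor summary of a skewed group.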

In the article under consideration, if the Champion's Health Belief Model (CHBM) scale scores had conformed to the normal distribution parameters, the authors would have used an independent groups t-test to compare the data of two groups: the group of women who watched the video with audio description and the group who watched the video without description. Since there are two groups, a t-test would have been used; since different women were in the two groups, the independent groups version would have been used. But, since the CHBM scores were not normally distributed, the authors used the Mann-Whitney U test as the nonparametric alternative, which gives us the chance to see how the two tests are related.

The t-test equation

$t = \frac{\bar{x}_{1} - \bar{x}_{2}}{s_{p} \sqrt{\frac{1}{n_{1}} + \frac{1}{n_{2}}}}$

simply means that you find the mean of the first group and subtract the mean of the second group, then divide that difference by the pooled standard deviation multiplied by that large square root. The square root might look scary, but it is calculated by just plugging in the number of scores in the first group and the number of scores in the second group and running through the math.

You may also be wondering, What is this “pooled standard deviation”? It is a number that is calculated to take into consideration the standard deviations of both of the groups of scores. It is also a fairly simple calculation made up only of the standard deviations from both groups and the number of scores in each group. That equation is:
$s_{p} = \sqrt{\frac{(n_{1} - 1) s_{1}^{2} + (n_{2} - 1) s_{2}^{2}}{n_{1} + n_{2} - 2}}$
which is a bigger and scarier square root, but it is calculated by just plugging in the number of scores in group one and group two and the standard deviation of group one and group two in the appropriate places.
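For readers who prefer to see the arithmetic run, the two equations can be sketched in a few lines of Python. This is an illustration only: the function names `pooled_sd` and `independent_t` and the two small groups of scores are made up for this example, and the sketch assumes the sample standard deviation (with n - 1 in the denominator) is the one intended.

```python
from math import sqrt
from statistics import mean, variance

def pooled_sd(group1, group2):
    """Pooled standard deviation of two independent groups of scores."""
    n1, n2 = len(group1), len(group2)
    # variance() returns the sample variance (n - 1 denominator), i.e. s squared.
    s1_sq, s2_sq = variance(group1), variance(group2)
    return sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))

def independent_t(group1, group2):
    """t value for an independent groups t-test using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    sp = pooled_sd(group1, group2)
    return (mean(group1) - mean(group2)) / (sp * sqrt(1 / n1 + 1 / n2))

# Hypothetical scores, chosen so the arithmetic is easy to follow by hand.
group1 = [1, 2, 3, 4, 5]
group2 = [3, 4, 5, 6, 7]

print(f"pooled SD = {pooled_sd(group1, group2):.3f}")
print(f"t = {independent_t(group1, group2):.2f}")
```

With these invented scores, the means are 3 and 5, both variances are 2.5, and the denominator works out to 1, so t comes to -2.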

All of this information is meant to demonstrate how the parametric t-test is based on means and standard deviations. If your data do not conform to a normal distribution, so that the mean is not the best estimate of a group of scores, this test will give you a result that is off-base by some unknown amount.

Next, let us see how the Mann-Whitney U test addresses the problem. Instead of a t value, the Mann-Whitney U test gives a U value. How each of these values is or is not statistically significant is a discussion for another time. To get the U statistic, instead of using means, we use ranks: we order the scores from both groups together from smallest to largest, keeping track of which group each score came from, then add up the ranks for the scores from each group. If group one had both of the lowest scores, that would be a 1 and a 2 going into the sum for group one. If both groups have scores that are identical, their rank in the order is a tie. Let us imagine that each group has one score that ties for the smallest when all the scores are ranked. In that case, each of those scores gets assigned a rank of 1.5, the average of ranks 1 and 2, since together they fill the rank one and rank two spots.
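The ranking step, including the handling of ties, can be sketched as follows. This is a minimal illustration rather than a reference implementation: the helper name `average_ranks` and the scores are hypothetical, and tied scores are given the average of the rank positions they occupy, as described above.

```python
def average_ranks(values):
    """Return ranks (1 = smallest); tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + 1 + end + 1) / 2  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

# Hypothetical scores: each group has a 3, and those two scores tie for smallest.
group1 = [3, 5, 8]
group2 = [3, 6, 9, 10]

ranks = average_ranks(group1 + group2)   # rank all scores together
r1 = sum(ranks[:len(group1)])            # sum of ranks for group one
r2 = sum(ranks[len(group1):])            # sum of ranks for group two

print(ranks)
print(f"R1 = {r1}, R2 = {r2}")
```

Here the two tied scores each receive a rank of 1.5, and the rank sums come to 9.5 for group one and 18.5 for group two.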

When you have ranked all the scores and calculated a sum of ranks for each of the two groups, you plug those numbers, along with the number of scores in each group, into the following formulas:
$\begin{matrix} U_{1} = n_{1} n_{2} + \frac{n_{1} (n_{1} + 1)}{2} - R_{1} \\ U_{2} = n_{1} n_{2} + \frac{n_{2} (n_{2} + 1)}{2} - R_{2} \end{matrix}$
These formulas, in which R_1 and R_2 are the sums of ranks for group one and group two, give a U value for group one and another for group two. You take the smaller of the two U values, and that is your result.
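As a brief sketch of this last step, the two formulas translate directly into Python. The function name `mann_whitney_u` and the group sizes and rank sums passed to it are hypothetical values chosen for illustration (a group of 3 scores with a rank sum of 9.5 and a group of 4 scores with a rank sum of 18.5).

```python
def mann_whitney_u(n1, n2, r1, r2):
    """Return (U1, U2, U) from the group sizes and the two sums of ranks."""
    total = n1 + n2
    # Sanity check: the ranks 1..N always add up to N(N + 1)/2.
    if abs((r1 + r2) - total * (total + 1) / 2) > 1e-9:
        raise ValueError("rank sums do not match the group sizes")
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return u1, u2, min(u1, u2)  # the smaller U value is the result

# Hypothetical inputs, not taken from the article.
u1, u2, u = mann_whitney_u(n1=3, n2=4, r1=9.5, r2=18.5)
print(f"U1 = {u1}, U2 = {u2}, U = {u}")
```

For these inputs, U1 is 8.5 and U2 is 3.5, so the result is 3.5. A handy check on the arithmetic is that U1 and U2 always add up to the product of the two group sizes, here 3 × 4 = 12.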

When the U value is compared to a table of critical U values, taking into consideration the group sizes, it tells us whether our results are statistically significant or not, much as one would do when comparing the t value from the t-test to a different table of t values. In the end, the two sets of scores from two separate groups of women can be compared using two analogous statistical tests: one based on means and standard deviations, and one based on ranks of scores.