Abstract

There are plenty of figures and diagrams out there that offer ways to categorize statistical tests and help experimenters choose the test that best fits their data. In this Statistical Sidebar, I will reference some of those relationships, but I also hope to illustrate why those relationships are as they are. To a large degree, the range of available statistical tests comes from the fact that how the independent and dependent variables are coded (and how many of each there are) affects the mathematical equations needed to draw inferences about the relationships among the variables. When I refer to “how variables are coded,” I mean whether the variables are categorical or continuous. One could argue for a further category of ordinal variables, but I will keep things as simple as I can for now. Categorical data fall into named bins, like months of the year or political parties or income deciles. Continuous data can (potentially) take any value between two numbers, like scores on a mathematics test or how high a person can jump. Counts, like the number of people who can fit in a telephone booth, are strictly speaking discrete, but they are often treated as continuous for the purposes of choosing a test.
I would like to start explaining the relationships between statistical tests by looking at a case in which there is one continuous independent variable and one continuous dependent variable. With this starting point, many of the other major statistical tests can be seen as special cases or variations. In this particular case, simple linear regression would be used. If one graphs the relationship between these two continuous variables, the result is a scatterplot with a straight regression line drawn through the scattered dots where it fits them best. The statistical test simply performs the mathematics to describe that best fit and to determine how likely a relationship that strong would be to arise by chance. From this starting point, researchers could complicate the analysis design a bit by having more than one independent variable, in which case they would use multiple linear regression and add the independent variables one at a time to see how each affected the prediction equation. (It should be noted that there are different ways to accomplish this task, but I am keeping things simple in this sidebar.) There is also a whole class of statistical testing above regression called hierarchical linear modeling (HLM). Without getting into specifics, HLM can be thought of as having a series of regressions, with each one treated as a separate variable in a larger statistical test.
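To make the idea of a "best fit" concrete, here is a minimal sketch in Python of the least-squares arithmetic behind simple linear regression. The function name `fit_line` and the hours-and-scores data are made up for illustration, and the sketch covers only the line itself, not the significance test that a statistics package would also report.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The slope is the co-variation of x and y divided by the variation in x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: hours studied (independent) and test score (dependent).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 55.0, 61.0, 64.0, 68.0]
slope, intercept = fit_line(hours, scores)
print(f"predicted score = {intercept:.1f} + {slope:.1f} * hours")
```

With these invented numbers, each additional hour of study corresponds to a predicted gain of about 4.1 points.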
Let me return to the starting place of linear regression. If researchers keep a single continuous dependent variable but make the independent variable categorical, they can simplify the mathematical equations, and the result becomes an analysis of variance (ANOVA). There are different types of ANOVAs, but the intention of this Statistical Sidebar is to provide a broad overview of this process. Imagine the regression graph I mentioned before. If one plotted the same data, but the independent variable had only three groups or categories, then the regression line might be the same, but all of the data points would line up vertically along three lines representing the three independent variable groups. Dropping the number of categories down to two, one can simplify the mathematics even more, and the result is a t-test. It is possible to conduct an ANOVA with two groups, which might be helpful, depending on the data, but the point of this sidebar is to illustrate the general relationship between different tests. The t-test can be thought of as a simplified version of an ANOVA for the case in which the independent variable has only two groups.
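The link between these tests can be checked with a little arithmetic. The sketch below, in Python, uses two small invented groups and assumes the classic equal-variance (pooled) t-test. It computes the t statistic, the one-way ANOVA F statistic, and the regression slope when group membership is coded 0 or 1. All function names and data are illustrative assumptions.

```python
def mean(values):
    return sum(values) / len(values)


def pooled_t(group_a, group_b):
    """Student's t statistic for two independent groups (equal variances assumed)."""
    na, nb = len(group_a), len(group_b)
    ma, mb = mean(group_a), mean(group_b)
    ss_a = sum((v - ma) ** 2 for v in group_a)
    ss_b = sum((v - mb) ** 2 for v in group_b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (ma - mb) / (pooled_var * (1 / na + 1 / nb)) ** 0.5


def one_way_f(groups):
    """F statistic for a one-way ANOVA over any number of groups."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def dummy_slope(group_a, group_b):
    """Least-squares slope when membership is coded 0 (group_a) or 1 (group_b)."""
    x = [0.0] * len(group_a) + [1.0] * len(group_b)
    y = list(group_a) + list(group_b)
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical scores for two groups.
control = [4.0, 5.0, 6.0, 5.0]
treated = [7.0, 8.0, 6.0, 7.0]

t = pooled_t(control, treated)
f = one_way_f([control, treated])
slope = dummy_slope(control, treated)
print(t ** 2, f, slope)
```

For two groups, the squared t statistic matches the ANOVA F statistic, and the regression slope is simply the difference between the two group means. That is one way to see the t-test as a two-group special case of the tests above it.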
If one imagines a hierarchy of statistical tests in which HLM is at the top, linear regression is below it, then ANOVA, and finally t-tests at the bottom, one starts to realize that they are all related, but that their calculations become simpler as one moves down the hierarchy and as the dataset becomes less complex. From this hierarchy, one could branch left or right to include other groups of statistical tests. On one side would be situations where one increases the number of dependent variables. For regression, if a researcher added dependent variables, he or she would use multivariate linear regression, which takes into consideration the interplay between the predicted variables, as well as the variables predicting them. For ANOVA, the tests become MANOVAs when dependent variables are added; the “M” stands for “multivariate.” On the other side, if the dependent variable were categorical instead of continuous, the researcher would move to other related tests. With a continuous independent variable, one would use binomial logistic regression (multinomial if the dependent variable had more than two categories), which would replace most of the regression and ANOVA approaches used with a continuous dependent variable. And finally, if both the independent and dependent variables were categorical, one would use a chi-square test (a non-parametric test of association between categorical variables).
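For the case with categorical variables on both sides, the data reduce to a table of counts. As a rough sketch, the Python below computes the Pearson chi-square statistic for such a table by comparing each observed count with the count expected if the two variables were unrelated. The function name and the counts are invented for illustration, and turning the statistic into a p-value would still require a chi-square distribution table or a statistics package.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical counts: rows are two groups, columns are two outcomes.
counts = [[10, 20],
          [20, 10]]
stat, dof = chi_square_statistic(counts)
print(f"chi-square = {stat:.2f} with {dof} degree(s) of freedom")
```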
It goes without saying that there are many other special cases of statistical tests, special situations, and different ways of approaching data analysis, but the description in this sidebar is intended to give readers a general idea of how the major classes of statistical tests relate to one another. In the vertical hierarchy of tests with a continuous dependent variable, there is a mathematical link, where each step down the hierarchy can be thought of as a special case or simplified version of the approach just above it. Then, adding dependent variables or changing the independent or dependent variables from continuous to categorical moves you into other special cases. There is no one-size-fits-all approach to statistical analysis, and I hope that this ranking of approaches helps readers as they think about ways in which to examine the data they have collected.
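The relationships described in this sidebar can also be written down as a small lookup. The Python sketch below is only a restatement of the simplified mapping given here, not a complete decision procedure. The function and parameter names are hypothetical, and it leaves out HLM and the many special cases mentioned above.

```python
def suggest_test(iv_type, dv_type, n_ivs=1, n_dvs=1, iv_groups=2, dv_categories=2):
    """Return the test family this sidebar associates with a design.

    iv_type and dv_type are "continuous" or "categorical". All parameter
    names are illustrative assumptions, not standard terminology.
    """
    valid = ("continuous", "categorical")
    if iv_type not in valid or dv_type not in valid:
        raise ValueError("variable types must be 'continuous' or 'categorical'")

    if dv_type == "continuous":
        if iv_type == "continuous":
            if n_dvs > 1:
                return "multivariate linear regression"
            if n_ivs > 1:
                return "multiple linear regression"
            return "simple linear regression"
        # Categorical independent variable, continuous dependent variable.
        if n_dvs > 1:
            return "MANOVA"
        if iv_groups == 2:
            return "t-test"
        return "ANOVA"

    # Categorical dependent variable.
    if iv_type == "continuous":
        if dv_categories > 2:
            return "multinomial logistic regression"
        return "binomial logistic regression"
    return "chi-square test"


print(suggest_test("categorical", "continuous", iv_groups=3))
```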
