Abstract
It is widely believed that regression models for binary responses are problematic if we want to compare estimated coefficients from models for different groups or with different explanatory variables. This concern has two forms. The first arises if the binary model is treated as an estimate of a model for an unobserved continuous response and the second when models are compared between groups that have different distributions of other causes of the binary response. We argue that these concerns are usually misplaced. The first of them is only relevant if the unobserved continuous response is really the subject of substantive interest. If it is, the problem should be addressed through better measurement of this response. The second concern refers to a situation which is unavoidable but unproblematic, in that causal effects and descriptive associations are inherently group dependent and can be compared as long as they are correctly estimated.
Introduction
This article is about the interpretation of binary response models when making group comparisons. It is widely believed that there is a problem in making such interpretations. This is expressed by Allison (1999:187) in the following terms: “Differences in the degree of residual variation across groups can produce apparent differences in coefficients that are not indicative of true differences in causal effects.” The implication of this is that when we apply standard methods to estimate the effects of treatments on binary outcomes, comparisons of the relative sizes of effects between different groups or between analyses with different explanatory variables can be misleading.
These sound like serious problems, and Allison’s paper and others that have taken up and built upon his message have been influential. Data from Google Scholar show that by February 28, 2017, Allison (1999), the earliest key contribution, had received 742 citations, Williams (2009) 262, and Mood (2010) a remarkable 1,053. Scholars not only cite these papers but also deploy their arguments to qualify the interpretation of their own data and to criticize the work of others. Platt (2009), for instance, in a review of Heath and Cheung (2007) takes the editors to task for the conclusions they allow to be drawn from intergroup comparisons using logit models, saying: “…this would anyway fail to take account of the fact that such comparison involves assumptions about the equality of residual variances across models.” Marks (2014:175), taking his lead from Mood (2010), tells his readers that a particular technical point “…is important because it undermines conclusions from studies that have used logistic regressions that do not include relevant unobservables.” Sikora (2015:273) tells us that in her study: “To avoid problems inherent in comparing logit coefficients or odds ratios between groups…the key findings are presented as predicted probabilities that are supplemented with tabulated relative risk ratios…,” and Kleykamp (2013:847) warns us before she reveals her results that “…group comparisons expressed through interactions are problematic in non-linear models because such models cannot distinguish group coefficient differences from group differences in residual variation or unobserved heterogeneity.”
Clearly, it is now widely believed that there is a problem with using binary response regressions to make group comparisons and that the parameters routinely estimated in such endeavors (for instance, odds ratios, log odds ratios, and other quantities related to them) are not to be trusted. If this were true, the problem would indeed be even more dramatic than is typically acknowledged. Consider, for example, the pair of hypothetical studies summarized in Table 1. Here, we have a binary response variable
Table 1. Hypothetical data from two randomized experiments, with two distinct groups of participants (labeled groups A and B).
Note: The participants have been assigned to one of two experimental conditions, indicated by a binary variable
In this article, we argue that this is not the case and that the so-called problem is not one that need concern most empirical researchers who wish to make group comparisons. Our view is that a lack of clarity about which target quantities should be estimated in particular empirical enquiries has led many researchers to draw the wrong conclusions from the literature we consider.
To avoid misunderstanding, we should say that our argument is not about the technical correctness of various authors’ exposition of how binary response models work. Their mathematics is correct. Our point is that some of the implications they go on to draw have, at the least, been misunderstood, and in ways which suggest that sociologists do not always think as hard as they should about which estimation target is most relevant for the substantive question they are trying to answer. Thinking clearly about what it is that is estimated in a binary response model should lead one to conclude that the problem of group comparisons is largely chimerical and that any remaining difficulties arise from expecting these techniques to do things they were never designed to do in the first place.
There are two distinct versions of the group comparison problem, which arise from two different meanings of “unobserved heterogeneity.” The first version is that comparative conclusions about effects which are estimated for a binary outcome can be wrong if we want to treat them as estimates of effects for an unobserved continuous outcome which is supposed to have been measured only by the binary variable. The second is that, even for a binary outcome, estimated effects of a treatment are not comparable between groups because the individuals in different groups have different distributions of other predictors of the outcome. These two versions of the issue have not always been clearly separated in the literature. For example, Allison (1999) focuses on the first version and Mood (2010) to a larger extent also on the second, but both draw on both versions for motivation of their discussions.
We argue that the first version of the group comparison problem only exists if we are genuinely interested in the unobserved continuous variable and that, if we are, the problem should be resolved by more serious efforts to measure this variable. The second version arises because estimable causal (as well as descriptive) effects are unavoidably group dependent—but this is not a problem or an error but an inherent part of what such effects mean. Some of these points have previously been made, although in somewhat different language, by Rohwer (2012) and Buis (2017).
In the rest of this article, we discuss the first version of the group comparison issue in the third section and the second version in the fourth section. In preparation for these two main sections, it is first necessary to clearly define what we mean by regression coefficients and their interpretation. This is done in the second section, and concluding comments are given in the fifth section.
Interpretation of Regression Coefficients
Introduction
In this section, we describe the interpretation of coefficients in some common regression models, to the extent needed in later sections. In the Linear Regression subsection, we begin with linear regression models, which serve as a point of reference for the binary models. Logit and other models for binary outcomes are discussed in the Logit Models for Binary Response Variables subsection. The latent-variable motivation of binary response models, which is central to the version of group comparisons considered in the third section, is then described separately in the Latent-variable Motivation of Models for Binary Responses subsection.
We will throughout focus on the simplest situations where the questions can be explained, omitting extraneous complications and variations. Firstly, we will consider the interpretation of regression coefficients as causal effects. This is closest to the spirit in which the question of group comparisons has been discussed in the literature. An alternative interpretation of the coefficients would be as descriptive measures of associations, in a sample or in a finite population. With appropriate modifications, parallel versions of all of our conclusions apply also to such descriptive interpretations.
We begin by considering models with only one explanatory variable, because the issues that are discussed in the third section can be described already in this context. Additional explanatory variables are relevant to the questions discussed in the fourth section, so they will be introduced there. We take the explanatory variables to be binary because this makes the interpretations particularly straightforward, but all of the conclusions apply also to models with continuous explanatory variables (effects of a continuous variable
For the moment, we thus consider a response variable
Since the issues that we discuss are about the meaning and interpretation of model parameters, we need not concern ourselves with details about how these parameters are estimated. For our purposes, it is sufficient to note that regression coefficients with causal interpretations can be validly estimated if the observed data are appropriate for this purpose. In particular, this is the case if the values of
Linear Regression
Suppose that
for
When
where
But what is it that
Suppose that we regard the
where
Under sufficiently strong conditions for the observed data, model (1) can be treated as a representation of the distributions of the potential outcomes and used to estimate the effect (3). Here,
Logit Models for Binary Response Variables
Suppose now that the response variable
for
where
The regression coefficient
the log odds ratio between
Latent-variable Motivation of Models for Binary Responses
The model formulation that will give rise to the first version of the group comparison problem is not a linear or a logit model on its own but in a sense a combination of them. This is the latent-response formulation of the logit model, interpreted as a linear model.
Let
where
for every
which also implies that
What implications does this reasoning have for the interpretation of the regression coefficients? Very often, none at all. This is the case whenever the binary
In some applications, however, the latent
This is a useful and powerful result, when it holds, but it does come with some cost. First, the regression results can only be given on a standardized scale where the residual variance of
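The scale dependence of the latent-variable formulation can be illustrated numerically. The following Python sketch is only an illustration under stated assumptions: the latent response is taken to be alpha + beta * x + sigma * eps with a standard logistic eps, the observed binary response is 1 when the latent response is positive, and the names `alpha`, `beta`, `sigma`, and the numerical values are hypothetical choices rather than notation or results from this article. Under these assumptions, the logit coefficient of the binary model equals beta / sigma, so only the ratio is recovered and not beta and sigma separately.

```python
import math


def logistic_cdf(t: float) -> float:
    """Distribution function of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-t))


def prob_y1(x: int, alpha: float, beta: float, sigma: float) -> float:
    """P(Y = 1 | x) when Y = 1 if alpha + beta*x + sigma*eps > 0.

    eps is assumed standard logistic, so the probability is
    F((alpha + beta*x) / sigma), with F the logistic distribution function.
    """
    return logistic_cdf((alpha + beta * x) / sigma)


def log_odds(p: float) -> float:
    return math.log(p / (1.0 - p))


def logit_coefficient(alpha: float, beta: float, sigma: float) -> float:
    """Log odds ratio of Y between x = 1 and x = 0 implied by the latent model."""
    return log_odds(prob_y1(1, alpha, beta, sigma)) - log_odds(
        prob_y1(0, alpha, beta, sigma)
    )


# Hypothetical parameter values, chosen only for illustration.
alpha, beta, sigma = -0.5, 1.2, 2.0
coef = logit_coefficient(alpha, beta, sigma)
print(coef, beta / sigma)
```

Changing `sigma` while holding `beta` fixed changes `coef`, which is the sense in which the binary model reports effects on a scale standardized by the residual standard deviation.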
Group Comparisons for a Latent Continuous Response
Definition of the Problem
Suppose now that we want to compare regression results between two groups of units (comparisons across three or more groups add no new issues). We consider the same type of regression model for both groups and denote the coefficients of an explanatory variable
We assume that the data satisfy the necessary assumptions to allow for valid estimation of the coefficients within each group and that
This condition is satisfied if the model of interest is for a response variable
The situation is more difficult if we are in the setting of the Latent-variable Motivation of Models for Binary Responses subsection, that is, if we want to interpret coefficients from a model for a binary
in groups
Here, however, we observe not
The coefficient of
In the hypothetical studies in Table 1, we concluded that the effect of
Expressed in terms of the potential outcomes for
for data on
where
Solutions to the Group Comparison Problem
What, then, can be done to resolve this kind of group comparison problem? In this section, we describe four types of solutions: (1) conclude that for your research question the problem does not exist, (2) reparametrize the model to change which parameters can and cannot be identified, (3) choose to report quantities that are comparable across groups, or (4) improve the measurement of the response variable. We argue that (1) and (4) should be the most important solutions, even though the literature on this topic has focused on solutions (2) and (3).
The problem in the Definition of the Problem subsection applies to comparisons of models for the continuous response If it is seriously believed that there is some physical property more or less stably characterizing each organism [
Here, Berkson was discussing applications in bioassays where it might not seem very implausible to grant an independent existence to such latent tolerances. In our view, the situation is even clearer in the social sciences where convincing and necessary examples of them are rarer still.
To substantiate this claim, we examined a sample of 100 articles that cited the literature on the group comparison question, specifically the widely cited article by Mood (2010). They are primarily from journals in the fields of sociology, political science, education, public health, social psychology, social policy, criminology, and demography. Of the 100 papers, 84 had a response variable that at the analysis stage was treated as a dichotomy, and these form the sample we discuss below. In terms of the contexts in which the group comparison literature is referred to, the sample includes two coherent nonmutually exclusive sets of interest to us. The first (30 percent) contains discussions of group comparisons (in the spirit of this section, and/or The Value of an Effect is Group dependent subsection below) and the second (45 percent) discusses comparisons of models with different explanatory variables (also bringing in issues from the Models With Different Covariates subsection). A third, extremely heterogeneous set (32 percent), concerns itself with neither of these but with other questions that are not the central focus of this article.
None of the articles made an explicit conceptual distinction between the observed response
Suppose now that we conclude, nevertheless, that models for a latent
All of these models with the same number of identification constraints are observationally equivalent, so they cannot be compared to each other in terms of goodness of fit. They can, however, give very different estimates of the parameters of interest. If models (10) and (11) hold, the estimate of the coefficient
If the model includes several explanatory variables, it is also possible to impose constraints on the coefficients of one or more of them while leaving both the rest of the coefficients and the residual variances separately estimable. This can also be done for several explanatory variables at once, for example assuming that the coefficients of all control variables are equal across groups but the coefficient of main interest and the residual standard deviation need not be. In this case, the model will include more than the minimum number of parameter constraints, and it will be possible to assess the appropriateness of some of the remaining constraints by testing them against less restrictive models. In the article where he introduced the group comparisons question, Allison (1999) proposed methods along these lines as solutions to the problem. However, any such comparison is still conditional on a specific set of assumed constraints, and we could always consider alternative ones that are equivalent in terms of fit but can produce very different estimates for the parameters of interest (similar comments are made by Williams 2009 based on a simulation study). In other words, all that such specifications can really do is to shift the assumptions of an inherently poorly identified model from one part of the model to another. This will not solve the group comparisons problem unless we are entirely convinced that a particular parameter constraint is substantively correct.
A different approach that has been proposed for obtaining meaningful group comparisons of models for a latent
This approach works well when describing and comparing standardized associations between
The root cause of the group comparison problem as formulated in the Definition of the Problem section—and of other difficulties with estimating models for
Substantively motivated latent variables are of course common in many social science applications. A general and more flexible measurement strategy for them is also familiar: use multiple observed indicators that are all regarded as imperfect measures of the latent variable, and define and estimate latent variable measurement models that represent this situation (see e.g., Bartholomew, Knott, and Moustaki 2011, for an overview of such models). In particular, measurement by multiple binary indicators
This is what we would recommend if models for a latent continuous
Group Comparisons for a Binary Response
The Value of an Effect is Group dependent
As discussed in the Introduction section, the group comparison problem has typically been discussed in terms of “unobserved heterogeneity.” In the Group Comparisons for a Latent Continuous Response section, we considered one version of this issue where the heterogeneity refers to the residual variability of a latent
As explained in the Linear Regression and Logit Models for Binary Response Variables subsections, a unit-level causal effect is defined by a comparison of the potential outcomes
Recall that regression coefficients estimate average causal effects aggregated over the unit-level effects in a specific group (population). For example, the coefficient of a logistic model estimates the population log odds ratio
For an illustration of this group dependence of effects, consider the hypothetical situation in Table 2. Here, we have two groups, each with 600 individuals. The upper part of the table shows the numbers of individuals with different potential outcomes
Table 2. A hypothetical example of two groups, each with 600 people.
Note: The upper part of the table shows the numbers of these people with different values of the potential outcomes
Coefficients in binary (and any other) regression models thus estimate effects that are group-specific quantities. In other words, the
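The group dependence of average effects can be made concrete with a short calculation from counts of potential outcomes. The counts below are invented for illustration and are not the values in Table 2. Each person is classified by a pair of potential outcomes (outcome without treatment, outcome with treatment), and the function name `average_effects` is likewise a hypothetical helper. The sketch computes the average risk difference and the population log odds ratio in each group.

```python
import math

# Hypothetical counts of people by potential outcomes (Y(0), Y(1)).
# Invented for illustration; these are NOT the numbers in Table 2.
groups = {
    "A": {(0, 0): 200, (0, 1): 200, (1, 0): 0, (1, 1): 200},
    "B": {(0, 0): 100, (0, 1): 100, (1, 0): 0, (1, 1): 400},
}


def average_effects(counts: dict) -> dict:
    """Average causal effects in one group from its potential-outcome counts."""
    n = sum(counts.values())
    # Proportions with outcome 1 if nobody / everybody were treated.
    p0 = sum(c for (y0, _), c in counts.items() if y0 == 1) / n
    p1 = sum(c for (_, y1), c in counts.items() if y1 == 1) / n
    return {
        "p0": p0,
        "p1": p1,
        "risk_difference": p1 - p0,
        "log_odds_ratio": math.log(p1 / (1 - p1)) - math.log(p0 / (1 - p0)),
    }


effects = {group: average_effects(counts) for group, counts in groups.items()}
for group, result in effects.items():
    print(group, result)
```

In this made-up example the treatment changes the outcome for a third of group A but only a sixth of group B, so both the risk difference and the log odds ratio differ between the groups. Neither set of values is wrong. Each is the correct average effect for its own group.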
Models With Different Covariates
Causal effects are thus group dependent because individuals are heterogeneous in their responses to any treatment, and because groups are heterogeneous collections of individuals. Some of this heterogeneity is likely to be due to differences in other observable characteristics of the individuals that have causal effects of their own on the response. Such other characteristics, which we denote here by
What we consider here is how average causal effects of
Our overall conclusions on this question follow from the discussion in The Value of an Effect is Group dependent subsection. Models with different covariates
The literature on group comparisons is in large part fairly unclear on this topic. Much of the discussion in it raises as problems issues that only arise if we implicitly or explicitly think that estimates of effects for one group should also apply to other groups. To explain this, we may ask the following questions: “What, if anything, can estimates from a model for
It is sufficient to consider just one additional explanatory variable
We will contrast it with the following model which also includes
Some of the discussion compares logit models with linear regression models. This can be done even with linear models for the same binary
corresponding to equations (15) and (16), respectively (the fact that the standard assumptions of a linear model are not fully appropriate for a binary response can be ignored for the questions considered here). We illustrate the discussion with reference to the example in Table 2, the lower part of which further separates the numbers of people in each of the two groups by gender (
Suppose, first, that we fit the models (15) and (17) that include only
In general, the answer to this question is obviously no. This is because there is only one overall effect of
Suppose now that the interaction is, nevertheless, absent. Imagine first that this is the case for the risk difference, which thus has the same value
A similar conclusion does not hold for the log odds ratio. Suppose that the interaction is absent on this scale, so that the log odds ratio is
This result is most often introduced in the context of models with a continuous
Suppose now that we want to move in the other direction, that is, use estimates from a model that controls for
Recall first that if there is in fact no interaction for the risk difference, the linear models (17) and (18) will be estimating the same true effect
The special feature of this situation for the linear model becomes apparent when the assumption of no interaction does not in fact hold in the population. The no-interaction model (18) is then wrong, and the
where
There is no similar result for logit models, but neither do we need one. If we do want to aggregate up from quantities that are conditional on
and
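Aggregating from conditional to marginal quantities can be sketched as follows. This is an illustration under assumptions that are not taken from the article: a logit model with a binary treatment x and one binary covariate z, no interaction on the log odds scale, hypothetical coefficients `a`, `b`, `c`, and a covariate distribution `p_z1` that is the same under both treatment conditions, as it would be in a randomized experiment. Marginal probabilities are obtained by averaging the conditional probabilities over the distribution of z, and effects are then computed from the marginal probabilities.

```python
import math


def expit(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def log_odds(p: float) -> float:
    return math.log(p / (1.0 - p))


# Hypothetical logit model: log odds of Y = 1 is a + b*x + c*z (no interaction).
a, b, c = -1.0, 1.0, 2.0
p_z1 = 0.5  # assumed P(z = 1), identical in both treatment conditions


def cond_prob(x: int, z: int) -> float:
    """P(Y = 1 | x, z) under the hypothetical model."""
    return expit(a + b * x + c * z)


def marginal_prob(x: int) -> float:
    """P(Y = 1 | x), averaging the conditional probabilities over z."""
    return (1 - p_z1) * cond_prob(x, 0) + p_z1 * cond_prob(x, 1)


conditional_log_or = {
    z: log_odds(cond_prob(1, z)) - log_odds(cond_prob(0, z)) for z in (0, 1)
}
marginal_log_or = log_odds(marginal_prob(1)) - log_odds(marginal_prob(0))

conditional_rd = {z: cond_prob(1, z) - cond_prob(0, z) for z in (0, 1)}
marginal_rd = marginal_prob(1) - marginal_prob(0)
average_conditional_rd = (1 - p_z1) * conditional_rd[0] + p_z1 * conditional_rd[1]

print(conditional_log_or, marginal_log_or)
print(marginal_rd, average_conditional_rd)
```

With these hypothetical values the conditional log odds ratio is the same at both levels of z, yet the marginal log odds ratio is smaller, whereas the marginal risk difference is exactly the weighted average of the conditional risk differences. Both the conditional and the marginal log odds ratios are valid; they describe effects in different populations.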
Conclusions
When researchers use regression models for binary outcomes, they should first make sure to be clear about what the target quantities of their analysis are. In this article, we have argued that when this is done, they will in most cases be able to conclude that comparisons of estimates from such models between different groups or between different models pose no fundamental problems or at least not the kinds of problems that have been raised in the literature on this question.
Of the two kinds of group comparison problems that have been discussed in the literature, one is expressed in terms of a hypothetical continuous latent variable
The second form of the proposed group comparison problem does not involve latent outcomes. It arises instead when individuals are heterogeneous in their responses to the treatment of interest and differently heterogeneous in different groups. We have argued that this type of heterogeneity is not a problem but an unavoidable fact, so that the kinds of average causal effects that we can hope to estimate are inherently group dependent. Bearing this in mind, such effects can be compared between groups (populations) as long as they are correctly estimated. The researchers’ aim should then be to be clear about what their target populations are and to fit models that estimate effects in those populations.
While these should be reassuring conclusions, they of course do not mean that it will be easy to estimate causal effects (or population associations either), in one group or several. The real problems with doing this are the ones that were not discussed in this article and that we assumed away at the start of it, namely, ensuring that the research design, measurement, and analysis are sufficiently powerful to allow valid conclusions to be drawn from a study. The true challenges of methodology lie there.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
