Abstract
In regression analysis with a continuous and positive dependent variable, a multiplicative relationship between the unlogged dependent variable and the independent variables is often specified. It can then be estimated on its unlogged or logged form. The two procedures may yield major differences in estimates, even opposite signs. The reason is that estimation on the unlogged form yields coefficients for the relative arithmetic mean of the unlogged dependent variable, whereas estimation on the logged form gives coefficients for the relative geometric mean of the unlogged dependent variable (or for absolute differences in the arithmetic mean of the logged dependent variable). Estimated coefficients from the two forms may therefore vary widely, because of their different foci, relative arithmetic versus relative geometric means. The first goal of this article is to explain why major divergences in coefficients can occur. Although well understood in the statistical literature, this is not widely understood in sociological research, and it is hence of significant practical interest. The second goal is to derive conditions under which divergences will not occur, where estimation on the logged form will give unbiased estimators for relative arithmetic means. First, the article derives the necessary and sufficient conditions for when estimation on the logged form will give unbiased estimators for the parameters for the relative arithmetic mean. This requires not only that there is arithmetic mean independence of the unlogged error term but that there is also geometric mean independence. Second, it shows that statistical independence of the error terms on regressors implies that there is both arithmetic and geometric mean independence for the error terms, and it is hence a sufficient condition for absence of bias. Third, it shows that although statistical independence is a sufficient condition, it is not a necessary one for lack of bias. Fourth, it demonstrates that homoskedasticity of error terms is neither a necessary nor a sufficient condition for absence of bias. Fifth, it shows that in the semi-logarithmic specification, for a logged error term with the same qualitative distributional shape at each value of independent variables (e.g., normal), arithmetic mean independence, but heteroskedasticity, estimation on the logged form will give biased estimators for the parameters for the arithmetic mean (whereas with homoskedasticity, and for this case thus statistical independence, estimators are unbiased, from the second result above).
1. Introduction
In regression analysis with a continuous and positive dependent variable, a multiplicative model for the relationship between the dependent and independent variables is often used. This is in part done when the relationship is thought to be multiplicative. It is also in part done because the coefficients then have interpretations as relative differences, making reporting of results and comparison across studies easier.
For such a multiplicative model, one has the choice of estimating it on its unlogged versus logged form, that is, regressing the unlogged value of the dependent variable on an exponential function of the independent variables versus regressing the logged dependent variable on a linear function of the same variables (usually referred to as the semi-logarithmic specification). Although the underlying relationship is multiplicative in both cases, and mathematically they are indistinguishable, the form in which the model is estimated can have a major impact on the estimated coefficients, such as a negative coefficient for a variable in one form but a positive coefficient in the other, leading to opposite substantive conclusions. The choice of form on which to estimate is thus an issue of significant practical interest.
The first goal of this article is to articulate why such large discrepancies may arise. The reason is that estimation on the unlogged form yields coefficients for the relative arithmetic mean of the unlogged dependent variable, whereas estimation on the logged form gives coefficients for the relative geometric mean of the same unlogged dependent variable (corresponding to absolute differences in the arithmetic mean of the logged dependent variable). 1 Estimated coefficients may therefore vary widely between the two forms because of their different foci, relative arithmetic versus relative geometric means. The phenomenon is well understood among statisticians but not widely recognized in sociology. 2
The second goal is to derive the conditions under which estimation on the logged form (for the parameters for the relative geometric mean) gives biased versus unbiased estimators of the parameters for the relative arithmetic mean. This has been addressed in two important and related literatures in economics focusing on the role of heteroskedasticity of the error term for bias (Manning and Mullahy 2001; Santos Silva and Tenreyro 2006).
This article provides a complement to these literatures by adopting a different focus. It considers the fact that entirely different aspects of the conditional distribution of the dependent variable are estimated in the two versions: the arithmetic versus the geometric mean. This focus carries over to the error terms, too. The investigation relies on a set of relationships between the two specifications and between their unlogged and logged error terms. Five results flow from these relationships: (1) I derive the necessary and sufficient conditions for when estimation on the logged dependent variable will give unbiased estimators for the parameters for the relative arithmetic mean; this requires not only arithmetic mean independence of the unlogged error term on regressors but also its geometric mean independence. (2) I show that statistical independence of the error terms on regressors, which implies their homoskedasticity, also implies that there is both arithmetic and geometric mean independence for the error terms, and is hence a sufficient condition for lack of bias. (3) I show that statistical independence, although sufficient, is not a necessary condition for absence of bias. (4) I show that homoskedasticity is neither a necessary nor a sufficient condition for absence of bias; in fact, heteroskedasticity may sometimes be required in order to achieve unbiased estimators. (5) I show that in the semi-logarithmic specification, for a logged error term with the same qualitative distributional shape at each value of independent variables (e.g., normal), arithmetic mean independence, but heteroskedasticity, estimation on the logged form will give biased estimators for the parameters for the arithmetic mean (whereas with homoskedasticity, and for this case thus statistical independence, estimators are unbiased, from the second result above).
How often do results from analyses based on arithmetic versus geometric means diverge in major ways? This is difficult to assess, because few authors report results for both measures. The simulations and analyses of real data reported in the economics literature cited above indicate that divergences can be major. In substantive investigations, there are several comparisons between the unlogged linear and the semi-logarithmic (logged multiplicative) specification. Hannon and Knapp (2003) discussed how the semi-logarithmic specification may give rise to biases. Portes and Zhou (1996) reported positive differences between self-employed and employed persons in the unlogged linear specification but negative differences in the semi-logarithmic specification. Their article is especially important as it explicitly addresses the major differences that may arise from considering geometric rather than arithmetic means (though without using the former term). For that reason, the real-life empirical example below uses a variation of their data and specifications. 3 Goodman’s companion article to my article (this volume, pp. 165–181) gives the exact conditions for when the size ordering of means for two groups will be the opposite for arithmetic and geometric means.
Before proceeding, two comments are in order. First, the task here is not to propose that the semi-logarithmic specification be abandoned wholesale. There can be reasons to retain this specification, especially related to its ease of estimation. But its focus on the geometric mean may make it less desirable, because this is rarely a quantity natural to report. 4 As will be argued, when feasible, best practice will therefore lead one to estimate the unlogged specification.
Second, I underline that I ask a limited question: Once a multiplicative model for the relationship between dependent and independent variables has been chosen, should the model be estimated on its unlogged versus its logged form? To this question a clear answer will be given. But several related issues are not addressed. What is the role of theory in choosing functional form? What is the role of transformations of dependent variables more generally (e.g., to make the conditional distribution of the dependent variable and hence the error term nearly normal or more symmetric; see Carroll and Ruppert 1988)? Can one use measures of goodness of fit to choose between various transformations, such as whether to specify a multiplicative model versus some other model? Each of these questions is important. But for present purposes they need not be answered; they have yet to be settled by statisticians or social scientists, and there is considerable disagreement on each of them.
The remainder of this article is organized as follows. Section 2 outlines the models discussed (Section 2.1), properties of the dependent variable (Section 2.2), properties of the error terms (Section 2.3), and relationships between error terms (Section 2.4). Section 3 gives empirical examples of the models, with a focus on the dependent variable. Section 4 focuses and elaborates on the properties of the error terms and specifies the conditions under which the estimators from the semi-logarithmic specification are unbiased for the coefficients of the relative arithmetic mean. Sections 5 to 8 provide discussions of the role of statistical independence of the error terms from the regressors and of heteroskedasticity versus homoskedasticity of the error terms. Section 9 concludes the article.
Although the key point about the difference in results from estimating on the unlogged versus the logged dependent variable may be obvious to the statistically sophisticated researcher, and would hence warrant a compact presentation, many researchers find it hard to grasp that large differences in estimates may arise. I have hence opted to approach this conundrum at length, with repetition and from multiple angles, in the hope of making clear why large differences may occur.
2. Specifications: Equations, Error Terms, and Their Relationships
2.1. The Equations
In addition to the unlogged and logged specifications referred to above, I specify an exponential model with an additive error term, and for completeness also the linear model for the unlogged dependent variable, but the latter will receive limited attention. The exponential-additive model has received attention in the methodological literature and has some advantages (e.g., see Blackburn 2007:83; Manning and Mullahy 2001:467; Santos Silva and Tenreyro 2006:644).
Let
Before elaborating on the specifications, it is useful to collect them in one place:
The four specifications are referred to as the exponential-multiplicative, the semi-logarithmic, the exponential-additive, and the linear. 5
With a focus on equations (2.1) to (2.3), the models differ in several ways, even though equations (2.1) and (2.2) are mathematically indistinguishable without further information. First, they differ in which aspect of
Table 1. Expressions for the Four Specifications: (2.1) Exponential-multiplicative, (2.2) Semi-logarithmic, (2.3) Exponential-additive, and (2.4) Linear
Note: Panel A: specifications for the dependent variable: (a) unlogged values and (b) logged values. Panel B: associated expressions for the dependent variable: (a) arithmetic mean for unlogged values, (b) arithmetic mean for logged values, (c) geometric mean for
This is not a relevant computation to make, because
This is not a relevant computation to make, because
The geometric mean of
The error term
The error term
In the two exponential specifications, estimation is done on the unlogged dependent variable, generally using maximum or quasi-maximum or pseudo-maximum likelihood methods or nonlinear least squares (see Wooldridge 2010:740–42). In the semi-logarithmic specification, estimation is done on the logged value, and in the linear on the unlogged value, in both cases using linear least squares.
2.2. Elaboration 1: Focus on Dependent Variable
Each of the four specifications is elaborated in several subequations. For notation, “E” and “G” denote the operators for the arithmetic and geometric means, respectively. For the exponential-multiplicative model the key quantities are
where I have yet to see equation (2.5c) specified. From the viewpoint of the exponential-multiplicative specification, with its focus on the relative arithmetic mean, it specifies the implied expression for the geometric mean. It will be used in Section 4.1, focusing on conditions for absence of bias when estimating on the logged dependent variable. The exponential-multiplicative specification in equation (2.5a) is a model for the arithmetic mean of the unlogged dependent variable.
For the semi-logarithmic specification the key quantities are
where equation (2.6a) is rarely specified. From the viewpoint of the semi-logarithmic specification, with its focus on the geometric mean of the unlogged dependent variable, it specifies the implied expression for the arithmetic mean. It will be used in Section 4.2, focusing on conditions for absence of bias when estimating on the logged dependent variable. The semi-logarithmic specification in equation (2.6c) is a model for the geometric mean of the unlogged dependent variable (corresponding to the arithmetic mean of the logged dependent variable, equation 2.6b). In Appendix A, I derive the geometric mean for an absolutely continuous variable.
For the exponential-additive and linear specifications, only the arithmetic mean of the unlogged dependent variable is of interest:
The results for the exponential-additive and linear specifications in equations (2.7a) and (2.8a) are straightforward. It is the comparison of the exponential-multiplicative and the semi-logarithmic specifications that demands attention.
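As a hedged sketch of that comparison, the two specifications and their implied means can be written in notation assumed here for illustration. The symbols $y$, $x$, $\beta$, $\beta^{*}$, $\eta$, and $\varepsilon$, and the normalization $E(\eta \mid x) = 1$, are assumptions and are not taken from the original equations. $E$ and $G$ are the arithmetic and geometric mean operators, with $G(y \mid x) = \exp\{E(\ln y \mid x)\}$.

```latex
% Assumed notation (not the article's own symbols):
% y: dependent variable, x: regressors, eta: unlogged error, varepsilon: logged error
\begin{align*}
\text{Exponential-multiplicative:}\quad
  y &= \exp(x\beta)\,\eta, & E(\eta \mid x) &= 1, \\
  E(y \mid x) &= \exp(x\beta), &
  G(y \mid x) &= \exp(x\beta)\,G(\eta \mid x); \\[4pt]
\text{Semi-logarithmic:}\quad
  \ln y &= x\beta^{*} + \varepsilon, & E(\varepsilon \mid x) &= 0, \\
  G(y \mid x) &= \exp(x\beta^{*}), &
  E(y \mid x) &= \exp(x\beta^{*})\,E(e^{\varepsilon} \mid x).
\end{align*}
```

In this notation the slope coefficients for the relative arithmetic and relative geometric means agree when the extra factor, $G(\eta \mid x)$ in the first case or $E(e^{\varepsilon} \mid x)$ in the second, does not vary with $x$.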
As stated in Section 2.1, the exponential-multiplicative specification in equation (2.1) and the semi-logarithmic in equation (2.2) are mathematically indistinguishable in both their unlogged and logged forms. However, after arithmetic and geometric expectations have been taken in equations (2.5) and (2.6), the expressions are different, as they rely on different assumptions on the error terms, to be elaborated in Section 2.3. In the exponential-multiplicative model, the focus is on the arithmetic mean of the unlogged dependent variable, as given in equation (2.5a), where the logged arithmetic mean is a linear function of the independent variables [
The elaborations above are not simple to keep track of. To ease comparison across specifications, they are collected in Panel B of Table 1 in lines (a) to (c), along with a few related quantities in lines (d) to (f).
In Table 1, the far-left column gives the quantities on the left sides of the equations: in Panel A for the dependent variable itself, in Panel B for functions of the dependent variable, and in Panel C for functions of the error terms. The four columns to the right of the equality sign report the right sides of the equations, one column for each of the four specifications. Consider Panel B, line (a)—for the expectation of the unlogged dependent variable—the expressions in columns 1 to 4 give the right sides of equations (2.5a) to (2.8a) for, respectively, the exponential-multiplicative, semi-logarithmic (after exponentiation), exponential-additive, and linear specifications. When I refer to equation (2.5a) it means the equation in line (a) in column 1, Panel B of Table 1, and so on for the other lines and columns. The quantities on the left side of lines (d) to (e) and of line (f) are explained respectively in the table and in Section 3 below.
2.3. Elaboration 2: Focus on Error Terms
It is instructive to elaborate on the error terms, especially in the exponential-multiplicative and semi-logarithmic specifications. The elaboration covers not only the standard assumption of arithmetic mean independence of the error terms on the independent variables but also geometric mean independence, the condition that turns out to be crucial for whether the semi-logarithmic specification yields unbiased estimators of the coefficients for the arithmetic mean.
I consider the case in which the variance of
(See also Table 1, Panel C.) For
Geometric mean independence for
In the semi-logarithmic specification we have
(See also Table 1, Panel C.) For
Arithmetic mean independence for the unlogged error term
This equality does not follow from equation (2.11b) or (2.11c) above.
For the exponential-additive and linear models, only the arithmetic mean of the unlogged error term is of interest:
To briefly summarize, with respect to arithmetic mean independence of the error terms on
The different assumptions on the error terms have implications for the results on estimation. Regressing
Against this background, a central question is hence asked (see Section 4): Which assumptions are needed on the error terms for the coefficients for the relative geometric mean (
2.4. Relationships between Error Terms
The error terms for the geometric and arithmetic means are related in important ways, relationships that to my knowledge have not previously been explored but will be needed in an investigation of the role of homoskedasticity versus heteroskedasticity of error terms in Section 7. These relationships hold even though the two specifications focus on entirely different aspects of the dependent variable and are hence different models, and they hold for quite general specifications for
The error term
From equation (2.15) above we can show that the assumptions on
Here, equations (2.16a) to (2.16c) confirm equations (2.11a) to (2.11c). It is noteworthy that the conditional arithmetic mean of
From the relationships above, we have two additional and also novel results on the relationships between the conditional variances [V(.)] of the two error terms:
The two error terms
In Appendix B, I give the parallel expressions for the error term
In summary, the following holds for the error terms in the exponential-multiplicative and semi-logarithmic specifications:
The arithmetic mean of
the conditional variance of
the conditional variance of
The results will be used to derive the conditions for absence of bias for the parameters of the relative arithmetic mean when estimating on the logged dependent variable (Section 4) and for the role of homoskedasticity and heteroskedasticity for bias (Sections 6 and 7).
3. Focus on Dependent Variable: Arithmetic Versus Geometric Means
3.1. Technicalities
Having accounted for the four specifications, I proceed to illustrate empirically the large discrepancies in estimates that can arise from estimating on the unlogged versus logged form of the dependent variable—the two different foci, arithmetic versus geometric means—or why equations (2.1a) and (2.3a) will agree (given the same estimator), whereas equation (2.2b) can give different results, even different signs for coefficients.
It is most instructive to focus on the simplest of all cases, with only one binary independent variable. This exploits the central features of the situation maximally and allows simple expressions for the estimators. The difference between the specifications then reduces to which aspect of the conditional distribution they summarize, the relative arithmetic mean in equations (2.1a) and (2.3a) versus the relative geometric mean in equation (2.2b) (and the absolute arithmetic mean in equation 2.4a). We thus obtain a ceteris paribus conclusion with respect to which aspect of the distribution gets estimated in the four specifications, without having other features of the specifications (such as linearity versus nonlinearity) weigh in. The single independent variable is self-employment status
For the empirical example, I report two sets of results from the two exponential specifications (2.1a) and (2.3a). The first set embeds the models within the framework of the generalized linear model (GLM) and uses a gamma and Poisson distribution for the dependent variable (see McCullagh and Nelder 1989). 10 These are the models used in earlier investigations (see Blackburn 2007; Manning and Mullahy 2001; Santos Silva and Tenreyro 2006), and are hence used here in part for consistency with these studies and in part because they are attractive for researchers because of software availability. The second set of results uses nonlinear least squares for the estimation. This is done in part because nonlinear least squares has played a role in the same prior investigations and in part for a ceteris paribus reason. In the estimation within the GLM framework, there is in the multivariate case (also considered) correction for the heteroskedasticity between the dependent and the independent variables. I want to illustrate the distinction between arithmetic and geometric means as clearly as possible. Making heteroskedasticity corrections to the estimated arithmetic means brings in additional considerations and could jeopardize a ceteris paribus conclusion. The nonlinear least squares estimation removes this concern.
The four specifications are
To see the expressions for the estimators, let
Here
That the geometric mean generally differs from the arithmetic mean is obvious: it obtains as the
The arithmetic and geometric means coincide only when the variance equals zero, that is, when all observations have the same value. Otherwise, the arithmetic mean is larger than the geometric. The geometric mean may decline as the variance increases, but not necessarily. The relationship between the variance and size of the geometric mean is complex, except in special cases. If
For the estimators from the four specifications, the focus will be on three quantities: (1) the constant term, (2) the coefficient for self-employment, and (3) the implied estimate of the relative difference between self-employed and employed.
In the exponential-multiplicative and exponential-additive specifications, the quasi-maximum likelihood and nonlinear least-squares estimators give
In a model with only a constant term and a binary independent variable, the relevant estimators yield identical estimates.
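Why the estimators agree in this case can be checked numerically. The sketch below assumes a log-link mean mu = exp(a + b*d) with a binary regressor d and writes the first-order conditions of nonlinear least squares, Poisson quasi-likelihood, and gamma quasi-likelihood as one weighted score. The function names and the toy data are hypothetical and are not taken from the article.

```python
import math

def group_means(y, d):
    """Arithmetic means of y for d == 0 and d == 1."""
    y0 = [yi for yi, di in zip(y, d) if di == 0]
    y1 = [yi for yi, di in zip(y, d) if di == 1]
    return sum(y0) / len(y0), sum(y1) / len(y1)

def closed_form(y, d):
    """Constant = log mean of group 0; slope = difference in log means."""
    m0, m1 = group_means(y, d)
    return math.log(m0), math.log(m1) - math.log(m0)

def score(y, d, a, b, power):
    """First-order conditions sum_i (y_i - mu_i) * mu_i**power * (1, d_i).

    power = 1: nonlinear least squares
    power = 0: Poisson quasi-likelihood
    power = -1: gamma quasi-likelihood
    """
    s_const = 0.0
    s_slope = 0.0
    for yi, di in zip(y, d):
        mu = math.exp(a + b * di)
        w = (yi - mu) * mu ** power
        s_const += w
        s_slope += w * di
    return s_const, s_slope

# Hypothetical toy data: three employees (d = 0), three self-employed (d = 1).
y = [10.0, 20.0, 30.0, 5.0, 15.0, 100.0]
d = [0, 0, 0, 1, 1, 1]

a_hat, b_hat = closed_form(y, d)
for label, power in (("NLLS", 1), ("Poisson", 0), ("gamma", -1)):
    print(label, score(y, d, a_hat, b_hat, power))  # all numerically zero
```

Within each group the fitted mean is a constant, so the weight mu**power factors out and every score reduces to the sum of deviations from the group mean, which is zero. The three estimators differ only once the fitted mean varies within the groups being compared, as it does with additional regressors.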
In the semi-logarithmic specification, the least-squares estimators are
Finally, in the linear specification, the least-squares estimators are 11
Two remarks are in order. First, in the exponential-multiplicative and the exponential-additive specifications, the coefficient for self-employment gives the impact on the relative difference in arithmetic means on the dependent variable in the two groups, estimated as the difference in the logarithms of the arithmetic means in the two groups. They describe the data correctly in terms of arithmetic means. When the coefficient is small—say, less than .10 in absolute value—this is close to the relative difference in arithmetic means between the two groups. In the linear specification (in equation 3.4), the coefficients give the absolute difference in arithmetic means between the two groups.
Second, in the semi-logarithmic specification, the coefficient for self-employment is the difference between self-employed and employees in the average of the logarithms of the dependent variable, or the difference in the logarithms of the geometric means. As above, when the coefficient is small, this is close to the relative difference in geometric means between the two groups. It is also often close to the relative difference in arithmetic means, but it is not equal to it and can be quite different. This can be seen from equation (3.7c), in which the relative difference is computed for the geometric means. Researchers, when using the semi-logarithmic specification, sometimes report the estimated mean value of the dependent variable in its unlogged form as
3.2. Example: Self-employment and Earnings
To illustrate the issues with real-life data, I selected the sample of native-born black men residing in California in 1990 from the 5 percent Public Use Micro Samples of the 1990 U.S. census, using the 1990 census because these were the data used in the study by Portes and Zhou (1996) cited earlier. As in their study of the impact of self-employment on earnings among foreign-born men in the United States, I restricted the sample to those 25 to 64 years old, who worked at least 160 hours and earned at least $1.00 in 1989, where hours worked is imputed as weeks worked times usual hours worked per week. Using a different set of men allows one to show that the sign reversals they report occur in other groups as well.
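The age, hours, and earnings restrictions described above can be sketched as a simple filter. The record layout and field names below are hypothetical, since the census extract uses its own variable names, and the selection on nativity, race, sex, and state is assumed to have been applied beforehand.

```python
from dataclasses import dataclass

@dataclass
class Person:
    # Hypothetical field names, chosen for readability only.
    age: int
    weeks_worked: float
    usual_hours_per_week: float
    earnings_1989: float

def imputed_hours(p: Person) -> float:
    """Hours worked in 1989, imputed as weeks worked times usual weekly hours."""
    return p.weeks_worked * p.usual_hours_per_week

def in_sample(p: Person) -> bool:
    """Aged 25 to 64, at least 160 imputed hours, and at least $1.00 earned."""
    return (25 <= p.age <= 64
            and imputed_hours(p) >= 160
            and p.earnings_1989 >= 1.0)

people = [
    Person(30, 50, 40, 25_000.0),  # kept
    Person(24, 50, 40, 25_000.0),  # dropped: younger than 25
    Person(40, 3, 40, 2_000.0),    # dropped: 120 imputed hours
    Person(50, 52, 40, 0.0),       # dropped: earned less than $1.00
]
sample = [p for p in people if in_sample(p)]
print(len(sample))  # 1
```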
Annual earnings is the dependent variable. Results are similar using imputed hourly wages as the dependent variable. The independent variables are the same as in Portes and Zhou (1996), but with immigration year, English proficiency, and region excluded. The included variables are listed in Table 2, note e.
Table 2. Estimates of Coefficient for Self-employment Status on Annual Earnings from Five Specifications of the Earnings Equation (Estimated Standard Errors in Parentheses)
Source: The data are taken from the 5 percent Public Use Micro Sample of the 1990 U.S. census. The sample consists of native-born black men residing in California, aged 25 to 64 years, who worked at least 160 hours and earned at least $1.00 in 1989. Imputed hours worked per year obtains as weeks worked times usual hours worked per week. The number of employees is 12,119, and the number of self-employed persons is 724, yielding a total sample of 12,843.
Note: The results are from the two exponential specifications (from GLM estimation with gamma and with Poisson distribution for dependent variable), the semi-logarithmic, the two exponential specifications (from NLLS estimation), and the linear model. For discussion of issues and results, see Sections 3.1 and 3.2, respectively.
a. In Panel A,
b. In Panels B and C, in columns 1 and 3, the headings refer to the coefficients (
c. In Panel B, the estimates come from regressions including a constant term and a dummy variable for self-employment status. SE(.) stands for the standard error of the coefficient.
d. The estimated relative earnings in Panel B obtain as follows: in the two sets of GLM estimates for the two exponential specifications, semi-logarithmic, and the NLLS estimates for the two exponential specifications from equations (3.6c) and (3.7c) as
e. In Panel C, the regressions add the following variables to those in Panel B: marital status (one dummy variable), work experience (years of age minus years of schooling plus 6), education (three dummy variables), and whether one lives with own children (one dummy variable). See Portes and Zhou (1996:229, Appendix) for further explication of these variables. SE(.) stands for the standard error of the coefficient.
f. The estimated relative earnings in Panel C obtain by the same formulas as in Panel B (see note d above), with the exception of the linear specification, which obtains from equation (2.8f) (in Table 1) as
g. Not significantly different from zero at the 5 percent or 10 percent level.
Panel A of Table 2 shows that the arithmetic mean for annual earnings is higher for the self-employed than for employees, by 27.4 percent (from column 10), as also shown from the logarithms of the means (from column 1,
Panel B of Table 2 gives the estimates of equations (3.1) to (3.4). For the two exponential specifications in equations (3.1) and (3.3), columns 1 and 3 give quasi-likelihood estimates for the GLM framework with a gamma and Poisson distribution for the dependent variable, and column 8 gives nonlinear least squares estimates, the latter being the most appropriate for the exponential-additive specification in equation (3.3) (as in Manning and Mullahy 2001:468–69, Table 1). For the semi-logarithmic in equation (3.2) and the linear in equation (3.4), columns 5 and 10 give linear least squares estimates. These bivariate regressions include only a constant term and a dummy variable for self-employment status. For the two exponential specifications, we get identical coefficients across the two GLM and nonlinear least squares procedures, and these reproduce the underlying differences in arithmetic means perfectly, as equation (3.6c) shows they will (as is also the case in the linear specification). They imply the same relative difference in annual earnings of 27.4 percent. The semi-logarithmic specification reproduces perfectly the underlying differences in the logarithms of the geometric means, with a coefficient of –.043, as equation (3.7c) shows it will. Under the common interpretation, this would be reported as implying that self-employed persons earn 4.2 percent less than employees, which is correct for geometric but not for arithmetic means, for which the opposite is the case, and annual earnings are 27.4 percent higher.
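The conversion between a log-scale coefficient and the relative difference it implies can be sketched as follows. The two inputs, the coefficient of –.043 and the relative difference of 27.4 percent, are the figures quoted above; the function names are hypothetical, and the log-scale value computed from 27.4 percent is derived here rather than read from Table 2.

```python
import math

def relative_difference(coef):
    """Relative difference implied by a coefficient on the log scale."""
    return math.exp(coef) - 1.0

def log_coefficient(rel_diff):
    """Log-scale coefficient implied by a relative difference."""
    return math.log(1.0 + rel_diff)

# Semi-logarithmic coefficient: relative difference in geometric means.
print(f"{relative_difference(-0.043):.3f}")  # -0.042, i.e., 4.2 percent less

# Relative difference in arithmetic means, expressed on the log scale.
print(f"{log_coefficient(0.274):.3f}")       # 0.242

# A coefficient of .10 is already slightly below the relative difference.
print(f"{relative_difference(0.10):.3f}")    # 0.105
```

The last line illustrates why a coefficient can be read directly as a relative difference only when it is small in absolute value.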
Panel C reports estimates from the multivariate regression models, using the same estimation techniques as in Panel B. We see the same result: the change in sign for the coefficient for the self-employed between on the one hand the two exponential specifications (as well as the linear) and on the other hand the semi-logarithmic specification. According to the two exponential specifications, self-employed persons earn more than employees (see below). According to the semi-logarithmic specification, self-employed persons on average earn about 16.3 percent less than employees, adjusting for the other variables. This is correct for geometric but not for arithmetic means. According to the linear specification, self-employed persons on average earn $7,735 more than employees.
The GLM specification with a Poisson distribution and the nonlinear least squares estimation yield relative differences of 10.6 percent and 15.6 percent that agree roughly with the relative difference of 12.0 percent in the linear specification, when for the latter it is evaluated at the sample mean of the dependent variable for employees. The GLM specification with a gamma distribution gives the smallest coefficient for self-employment status. The reason for differences is simple. There is stronger correction for heteroskedasticity in the dependent variable in the gamma than in the Poisson distributed model, and no such correction in the nonlinear least squares estimates (e.g., Santos Silva and Tenreyro 2006:645–46). 13
From the examples above, where we see a sign change for the group with the larger variance, one might be tempted to conclude that this is a general result. It is not. Simply put, no relationships between arithmetic and geometric means follow from variances, except in special cases. The means and the variances for the unlogged observations may, for example, be equal for two groups, but there may be a sign change when focusing on the logged observations. In general, equal arithmetic means but different variances do not imply that the group with the higher variance has the lower geometric mean. 14
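A small numerical illustration of these two points, using made-up values rather than the census data, is sketched below. Groups A and B share the same arithmetic mean and variance yet have different geometric means, and group C has the same arithmetic mean as A, a larger variance, and nevertheless a higher geometric mean. The helper name `geometric_mean` and the three groups are hypothetical.

```python
import math
import statistics

def geometric_mean(values):
    """Exponential of the arithmetic mean of the logs (values must be positive)."""
    return math.exp(statistics.fmean([math.log(v) for v in values]))

# Hypothetical groups, each with arithmetic mean 10.
group_a = [1.0, 14.5, 14.5]   # variance 40.5, geometric mean about 5.95
group_b = [19.0, 5.5, 5.5]    # variance 40.5, geometric mean about 8.31
group_c = [2.0, 10.0, 18.0]   # larger variance, geometric mean about 7.11

for name, group in (("A", group_a), ("B", group_b), ("C", group_c)):
    print(
        name,
        statistics.fmean(group),                 # arithmetic mean
        round(statistics.pvariance(group), 2),   # population variance
        round(geometric_mean(group), 2),         # geometric mean
    )
```

The geometric mean responds to where the spread is located, in particular to how close the smallest values come to zero, and not to the variance as such.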
4. Focus on the Error Terms: Condition for Absence of Bias
A central question is, When does estimation on the semi-logarithmic specification result in unbiased estimators for the coefficients for the relative arithmetic mean (in the exponential-multiplicative or exponential-additive specification)? To answer this question I shift focus from properties of the dependent variable to properties of the error terms. This topic has generated significant literatures and is worthy of a more detailed investigation.
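A standard special case makes the question concrete. The sketch below assumes a single binary regressor d and a logged error term that is normal with mean zero and a group-specific standard deviation, so that ln y = beta_star * d + eps. This lognormal setup is chosen for illustration and is not the article's data; all names and parameter values are hypothetical. Under these assumptions the log of the ratio of arithmetic means equals beta_star plus half the difference in the error variances.

```python
import math
import random

def log_arithmetic_ratio(beta_star, sigma0, sigma1):
    """ln[E(y | d=1) / E(y | d=0)] when eps | d is N(0, sigma_d**2)."""
    return beta_star + (sigma1 ** 2 - sigma0 ** 2) / 2.0

def simulate(beta_star, sigma0, sigma1, n=100_000, seed=0):
    """Return (semi-log coefficient, log ratio of arithmetic means) from draws."""
    rng = random.Random(seed)
    sum_y = [0.0, 0.0]
    sum_log = [0.0, 0.0]
    for d, sigma in ((0, sigma0), (1, sigma1)):
        for _ in range(n):
            log_y = beta_star * d + rng.gauss(0.0, sigma)
            sum_log[d] += log_y
            sum_y[d] += math.exp(log_y)
    semi_log_coef = (sum_log[1] - sum_log[0]) / n     # difference in mean ln y
    arithmetic_coef = math.log(sum_y[1] / sum_y[0])   # equal group sizes
    return semi_log_coef, arithmetic_coef

# Heteroskedastic case: the two coefficients have opposite signs.
print(log_arithmetic_ratio(-0.05, 0.5, 1.0))   # about 0.325
print(simulate(-0.05, 0.5, 1.0))               # roughly (-0.05, 0.325)

# Homoskedastic case: the two coefficients agree.
print(log_arithmetic_ratio(-0.05, 0.7, 0.7))   # about -0.05
```

In this special case heteroskedasticity of the logged error produces the bias and homoskedasticity removes it, but the example does not show that homoskedasticity is necessary or sufficient in general.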
What follows is technically straightforward but at times cumbersome. The key insight and technical formulation, given already in equations (2.5c) and (2.6a), is simple, though novel. From those formulations, two results follow. First, in Section 4.1, focusing on the error term in the exponential-multiplicative specification, I derive the necessary and sufficient condition for when estimation on the semi-logarithmic specification yields unbiased estimators of the coefficients for the relative arithmetic mean, that is, for when coefficients for relative geometric and arithmetic means coincide. Second, in Section 4.2, I do the corresponding derivations for the error term in the semi-logarithmic specification. For both specifications, the condition is that there is simultaneously arithmetic and geometric mean independence of the unlogged error terms. This result has to the best of my knowledge not been articulated as simply before, though it has clearly been implicit in earlier investigations (cited in Section 7 below). Recall that the unlogged error term in the exponential-multiplicative specification is arithmetic mean independent (see equation 2.9a), but not necessarily geometric mean independent (see equation 2.9c), while the unlogged error term in the semi-logarithmic specification is not necessarily arithmetic mean independent (see equation 2.11a), but it is geometric mean independent (see equation 2.11c). As in the results for the relationships between the error terms in the exponential-multiplicative and semi-logarithmic specifications, the results below hold for quite general specifications of
4.1. Error Term in the Exponential-multiplicative Specification
In the exponential-multiplicative and the semi-logarithmic specifications, we estimate respectively the parameters for the relative arithmetic (
Angle 1: Unlogged Error Term. The key tool is the expression for the geometric mean in the exponential-multiplicative specification:
where the formulation in equation (4.1a) was already stated in equation (2.5c). It is entirely straightforward, almost trivial, though unusual, and it appears not to have been used in past investigations. It takes the exponential-multiplicative specification and its focus on the arithmetic mean as the point of departure. It then asks, given this formulation, instead of focusing on the arithmetic mean as in equation (2.5a), What would the expression be for the implied geometric mean? It shows that the conditional geometric mean for the unlogged dependent variable in the exponential-multiplicative specification equals its conditional arithmetic mean
The left side of equation (4.1a), or of equation (4.1b), equals the first term on the right side (the conditional arithmetic mean) only when the geometric mean of the error term (
The second and more useful question asked above is, When will a coefficient for a measured variable from the semi-logarithmic specification (for the relative geometric mean) yield an unbiased estimator of the corresponding coefficient in the exponential-multiplicative model (for the relative arithmetic mean)? That is, when does a coefficient in the two specifications represent the same quantity, or when do relative geometric and arithmetic means coincide? The expression we then need to investigate is
where
Equation (4.2) gives a first angle to the question of bias, involving an investigation of the unlogged error term in the exponential-multiplicative specification: the relative difference on the dependent variable between the two geometric means equals the relative difference between the two arithmetic means only when the geometric means of the two error terms are equal, that is, when there is geometric mean independence in the error terms. The latter in turn corresponds to arithmetic mean independence of the logged error terms (from equation 2.10b).
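This first angle can be checked with exact arithmetic on a toy two-group model. The sketch below is only an illustration: the parameter values b0 and b1 and the two-point error distributions (each with arithmetic mean 1) are hypothetical and not taken from the article. It shows that the ratio of geometric means equals the ratio of arithmetic means multiplied by the ratio of the geometric means of the error terms.

```python
import math


def arithmetic_mean(values, probs):
    return sum(p * v for v, p in zip(values, probs))


def geometric_mean(values, probs):
    return math.exp(sum(p * math.log(v) for v, p in zip(values, probs)))


# Hypothetical exponential-multiplicative model: y = exp(b0 + b1 * D) * mu,
# with arithmetic mean of mu equal to 1 in both groups.
b0, b1 = 1.0, 0.10
probs = [0.5, 0.5]
mu_group0 = [0.5, 1.5]   # arithmetic mean 1, moderate spread
mu_group1 = [0.2, 1.8]   # arithmetic mean 1, larger spread

y_group0 = [math.exp(b0) * m for m in mu_group0]
y_group1 = [math.exp(b0 + b1) * m for m in mu_group1]

am_ratio = arithmetic_mean(y_group1, probs) / arithmetic_mean(y_group0, probs)
gm_ratio = geometric_mean(y_group1, probs) / geometric_mean(y_group0, probs)
error_gm_ratio = (geometric_mean(mu_group1, probs)
                  / geometric_mean(mu_group0, probs))

# Coefficient targeted on the unlogged form versus on the logged form.
coef_arithmetic = math.log(am_ratio)   # equals b1
coef_geometric = math.log(gm_ratio)    # equals b1 + ln(error_gm_ratio)

print(f"coefficient for relative arithmetic mean: {coef_arithmetic:.4f}")
print(f"coefficient for relative geometric mean:  {coef_geometric:.4f}")
print(f"ln ratio of error geometric means:        {math.log(error_gm_ratio):.4f}")
```

With these hypothetical numbers the error terms are not geometric mean independent, and the two coefficients have opposite signs.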
Using the parametrizations in equation (4.1b), and where
Again, we see that
Angle 2: Logged Error Term. A second angle on the question of bias obtains from equations (2.5b) and (2.6b):
The linear least squares estimator corresponding to equation (4.4a) (or equation 4.4b) will always yield unbiased estimators for
Further insight can be gained by subtracting equation (4.4b) from equation (4.4a), yielding
We see that the arithmetic mean of
4.2. Error Term in the Semi-logarithmic Specification
For the exponential-multiplicative specification, with its focus on the arithmetic mean, I asked what is the implied expression for the geometric mean, reported in equation (4.1a). Mimicking the procedure in equations (4.1) to (4.3), I now ask the parallel question for the semi-logarithmic specification, with its focus on the geometric mean: What is the implied expression for the arithmetic mean? The first expression is
where equation (4.6a) comes from equation (2.6a) and where in equation (4.6b) I have inserted the relevant parametrizations of the arithmetic and geometric means, in the same manner as in equation (4.1b).
From equation (2.16a), recall that
The first question is, When is the geometric mean [
The second question is, When do the two sets of coefficients for measured variables coincide? We need to assess (as done in equation 4.2) the ratio of the expression in equation (4.6a) evaluated at
where equation (4.8a) is the parallel expression to equation (4.2) and where equation (4.8b) follows from equation (4.7). The two sets of coefficients for measured variables coincide only when
4.3. Summary
Although estimation on the logged dependent variable gives unbiased estimators for the coefficients for the relative geometric mean (
One can test for geometric mean independence. Estimate first
5. Statistical Independence: Sufficient Condition for Absence of Bias
A well-understood case in which estimation of the semi-logarithmic specification will yield unbiased estimators of the coefficients for the relative arithmetic mean arises when the conditional distribution of the error term
With statistical independence, the probability density functions satisfy
where the second equality follows from the first. This distribution can follow any form and can be highly skewed and highly asymmetrical.
That the error term is statistically independent of the regressors is a strong condition. From it flow immediately five other results, the first three of which are
where equations (5.3) and (5.4) are well known and easy to show. The result in equation (5.5) follows from equations (A.4a) and (A.4f) in Appendix A applied to
The first and third results establish arithmetic and geometric mean independence for the unlogged error term, whereas the second result establishes arithmetic mean independence for the logged error term. With statistical independence, there is both arithmetic and geometric mean independence in the unlogged error terms, and the coefficients from the semi-logarithmic specification yield unbiased estimators for the coefficients for the relative arithmetic mean.
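A short simulation can illustrate this sufficiency result. The sketch below assumes a hypothetical two-group model with parameter values b0 and b1 of my own choosing, and draws the error term from the same skewed gamma distribution (arithmetic mean 1) in both groups, so that the error is statistically independent of the group dummy. With a single dummy regressor, the least squares slope on the logged form is simply the difference in group means of ln y.

```python
import math
import random
import statistics

rng = random.Random(12345)   # fixed seed so the sketch is reproducible

b0, b1 = 1.0, 0.10           # hypothetical parameter values
n_per_group = 50_000


def draw_error():
    # Gamma(shape=2, scale=0.5): positive, skewed, arithmetic mean 1.
    # The same distribution is used in both groups (statistical independence).
    return rng.gammavariate(2.0, 0.5)


y0 = [math.exp(b0) * draw_error() for _ in range(n_per_group)]
y1 = [math.exp(b0 + b1) * draw_error() for _ in range(n_per_group)]

# Unlogged form: log of the ratio of sample arithmetic means.
coef_unlogged = math.log(statistics.fmean(y1) / statistics.fmean(y0))

# Logged form: difference in mean ln(y), i.e. the OLS slope on the dummy.
coef_logged = (statistics.fmean([math.log(v) for v in y1])
               - statistics.fmean([math.log(v) for v in y0]))

print(f"true b1:                  {b1:.3f}")
print(f"estimate, unlogged form:  {coef_unlogged:.3f}")
print(f"estimate, logged form:    {coef_logged:.3f}")
```

Both estimates should land close to b1, up to sampling error, even though the error distribution is highly skewed.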
But the condition of statistical independence implies two additional results not needed in order to avoid bias (to be shown in Section 7):
namely, that there is homoskedasticity in both the unlogged and logged error terms.
One case of statistical independence has been given considerable attention:
It is the conditional distribution, not the marginal, that needs to be normal. This of course is a special case, unlikely to arise often in empirical research.
The analysis above shows that statistical independence of the error terms from the independent variables is a sufficient condition for absence of bias.
6. Statistical Independence: Necessary for Absence of Bias?
Statistical independence for the error terms on regressors is a sufficient condition for absence of bias when estimating on the logged dependent variable. One may reasonably ask whether it is also a necessary condition, that is, whether simultaneous arithmetic and geometric mean independence also requires statistical independence. I show below through an analytical example that it is not a necessary condition, using proof by counterexample, with details in Appendix C. The result could also have been easily obtained by numerical example, but the analytical example also gives tools for investigating the role of heteroskedasticity (in Section 7).
Consider two groups where the unlogged error terms in the exponential-multiplicative specification follow respectively a lognormal (group 0) and log-uniform (group 1) distribution (see Appendix C.1). There is hence a lack of statistical independence for the error terms on regressors, probably the most common situation. Both are symmetric two-parameter distributions in the logged error terms; the unlogged error terms cannot be symmetrically distributed, because they are bounded from below by 0. The arithmetic means of the unlogged error terms are equal (=1.0), and so are the geometric means (<1.0) and hence also the arithmetic means of the logged error terms (<0) (see Appendix C.2). Even with lack of statistical independence, there can still be simultaneous arithmetic and geometric mean independence for the unlogged error terms (see Appendix C.3). Estimation on the semi-logarithmic specification then yields estimators that are unbiased for the parameters of the relative arithmetic mean.
In the example considered, for a given lognormal distribution for the unlogged error term in group 0, there is one set of parameters for the log-uniform distribution in group 1 so that there is geometric mean independence in the error terms. The likelihood of two distributions matching up in this way is so close to zero that practically no number of decimals can do it justice. Estimates from the semi-logarithmic specification are by implication almost surely biased for the parameters of the exponential-multiplicative specification and can only be unbiased, as is shown in Section 7, if the error terms exhibit a specific heteroskedasticity. The size of this bias can of course vary widely, from small to large. (For the case of large bias, see the example in Section 3.2).
As the example shows, the fact that the error terms depend statistically on the independent variable need not result in bias in the coefficients. Statistical independence of error terms on independent variables is thus a sufficient but not a necessary condition for lack of bias of the coefficients in the semi-logarithmic specification for the coefficients for the relative arithmetic mean.
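The matching of the two distributions can be sketched numerically. The parametrization below is my own assumption and may differ from the one used in Appendix C: in group 0, ln µ is normal with variance sigma2 and mean -sigma2/2; in group 1, ln µ is uniform on an interval with the same center and half-width h. Equal arithmetic means (=1) then require sinh(h)/h = exp(sigma2/2), which the code solves by bisection.

```python
import math


def solve_half_width(sigma2, tol=1e-12):
    """Find h > 0 with sinh(h) / h = exp(sigma2 / 2) by bisection."""
    target = math.exp(sigma2 / 2.0)
    lo, hi = 1e-9, 1.0
    while math.sinh(hi) / hi < target:   # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if math.sinh(mid) / mid < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


sigma2 = 0.5                  # hypothetical variance of ln(mu) in group 0
center = -sigma2 / 2.0        # common arithmetic mean of ln(mu)
h = solve_half_width(sigma2)

# Group 0 (lognormal) and group 1 (log-uniform): arithmetic means of mu.
am_group0 = math.exp(center + sigma2 / 2.0)
am_group1 = math.exp(center) * math.sinh(h) / h

# Geometric means of mu are exp of the mean of ln(mu): identical by design.
gm_group0 = math.exp(center)
gm_group1 = math.exp(center)

# Variances of the logged and unlogged error terms in the two groups.
var_log0, var_log1 = sigma2, h * h / 3.0
var_unlogged0 = math.exp(sigma2) - 1.0
var_unlogged1 = math.exp(2.0 * center) * math.sinh(2.0 * h) / (2.0 * h) - 1.0

print(f"half-width h = {h:.6f}")
print(f"arithmetic means: {am_group0:.6f}, {am_group1:.6f}")
print(f"geometric means:  {gm_group0:.6f}, {gm_group1:.6f}")
print(f"log variances:      {var_log0:.6f}, {var_log1:.6f}")
print(f"unlogged variances: {var_unlogged0:.6f}, {var_unlogged1:.6f}")
```

Under this parametrization the arithmetic and geometric means agree across groups while the variances do not, so the two error distributions differ even though both mean independence conditions hold.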
7. Homoskedasticity: Necessary or Sufficient for Absence of Bias?
Section 6 established that statistical independence (with its implied homoskedasticity) for the error terms on the regressors is not a necessary condition for absence of bias for the coefficients for the relative arithmetic mean when estimating on the logged version of the dependent variable. The next question is whether homoskedasticity is a necessary or sufficient condition for absence of bias.
Considerable attention has been given to the role of homoskedasticity versus heteroskedasticity of the error terms for bias, with two important recent contributions. One approach focuses on heteroskedasticity in the unlogged multiplicative error term in the exponential-multiplicative specification. Santos Silva and Tenreyro (2006) wrote, “The basic problem is that log-linearization (or, indeed any nonlinear transformation) of the empirical model in the presence of heteroskedasticity leads to inconsistent estimates” (p. 653). Another approach focuses on heteroskedasticity in the logged error term of the semi-logarithmic specification. Manning and Mullahy (2001) wrote, “Unfortunately, when log-scale error term
7.1. Analytical Example
I use the analytical example from Section 6 to prove the points around heteroskedasticity and homoskedasticity by means of counterexamples. The example focused on the error terms for two groups (with details in Appendix C). In constructing the error terms, I started with the unlogged error term
Computationally, I take the variance of the error term in group 0 in the exponential-multiplicative specification as the basis, and then derive what the variance in group 1 must be for estimation on the logged dependent variable to yield unbiased estimators for the coefficients for the relative arithmetic mean (see Appendix C.3). Figure 1 gives for the exponential-multiplicative specification the resulting plots with the variance of the logged error term in group 0 on the horizontal axis (over the range from 0 to 1, which is considerable for the logged metric), and the unlogged variances in groups 0 and 1 as well as the logged variance in group 1 on the vertical axis. The plots of the variances of the corresponding unlogged error terms in the semi-logarithmic specification are shifted upward relative to the exponential-multiplicative variances by a factor equal to the inverse of the geometric mean

Variances of
Consider first the four cases of heteroskedastic error terms (logged and unlogged) in the two specifications. As Figure 1 shows, both the logged and unlogged error terms need to be heteroskedastic in order for the estimators to be unbiased. This case was considered already in Section 6.
Homoskedasticity is therefore not a necessary condition for estimation on the logged dependent variable to give unbiased estimators for the coefficients for the relative arithmetic means. In each of the four cases, the error term is heteroskedastic, but estimators are unbiased.
Consider next the four cases of homoskedastic error terms (logged and unlogged) in both specifications. In each of the four cases, if there is homoskedasticity, the condition of geometric mean independence in the unlogged error term in the exponential-multiplicative specification is violated, because when the condition holds heteroskedasticity follows (see again Figure 1). Hence, under homoskedasticity, estimation on the logged dependent variable gives biased estimators for the coefficients for the relative arithmetic means. There exists no set of homoskedastic error terms (unlogged or logged) that will yield unbiased estimators (see Appendix C.5), except for the trivial case of no residual variance.
Homoskedasticity is therefore not a sufficient condition for estimation on the logged dependent variable to give unbiased estimators for the coefficients for the relative arithmetic means. In each of the four cases, if the error term is homoskedastic, then the estimators are biased.
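One of these cases can be sketched numerically, under a parametrization that is my own assumption and may differ from Appendix C. The logged error term is normal in group 0 and uniform in group 1, with the same variance sigma2 in both groups (homoskedasticity in the logged error term), and the unlogged error term has arithmetic mean 1 in both groups. The code computes the resulting difference in the means of the logged error terms, which is the amount by which the logged-form coefficient departs from the coefficient for the relative arithmetic mean.

```python
import math


def bias_under_equal_log_variance(sigma2):
    """Return (bias, arithmetic mean of mu in group 1).

    Group 0: ln(mu0) ~ Normal(-sigma2 / 2, sigma2), so E[mu0] = 1.
    Group 1: ln(mu1) ~ Uniform(center1 - h, center1 + h), with the same
    log variance (h**2 / 3 == sigma2) and center1 chosen so that E[mu1] = 1.
    """
    h = math.sqrt(3.0 * sigma2)
    center0 = -sigma2 / 2.0                   # E[ln mu0]
    center1 = -math.log(math.sinh(h) / h)     # E[ln mu1]; forces E[mu1] = 1
    mean_mu1 = math.exp(center1) * math.sinh(h) / h
    # Difference in log geometric means of the error terms across groups.
    return center1 - center0, mean_mu1


results = {}
for sigma2 in (0.1, 0.25, 0.5, 1.0):   # hypothetical log variances
    bias, mean_mu1 = bias_under_equal_log_variance(sigma2)
    results[sigma2] = (bias, mean_mu1)
    print(f"sigma2={sigma2:.2f}  E[mu1]={mean_mu1:.6f}  bias={bias:.6f}")
```

In this sketch the bias is nonzero for every positive sigma2, although for these two distributional shapes it is small; the point is only that equal variances do not deliver geometric mean independence.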
In summary, homoskedasticity for error terms is neither a necessary nor a sufficient condition for estimation on the logged dependent variable to give unbiased estimators for the coefficients for the relative arithmetic means. We may get unbiased estimators in the presence of heteroskedasticity, which may sometimes be required for estimators to be unbiased, whereas homoskedasticity, as in the example above, can go together with bias.
8. Same Shape for Error Distribution but Heteroskedasticity
With statistical independence of error terms on regressors, estimation on the logged form yields unbiased estimators for the parameters of the relative arithmetic mean. Statistical independence also implies homoskedasticity (see equations 5.6 and 5.7). This explains in part why there has been a major focus on the role of heteroskedasticity versus homoskedasticity of the error term for the question of bias.
Consider now the semi-logarithmic specification, and the case where for the error term ln µSi, (1) the same qualitative shape of the probability density function (e.g., normal) applies at each value of xi, (2) there is arithmetic mean independence (=0) and hence geometric mean independence for the unlogged error term µSi (=1), but (3) there is heteroskedasticity, so there is not statistical independence. This is the case extensively investigated in Manning and Mullahy (2001) for ln µSi and in Santos Silva and Tenreyro (2006) for µMi.
The question now arises: in the semi-logarithmic specification with error term ln µSi, in the heteroskedastic case outlined in points (1) to (3) above, is it possible for the unlogged error term µSi, not only the logged error term ln µSi, also to be arithmetic mean independent, the condition required for estimation on the logged form to give unbiased estimators of the parameters for the arithmetic mean? For the normal and uniform distributions for ln µSi, the answer is no. But is it in general no?
It is instructive to pursue this question further, as it has received much attention in studies of the role of heteroskedasticity for bias (see note 17). Let there be two groups captured by the dummy variable Dij, where j indexes the group and i the observation within the group. The argument below goes through without any change also for a specification including other independent variables.
In group D1i = 1, the error term ln µS,1i has (1) the same overall distributional shape as ln µS,0i, (2) the same arithmetic mean as ln µS,0i (=0), but (3) a different variance. The error term in group 1 then has this structure:
where the constant c is larger than 0. The variance is the same as in group 0 if c = 1, smaller if c < 1, and larger if c > 1. The distribution of ln µS,1i around its mean 0 is just shifted multiplicatively with the constant c relative to the distribution for ln µS,0i, with conditional variances of
While the logged error terms have the same arithmetic mean (=0), and hence the unlogged error terms the same geometric mean (=1), the unlogged error terms have different arithmetic means. This can be seen from taking the expectation of equation (8.1b):
Estimation on the logged form yields unbiased estimators for the parameters of the geometric mean, both under homoskedasticity and heteroskedasticity. If c = 1, the case of homoskedasticity, it would also give unbiased estimators for the arithmetic mean because there then is arithmetic mean independence also for the unlogged error term (see Section 4.2). But under heteroskedasticity this will not be the case.
Why is this? For the estimators to be unbiased also for the arithmetic mean we need the unlogged error term µS,ji to be arithmetic mean independent, which implies geometric mean independence for the corresponding unlogged error term µM,ji in the exponential-multiplicative specification. From equation (8.2), we see that there is not arithmetic mean independence for µS,ji (except for special cases in which the means of µS,ji and [µS,ji]^c are the same).
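The role of the constant c can be illustrated with two closed-form cases, normal and uniform logged error terms centered at 0. The sketch below uses hypothetical values for the spread in group 0 and for c; the function names are mine. It computes the arithmetic mean of the unlogged error term in each group and the log of their ratio, which is the gap between the coefficient for the relative arithmetic mean and the coefficient estimated on the logged form.

```python
import math


def mean_unlogged_normal(sd_log, c):
    """E[exp(c * Z)] for Z ~ Normal(0, sd_log**2)."""
    return math.exp((c * sd_log) ** 2 / 2.0)


def mean_unlogged_uniform(half_width, c):
    """E[exp(c * U)] for U ~ Uniform(-half_width, half_width)."""
    h = c * half_width
    return math.sinh(h) / h


def arithmetic_gap(mean_function, spread, c):
    """ln(E[mu_S1] / E[mu_S0]); group 0 corresponds to c = 1."""
    return math.log(mean_function(spread, c) / mean_function(spread, 1.0))


sd_log = 0.7        # hypothetical standard deviation of ln(mu) in group 0
half_width = 1.2    # hypothetical half-width of ln(mu) in group 0

for c in (0.5, 1.0, 2.0):
    gap_normal = arithmetic_gap(mean_unlogged_normal, sd_log, c)
    gap_uniform = arithmetic_gap(mean_unlogged_uniform, half_width, c)
    print(f"c={c:.1f}  gap (normal)={gap_normal:+.4f}  "
          f"gap (uniform)={gap_uniform:+.4f}")
```

In both cases the gap is zero only at c = 1, the homoskedastic case. For the normal case the gap equals (c² − 1) times half the variance of the logged error term in group 0.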
A special case was considered in Section 5 with a lognormal distribution for the error term in the exponential-multiplicative specification and hence a normal distribution for the logged error term, a case of statistical independence. Retaining lognormality, but relaxing statistical independence, and hence allowing for heteroskedasticity, will lead to bias when estimating on the logged form, sometimes even switching of signs. Suppose the arithmetic means are the same for two groups, with
In conclusion, in the semi-logarithmic specification, for a logged error term with (1) the same qualitative distributional shape at each value of independent variables, (2) arithmetic mean independence, but (3) heteroskedasticity, estimation on the logged form will give biased estimators for the parameters for the arithmetic mean. This analytical investigation confirms the results in the simulations reported for the semi-logarithmic specifications (see note 17), but now for quite general distributions.
9. Conclusion and Discussion
In regression analysis with a continuous and positive dependent variable researchers often specify a multiplicative relationship between the dependent and independent variables. There is then a choice between estimating the relationship on the unlogged versus the logged form of the equation. The two estimation procedures may yield opposite signs for key parameters or other major discrepancies in results, such as a large coefficient in one specification and a small or zero coefficient in the other.
The reason for a divergence is that estimation on the unlogged form yields coefficients with impacts on the relative arithmetic mean of the dependent variable, whereas estimation on the logged form gives coefficients with impacts on the relative geometric mean of the same unlogged dependent variable (corresponding to the impact on the absolute arithmetic mean of the logged dependent variable). Coefficients may thus be very different and even change signs between the two specifications, due to their different foci, relative arithmetic versus geometric means. When authors interpret coefficients in the semi-logarithmic specification as giving relative differences in arithmetic means, then that is a misinterpretation. They give the relative differences in geometric means. The divergence between the two types of means can be big. Though the semi-logarithmic specification in many cases will give parameter estimates that are close to the relative differences in arithmetic means, it is difficult to know when this is the case or not. Some caution is thus required. Neither the phenomenon nor its reason is well understood in sociological research.
In an empirical example I showed that estimation on the unlogged versus logged dependent variable gave estimates that were very different, to the extent of opposite signs for the key coefficient. The problem is thus of considerable practical interest.
In an analytical investigation, I focused on the fact that entirely different aspects of the conditional distribution of the dependent variable are estimated in the two versions: the relative arithmetic versus the relative geometric mean. The investigation relied on a set of relationships between the exponential-multiplicative and semi-logarithmic specifications and between their unlogged and logged error terms. From these tools I obtained five results. First, I derived the necessary and sufficient condition for when estimation on the logged dependent variable will give unbiased estimators for the parameters for the relative arithmetic mean. This requires that there is not only arithmetic mean independence of the unlogged error terms but also geometric mean independence. Second, it was shown that statistical independence of the error terms from the regressors implies both arithmetic and geometric mean independence, and it is hence a sufficient condition for the coefficients from the semi-logarithmic specification to be unbiased for the coefficients for the arithmetic mean. Statistical independence additionally implies homoskedasticity of the unlogged and logged error terms in both the exponential-multiplicative and semi-logarithmic specifications. Third, I showed that statistical independence of the error terms, while a sufficient condition for absence of bias, is not a necessary condition. Fourth, I showed that homoskedasticity is neither a necessary nor a sufficient condition for absence of bias. Presence of heteroskedasticity may sometimes be required in order to achieve unbiased estimators.
Fifth, I showed that in the semi-logarithmic specification, for a logged error term with the same qualitative distributional shape at each value of independent variables (e.g., normal, uniform), arithmetic mean independence, but heteroskedasticity, estimation on the logged form will give biased estimators for the parameters for the arithmetic mean (whereas with homoskedasticity, and for this case thus statistical independence, estimators are unbiased, from the second result above). The first, second, and fifth results were obtained by general derivations, the third and fourth by counterexamples. While I have used the term exponential-multiplicative specification, the results derived are general and do not require the model to have a structure that is exponential in the independent variables.
In terms of best practice, the coefficients for the conditional geometric mean of the dependent variable are rarely of substantive interest, whereas those for the conditional arithmetic mean are. Since results from the former can be seriously biased for the latter, best practice would lead one to estimate the exponential specifications. From some of the simulations cited in Section 7, when using the framework of generalized linear models, an exponential specification with Poisson distribution for the dependent variable seems advisable. Santos Silva and Tenreyro (2006:648–49) write: "its performance is reasonably good in all cases." But best practice is not always feasible practice. For example, when fixed effects are to be included in data with a panel or group structure, the semi-logarithmic specification may for computational reasons be the best choice. But even then, in the exponential specifications consistent estimation with fixed effects is feasible for some specifications, such as when the dependent variable follows a Poisson distribution (Wooldridge 1999:81–82; Allison and Waterman 2007).
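The contrast between the two estimation routes can be sketched for the simplest case of an intercept and one dummy regressor. In that case the Poisson pseudo-maximum-likelihood first-order conditions are solved by the group means, so the coefficient is the log of the ratio of sample arithmetic means, while least squares on the logged form gives the difference in mean logs. The simulated data, parameter values, and heteroskedastic lognormal errors below are hypothetical and chosen only for illustration.

```python
import math
import random
import statistics

rng = random.Random(2024)    # fixed seed so the sketch is reproducible

b0, b1 = 1.0, 0.10           # hypothetical parameter values
sd_log0, sd_log1 = 0.5, 1.0  # heteroskedastic logged errors (assumed)
n_per_group = 100_000


def draw_error(sd_log):
    # Lognormal error with arithmetic mean 1: E[exp(N(m, s^2))] = 1
    # when m = -s^2 / 2.
    return rng.lognormvariate(-sd_log ** 2 / 2.0, sd_log)


y0 = [math.exp(b0) * draw_error(sd_log0) for _ in range(n_per_group)]
y1 = [math.exp(b0 + b1) * draw_error(sd_log1) for _ in range(n_per_group)]

# Poisson pseudo-ML with intercept and dummy: closed-form solution.
intercept_poisson = math.log(statistics.fmean(y0))
coef_poisson = math.log(statistics.fmean(y1) / statistics.fmean(y0))

# Check the two score equations, sum(y - fitted) = 0 within each group.
score_group0 = math.fsum(v - math.exp(intercept_poisson) for v in y0)
score_group1 = math.fsum(
    v - math.exp(intercept_poisson + coef_poisson) for v in y1)

# OLS on the logged form with a single dummy: difference in mean ln(y).
coef_log_ols = (statistics.fmean([math.log(v) for v in y1])
                - statistics.fmean([math.log(v) for v in y0]))

print(f"true b1:                    {b1:+.3f}")
print(f"Poisson pseudo-ML estimate: {coef_poisson:+.3f}")
print(f"logged-form OLS estimate:   {coef_log_ols:+.3f}")
```

Under these assumptions the logged-form estimate targets b1 − (sd_log1² − sd_log0²)/2, which is negative here, while the Poisson estimate targets b1. With more regressors the Poisson estimates no longer have a closed form and would be obtained with a generalized linear model routine.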
The results in this paper are thus a call for awareness that the semi-logarithmic specification sometimes can give results that are substantively difficult to interpret or even meaningless, in so far as the geometric mean is not a quantity that is natural to report in research, no more than we report average costs of going to college in terms of geometric means. As a matter of communication economies vis-a-vis readers, the semi-logarithmic specification will often still be desirable to retain, since it is widely established across all the social sciences and many other fields (e.g., Strimbu [2012]). But it then behooves researchers to check whether the results are appreciably biased relative to the coefficients for the arithmetic mean, those of greater substantive interest, and even to test whether estimates from the semi-logarithmic and exponential (multiplicative or additive) specifications are significantly different from each other.
The phenomenon is general, arising in all research with a positive and continuous dependent variable where a multiplicative relationship seems substantively natural. It shows that one then should be more careful about using logarithmic transformations of a dependent variable, and instead estimate the relationship in its unlogged form. The problem is moreover generic to transformations of dependent variables more broadly.
Footnotes
Appendix A
Appendix B
Appendix C
Acknowledgements
I thank Robert Anderson, Erling Barth, Kjell Doksum, Arthur Goldberger, Leo Goodman, David Grusky, Hans-Tore Hansen, Robert Hauser, Peter Hedström, Michael Hout, Willie Jasso, David Levine, Scott Long, Charles Manski, Arne Mastekaasa, Charles McCullogh, Eva Meyersson Milgrom, Steve L. Morgan, Roy A. Nielsen, Andrew Noymer, Andrew Penner, Ole-Jørgen Skog, Anders Skrondal, Michael Sobel, Ross Stolzenberg, Anders Rygh Swensen, Christer Thrane, and the anonymous reviewers for discussion and extensive comments. I thank Min Zhou for help in defining the sample used in the article; Michelle Arthur, Andrew Penner, and Harold Jose Toro Tulla for assistance in connection with the computations in Table 2; and Erik Petersen for assistance in the computations in Appendix C (using the program Mathematica) and for producing the graphs in Figure 1 (using the program MATLAB). In particular I thank J. M. C. Santos Silva for extensive written comments and for correcting errors, Silvana Tenreyro for weighing in on these comments, and John Mullahy for additional written comments and suggestions. And I thank Michael Sobel for commenting on the article repeatedly over many years.
Notes