Abstract
Are we cautious enough when using linear models? After the 1970s linear models became the most common method for quantitative social scientists. More discussion on their scope and limitations is needed. We focus on one stage of the modeling process, namely, variable selection. We show that a rigorous comparison between bivariate and multivariate regression models should be done in this stage as non-orthogonality among predictors can lead to ambiguous estimates. Further, we use geometrical representations of linear models for two purposes. First, to visualize sources of instability and the causes of ambiguous results. Second, to support residual regression as an alternate approach. We illustrate our ideas using data collected by Cukierman et al. (2002) on the relationship between government regulation and inflation in 26 countries. Our conclusions stress the need to assess structural effects and support parsimonious models with few predictors.
Introduction
A substantial increase in the use of linear modeling techniques in social science research has occurred since the late 1960s (Ollion, 2011; Cornwell, 2015: chap. 2). By linear models we refer to statistical models in which the relationship between the expected value of a dependent variable, E[Y], and a set of covariates, collected in the model matrix X, is specified through a link function g(·) and a set of parameters, typically presented as a vector denoted by β, that is, g(E[Y]) = Xβ (equation 1).
Under a classic statistical approach, the β-parameters enter the model linearly and are treated as fixed and unknown. Several estimation techniques have been developed and incorporated into statistical packages for a wide range of probability distributions, in particular for the so-called exponential family of distributions (Nelder and Wedderburn, 1972; McCullagh, 1984; Dobson and Barnett, 2008). This has contributed to an increase in the number of studies relying on linear models, termed Generalized Linear Models (GLM) after Nelder and Wedderburn’s (1972) seminal work. These models have served to produce and refine theories about social phenomena in multiple domains.
Considerable attention has been devoted to understanding the potential of this approach both technically and theoretically; less attention has been devoted, however, to its limitations. In technical terms, criteria for the selection of variables for the right-hand side of equation 1 and the interpretation of the estimates continue to be challenged in the literature (Rouanet et al., 2002; Deauvieau, 2010; Selz and Deauvieau, 2011; Bry et al., 2016). From a theoretical standpoint, concerns about the appropriateness of this approach to study certain social phenomena have been raised, mainly opposing it to non-linear/exploratory statistical techniques (Darras, 1966; Abbot, 1988; Hirschman, 1994; Desrosières, 2001, 2008; Bourdieu, 2005). To give the reader a sense of the tone of this discussion, at least in Sociology, we selected a quote from Hirschman’s article on theories of fertility change:
“The standard social science model is that society works pretty much like a regression equation: the task is to find the right set of predictors, solve the equation, and discover what factors are most important in predicting social outcomes. This framework does lead to empirical generalizations, but there seem to be endless qualifications about the measurement of variables, the meaning and interpretation of variables, the substitutability of one variable for another, and complex interactions with historical settings. If science is to discover parsimonious principles that explain complex patterns, we do not seem to be making progress” (1994: 226).
Our aim is not to settle these debates but to call for more caution when using linear models in the social sciences. We further suggest the need to relocate the opposition between explanation and description from the realm of statistical methods to the realm of theory. In other words, we contend that all statistical approaches can be explanatory (i.e. they can be used to develop theories about social behavior) when they are appropriately informed by theory. Hence, we call for more methodological openness when conducting quantitative research in the social sciences.
More specifically, we critically assess two stages of the modeling process, namely variable selection and the interpretation of the estimates. The assessment of these two stages is crucial given the rapid growth of large data sets on social issues and the predominance of linear modeling techniques in quantitative social sciences. Based on a systematic comparison of bivariate and multivariate models and their geometric representation, we claim that linear modeling, like any other statistical technique, does not provide by itself an explanation of social phenomena. Rather, statistical methods should be understood as tools to make data intelligible under certain theoretical frameworks (Bourdieu and Wacquant, 1992).
Building and interpreting linear models
Reliability and interpretability are desirable characteristics for the estimates of a linear model. As the estimates may change depending on the variables that are included in a model, variable selection is a crucial stage. Technical and theoretical criteria to include or exclude predictors do not necessarily align. For example, the inclusion of a variable that is very important for the theory at hand can easily worsen the goodness of fit of the model. Depending on the main goal of a model (summary vs prediction) one may favor technical or theoretical arguments. Going back to the basics of modeling and to its geometric interpretation can be informative for the purposes of variable selection.
For a long time, statisticians have reminded analysts that all models are wrong, but some of them are useful for research (Box, 1976). Useful models are typically models in which the predictors are: (i) few in number, (ii) well-clarified and (iii) measured with small error (Box, 1976; Mosteller and Tukey, 1977; Dobson and Barnett, 2008). The first of these criteria speaks directly to the variable selection stage. It is well known that when predictors are strongly correlated, estimation techniques struggle to separate the contribution of each predictor and therefore fail to capture the so-called true relationship between the dependent and the explanatory variables. This situation is commonly known as quasi-collinearity.
In principle, a useful research model should not include many predictors. However, including multiple variables on the right-hand side of equation 1 is desirable insofar as it allows one to adjust estimates for theoretically relevant factors. For instance, in economic studies on wage differentials by gender, it is important to control for factors such as educational attainment and economic sector if one is interested in measuring the level of discrimination against women in the labor market (Bruno, 2010). In epidemiological studies looking at the contribution of smoking to mortality, researchers often want to control for educational attainment and race, because of the potential relationship between these two variables and the outcome of interest (Ho and Elo, 2013). Another classic example can be found in demographic studies on the role of education in fertility decline. In this case, educational attainment is an explanatory variable (not a control) as it is thought to influence women’s fertility decisions. These studies have a strong case to control for place of residence, migration status, occupation and wealth, given the large heterogeneity of fertility along these dimensions (Castro and Juárez, 1995).
Typically, research strategies start with a bivariate model where the outcome variable is predicted by the variable of interest (sex, smoking behavior, educational attainment). Control variables are then added to the model. When neither the direction nor the significance of the estimates changes after adding control variables, the interpretation supports the bivariate association. For instance, if wage differences between men and women remain statistically significant after controlling for education and economic sector, the conclusion will go as follows: all things being equal, women have, on average, lower wages than men. On the contrary, changes in the magnitude of the estimates and their significance level after adding control variables are interpreted as evidence that the control mediates the relationship between the variable of interest and the outcome. A classic example of this can be found in studies of educational attainment and migration. Vallet and Caille (1996) reported a negative association between educational attainment and migration. However, once the authors control for family background characteristics, this negative association reverses. An analogous idea is at the core of demographic techniques of standardization. It is well known that comparisons of crude mortality rates can be misleading when the two populations of interest differ in their age-structure. Thus, adjusting (controlling) for age-structure is required to appropriately compare mortality levels (Preston et al., 2001; Deauvieau, 2011).
Control variables play the role of adjusting factors, yet the joint inclusion of several controls (over-controlling) makes the adjusting process unclear. Potentially ambiguous results as well as artificially significant results can arise when models are estimated with too many control variables. The sources of these potential problems can be pedagogically described using geometric representations and further assessed through specific measures.
Geometric representations of linear models help us do two things: (1) reinterpret the role of control variables in a linear model, and (2) identify one potential solution when ambiguous results occur. By geometric representation we refer to the generation of a three-dimensional space for the variables (one dependent and two independent ones) and the presentation of regression outputs as orthogonal (bivariate models) and oblique projections (multivariate models) of the dependent variable on the independent variables and on the plane spanned by them, respectively (Le Roux and Rouanet, 2004: chap. 1). Given that more than three dimensions cannot be visualized simultaneously, for cases involving more than three variables we rely on the ratio between the size of the orthogonal and the oblique projection as an indicator to evaluate the influence of control variables. In other words, the three-dimensional case is used for pedagogical purposes and its generalization is presented through numerical outputs. As for a potential solution, geometric representations also show that by replacing original variables with residuals (the part of a variable left unexplained by another) one can obtain predictors that are orthogonal to one another. These new predictors can be used to estimate new models, a technique known in the literature as residual regression. Residual regression constitutes the last step of our analysis.
We apply these procedures to a set of linear models based on Cukierman et al.’s (2002) data to assess two aspects: (1) how geometric representations (and ratios) can be informative for the selection of variables and for the construction of orthogonal predictors (residual regression), and (2) the extent to which the authors’ conclusions coincide with the conclusions we draw from a more parsimonious model and from a residual regression approach. In this case, we find that geometric representations support the selection of a model with fewer predictors than a theoretical approach would suggest. However, we also find that the authors’ conclusions basically hold for both models. We argue that more attention should be devoted to selecting variables and interpreting coefficients since there is no guarantee that the two approaches will always produce the same results.
Data and methods
Data
The data was originally recorded by Cukierman et al. (2002) to test the extent to which the independence of central banks from the state could affect inflation within 26 former socialist countries. Up to three years were recorded for each country between 1989 and 1998. The final data set contains 57 observations (country-year) for which the following variables were recorded (refer to Table 1 for descriptive statistics).
Table 1. Descriptive Statistics, Standardized Names and Correlations Among Dependent and Explanatory Variables
Inflation is the dependent variable, while the remaining ones are separated into two groups. Variables that characterize the relative independence of central banking are treated as variables of interest, whereas the indicator for war is used as a control; that is, it is included to consider the potential effect of war on inflation.
Table 1 presents descriptive statistics and correlation coefficients for each of the six variables. Additionally, Table 1 shows the name for the standardized version of each variable that will be used for further comparisons.
There is a strong correlation among most of the variables. Eight out of the 15 correlation coefficients have an absolute value above 0.5. This is not surprising because the variables of interest are measuring the same concept (liberalization). Additionally, there are good theoretical reasons to expect a strong correlation between the variables of interest and inflation. Moreover, the PLI indicator is part of the GLI index, which implies correlation by construction. Similarly, as MIN is a positive function of GLI and IND, their correlation comes by construction. A weak correlation is recorded between PLI and WAR (−0.072) and between GLI and WAR (−0.199). We will refer to these results later.
Methods
We fit all possible models for each variable to check how estimates change across models. These models are specified by combining all predictors while keeping one of them fixed at a time. For example, when we focus on the first variable (x1) we estimate all models with one, two, three, four and five predictors, keeping x1 in all specifications. The first model uses only x1 as predictor, the second one uses x1 and x2, the third one x1 and x3, etc. The number of possible models of a given size corresponds to the number of combinations of size p taken from a group of n variables, noted as $\binom{n}{p} = \frac{n!}{p!\,(n-p)!}$. With one variable kept in every specification and the remaining four free to enter or not, this yields 2^4 = 16 models per variable.
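The enumeration described above can be sketched in a few lines. This is only an illustration: the function name `models_containing` is hypothetical, and the labels x1 to x5 simply follow the standardized names used in the text.

```python
from itertools import combinations

def models_containing(focus, predictors):
    """Return every predictor subset that includes `focus`."""
    others = [p for p in predictors if p != focus]
    specs = []
    for size in range(len(others) + 1):
        for extra in combinations(others, size):
            specs.append((focus,) + extra)
    return specs

predictors = ["x1", "x2", "x3", "x4", "x5"]
specs_x1 = models_containing("x1", predictors)
total = sum(len(models_containing(p, predictors)) for p in predictors)
print(len(specs_x1), total)  # 16 80
```

Counting by variable means that the same specification appears once for each predictor it contains, which is why the total (80) exceeds the number of distinct non-empty subsets of five predictors (31).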
We then assess the variability of the coefficients associated with each variable with respect to the baseline model. Small changes are not problematic because the interpretation remains the same. Substantial changes, i.e. changes in the direction of the association, can instead be problematic as they reduce the intelligibility of the coefficients. In other words, unstable estimates imply that one could draw contradictory conclusions depending on the set of variables included in the model. Based on these results we select a model that only includes variables with stable results across specifications; we term this the parsimonious model. Further, we provide the geometric representation of all the bivariate models, the parsimonious model and the full model. Finally, we use residual regression as an alternative to avoid potential misinterpretation of linear models’ estimates. This technique makes all predictors orthogonal to one another, making estimates robust to the inclusion of several predictors.
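A minimal sketch of this stability check follows, assuming standardized variables and ordinary least squares. The data are synthetic and hypothetical (they are not the Cukierman et al. data), and `coefficient_paths` is an illustrative name rather than the procedure used for the results reported here.

```python
from itertools import combinations
import numpy as np

def standardize(a):
    """Center and scale to unit standard deviation (column-wise for 2-D input)."""
    return (a - a.mean(axis=0)) / a.std(axis=0)

def coefficient_paths(y, X, focus):
    """Coefficient of column `focus` in every model that contains it.

    y and the columns of X are assumed standardized, so no intercept is fitted.
    """
    others = [j for j in range(X.shape[1]) if j != focus]
    paths = {}
    for size in range(len(others) + 1):
        for extra in combinations(others, size):
            cols = [focus, *extra]
            coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            paths[tuple(cols)] = coef[0]
    return paths

# Synthetic, hypothetical data: five predictors sharing a common component.
rng = np.random.default_rng(1)
common = rng.normal(size=(57, 1))
X = standardize(common + 0.5 * rng.normal(size=(57, 5)))
y = standardize(X @ np.array([0.4, -0.3, 0.2, -0.5, 0.1]) + rng.normal(size=57))

paths = coefficient_paths(y, X, focus=0)
baseline = paths[(0,)]  # bivariate association (baseline model)
sign_changes = [cols for cols, b in paths.items() if b * baseline < 0]
print(len(paths), round(baseline, 3), len(sign_changes))
```

Specifications listed in `sign_changes` are those in which the coefficient of the focal variable has the opposite sign to its bivariate association, the kind of substantial change described above.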
As a reminder, geometric representations serve two purposes. First, they provide evidence on the usefulness of interpreting regression outputs as orthogonal and oblique projections among vectors. Second, they help us to visualize the cases in which the inclusion of a control can lead to an increase, a decrease, or a reversal in the size of an estimate. Each of these cases is presented as an area within the correlation circle, and factors associated to the size of each area are discussed.
Results
Bivariate, parsimonious and full models
Table 2 presents the bivariate associations between each of the five independent variables and inflation. These figures give imperfect information, but information that is more stable and intelligible than that coming from a full model, as they do not depend on other variables (Rouanet et al., 2002). These numbers provide references for further comparisons.
Table 2. Coefficients and R2 for Bivariate Models
As the variables are standardized, the bivariate associations are equal to the correlation coefficients presented in Table 1. For x1 (degree of independence of the central bank) the correlation is −0.464, which implies that among the observed country-years, an increase in the independence of the central bank is associated with a decrease in inflation. The proportion of variance explained by this model corresponds to the square of the coefficient, (−0.464)² = 0.215. The other three variables of interest also exhibit negative associations with inflation. These results are consistent with the economic theory behind the model. Conversely, the variable related to war conditions has a positive association. This is not unexpected given the well-known inflationary effects of war.
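The two identities used in this paragraph (standardized slope equals the correlation, and R2 equals its square) can be checked numerically. The sketch below uses made-up values, not the data analyzed here, and the helper names are illustrative.

```python
import math

def standardize(values):
    """Center and scale a list of numbers to unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def bivariate_fit(x, y):
    """OLS slope and R^2 for y ~ x after standardizing both variables."""
    zx, zy = standardize(x), standardize(y)
    slope = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)
    residual_ss = sum((b - slope * a) ** 2 for a, b in zip(zx, zy))
    total_ss = sum(b * b for b in zy)
    return slope, 1 - residual_ss / total_ss

# Hypothetical values chosen only to produce a negative association.
x = [0.2, 0.5, 0.1, 0.9, 0.7, 0.4]
y = [3.1, 2.0, 3.5, 0.8, 1.9, 2.2]
slope, r2 = bivariate_fit(x, y)
print(round(slope, 3), round(r2, 3), round(slope ** 2, 3))
```

The last two printed values agree, which is the same relation as (−0.464)² = 0.215 in the text.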
Figure 1 summarizes the results from all the 80 models that were fitted (16 per variable). We are aware that some models are duplicated; however, we chose to keep them all as the analysis is carried out by variable. The left panel displays the distribution of the 48 coefficients per variable (16 are non-redundant). For instance, the first boxplot (x1) corresponds to the estimated coefficients for the variable x1, i.e. β1 from all models that include x1. The gray line corresponds to the bivariate association (i.e. the model including only x1). On the right panel, each model is represented by several points depending on the number of covariates (e.g. four points for a model with four covariates), the R2 and the estimated association between y and the covariates. For example, the top-most square marker (□) on the right panel corresponds to one out of the four representations of the model y ∼ x1 + x2 + x3 + x5. The x-coordinate of this point is the R2 associated with this model (0.5), and the y-coordinate corresponds to the estimate of β1, i.e. the conditional association of x1 and y. The point is represented with a square as the variable of interest is x1. This very same model is represented by three other points. These three points have the same x-coordinate (R2 = 0.5) but different y-coordinates and markers, depending on the variable of interest. The y-coordinates for these three points correspond to the estimates of β2 (0.336, triangle marker ▴), β3 (−0.107, circle marker) and β5 (−1.1, X marker).

Figure 1. Summary results for models by variable of interest.
A gray background was added to points representing models with two covariates. These models are particularly interesting for the following reason: if there are two important factors influencing inflation, namely war conditions and liberalization, then two good measures of these concepts should be sufficiently informative, i.e. they should explain a large proportion of the variance, while providing consistent estimates for the association between them and the outcome variable. Two segments were added to point out the specification with two covariates that has the largest R2, namely, y ∼ x2 + x5.
Three conclusions can be derived from this figure. First, the coefficients associated with variable x2 are stable across models. This represents “the good case” and is due to the relatively low correlation between x2 and the other variables (−0.072, −0.199, −0.220, −0.236; see Table 1). Second, for variables x1, x3, x4 and x5 the coefficients are unstable. Compared to the bivariate association, the sign and magnitude of these coefficients vary from one model to another. This occurs to a lesser extent for x5, for which the estimated coefficients do not change sign. This situation can be termed “the bad case” to the extent that in some cases the relationship is positive and in others it is negative. Even though these changes may have a theoretical explanation (such as in the cases discussed in the previous section), in a context of several control variables (more than 3) an exploration of the sources of instability is necessary, as it is hard to believe that a single theoretical explanation could accommodate multiple changes in several coefficients at a time. Third, among the models with two variables (x2 being one of them), the one that includes variables x2 and x5 has the largest R2. This is not surprising given the low correlation between x2 and x5 (see Table 1).
On the one hand, the model that includes x2 and x5 constitutes a parsimonious alternative to study the relationship between independence of central banking and inflation, controlling for the potential effect of war. On the other hand, the full model includes the five variables, as the theory suggests that all these dimensions of liberalization can influence inflation. We term this model saturated. For the sake of brevity, we label the saturated model as (A) and the parsimonious one as (B). Cukierman et al. relied on the full model and their main conclusion goes as follows:
“Once the process of liberalization has gone far enough legal independence turns out to be effective in slowing inflation down. […] The cumulative index of liberalization developed by de Melo et al. (1996) exerts a significant negative influence on inflation, as is the case in their paper, mainly at low levels of cumulative liberalization” (Cukierman et al., 2002: 19).
Table 3. Comparison of Different Models
The bottom panel in Table 3 displays a set of ratios termed structural effects. For each variable, the structural effect is the ratio between its estimate in model A or B (the conditional association) and its bivariate association. For example, the ratio between the estimate for x1 in model A and the bivariate association of x1 and y is computed as (0.551/−0.464) = −1.19. This ratio reflects a reversal in the sign of the coefficient and an increase of 19% in its absolute value when all variables are included. In other words, compared to the bivariate association of x1 and y, the conditional association of x1 and y in model A is larger and has the opposite sign. β1 in model A is not intelligible since its size and direction become ambiguous. This case corresponds to an extreme scenario (sign reversal).
Structural effects in model A are large. Two coefficients (β1, β4) changed their sign, while the rest displayed changes in their magnitude. The smallest reduction occurred for β2, which is 27% smaller in model A compared to the bivariate specification (structural effect = 0.73). For the pair of variables with the highest correlation (ρ(x1, x5) = +0.931), the conditional association is larger than the bivariate one. Conversely, structural effects in model B are substantially smaller than those in model A (more stable coefficients). The more information is added to the model, the higher the risk of affecting the intelligibility of the coefficients due to the potential redundancy of the variables. Geometric representations give us visual tools to explore the sources of this instability.
Geometric representations
A standardized variable can be represented by a vector of norm equal to one. Then, all the variables in a linear model with one dependent variable and two predictors can be represented within a sphere of radius equal to one. The correlation coefficient ρ(i,j) between two variables (xi, xj) corresponds to the cosine of the angle (θ) between the corresponding pair of vectors (Le Roux and Rouanet, 2004).
If two variables are not correlated, their geometrical representation will correspond to two orthogonal vectors (θ = 90°) because cos(90°) = 0. Weak correlations correspond to angles close to the right angle. If the variables are positively correlated the angle between the vectors will lie between −90° and 90° (excluding both extremes). If the correlation is negative, the angle between them will belong to the open interval (90°, 270°). In all cases θ is a distance index that satisfies the triangle inequality, and the relation ρ = cos(θ) is an index of proximity defined in the interval [−1, 1] (Le Roux and Rouanet, 2004: chap. 1). Figure 2 displays the geometric representation of variables y, x2 and x5. Figure 2 also displays orthogonal and oblique projections of y on each variable (x2 and x5) and on the plane they generate. These projections constitute the geometric representation of the regression outputs.
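The correspondence between correlation and angle can be illustrated with a short sketch. The vectors below are toy examples chosen so that the angles are easy to verify by hand; the function names are illustrative.

```python
import math

def center(values):
    """Subtract the mean so that the vector represents deviations."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angle_between(x, y):
    """Angle in degrees between the centered vectors of two variables.

    The cosine of this angle is the Pearson correlation of x and y.
    """
    c = cosine(center(x), center(y))
    c = max(-1.0, min(1.0, c))  # guard against rounding just outside [-1, 1]
    return math.degrees(math.acos(c))

uncorrelated = angle_between([1, -1, 1, -1], [1, 1, -1, -1])
opposed = angle_between([1, 2, 3, 4], [4, 3, 2, 1])
print(round(uncorrelated, 1), round(opposed, 1))  # 90.0 180.0
```

Because `math.acos` returns values between 0° and 180°, this sketch reports the unsigned angle: correlations near +1 give angles near 0°, and correlations near −1 give angles near 180°.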

Figure 2. Correlation sphere with three reduced variables.
The regression of y on x2 corresponds to the orthogonal projection of y on x2 (labeled y2). Note that the residual of this regression (
All the aforementioned linear models can be defined in terms of orthogonal and oblique projections of the dependent variable (y) on the hyperplane formed by any combination of the independent variables (x1, x2, x3, x4 and x5). The length of the orthogonal projections of y on each variable, noted as yi (with i = 1, 2, 3, 4 and 5), corresponds to the bivariate association between variable xi and y, the length of the oblique projection of y on xi, noted as
Figure 3 displays a cross-sectional view of Figure 2 to further explore model B. As established above (see Table 2), the conditional associations are smaller than the bivariate ones given the correlation between the two variables. Note that if x2 and x5 were uncorrelated (i.e. θ = 90°) the bivariate and the conditional associations would coincide. As this is not the case, vectors x2 and x5 delimit areas in which the bivariate and the conditional associations differ. The gray and the white semicircles within the circumference in Figure 3 correspond to areas of reversal and areas of non-ambiguous differences between bivariate and conditional associations, respectively. By reversal we mean that the conditional association has the opposite sign to the bivariate one; by non-ambiguous change we mean that there is either accentuation or attenuation of the association in the same direction. The size of the gray area (reversal effect) depends on θ, i.e. on the correlation between the independent variables.
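For two standardized predictors, the conditional associations can be written directly in terms of the three pairwise correlations, which makes the role of θ explicit. The sketch below uses the standard two-predictor least-squares formula; the correlation values are hypothetical and are not taken from Table 1.

```python
def conditional_coefficients(r_y_a, r_y_b, r_a_b):
    """Standardized coefficients of y ~ a + b from the pairwise correlations.

    r_y_a, r_y_b: bivariate associations of y with each predictor.
    r_a_b: correlation between the two predictors (cosine of their angle).
    """
    denom = 1.0 - r_a_b ** 2
    beta_a = (r_y_a - r_a_b * r_y_b) / denom
    beta_b = (r_y_b - r_a_b * r_y_a) / denom
    return beta_a, beta_b

# Orthogonal predictors: conditional and bivariate associations coincide.
orthogonal = conditional_coefficients(0.3, 0.6, 0.0)

# Strongly correlated predictors: the first coefficient changes sign.
correlated = conditional_coefficients(0.3, 0.6, 0.8)

print(orthogonal)
print(tuple(round(b, 3) for b in correlated))
```

With these assumed values, the first predictor keeps its bivariate association of 0.3 when the predictors are orthogonal, but its conditional association becomes negative once the correlation between predictors reaches 0.8, which is a reversal in the sense defined above.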

Figure 3. Correlation circle. Regression of y on (x2, x5).
It follows that model B can be written as the sum of the two oblique projections
For illustrative purposes and without any loss of generality, equation 4 can be expressed as the sum of two oblique projections:

Regression of y on (x2, x1345).
In general, conditional and bivariate associations between a set of covariates and an outcome may differ. The potential difference between them is determined by the correlation of the variables included in the model (the angles that influence the projections). The examples shown above led us to identify three cases: accentuation (which includes the emergence of the association), attenuation (which includes the dissipation of the association) and reversal (a change in the sign of the association).
When these situations occur, we say that there is an effect due to the data’s structure. We define the magnitude of this effect as the ratio between the conditional and the bivariate association (structural effect). If this ratio is larger than 1 it implies that the conditional association is bigger than the bivariate one (accentuation). If the ratio is below one and above zero there is attenuation, and if the ratio is negative there is reversal. The borderline cases among these three situations are stabilization (between attenuation and accentuation), disappearance (between attenuation and reversal) and emergence (between accentuation and reversal). For a rigorous study of such cases see the work by Rouanet et al. (2002), where a scheme of rose des vents is used to summarize the above-described situations.
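The definition and the cases above translate into a small helper. This is a sketch with illustrative function names; the first example uses the x1 values quoted in the text for model A, and the second uses the structural effect reported for β2. The emergence case, where the bivariate association is zero and the ratio is undefined, is deliberately not handled.

```python
def structural_effect(conditional, bivariate):
    """Ratio of the conditional association to the bivariate one."""
    return conditional / bivariate

def classify(ratio, tol=1e-9):
    """Label a structural effect following the cases described in the text."""
    if abs(ratio) <= tol:
        return "disappearance"   # borderline between attenuation and reversal
    if ratio < 0:
        return "reversal"
    if abs(ratio - 1.0) <= tol:
        return "stabilization"   # borderline between attenuation and accentuation
    return "attenuation" if ratio < 1.0 else "accentuation"

ratio_x1 = structural_effect(0.551, -0.464)    # x1 in model A, values from the text
print(round(ratio_x1, 2), classify(ratio_x1))  # -1.19 reversal
print(classify(0.73))                          # attenuation
```

The tolerance `tol` is an assumption of this sketch; in practice the borderline cases are better judged against the precision of the estimates than against an exact threshold.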
Geometric representation suggests that residuals can be used to avoid the influence of the correlation among the variables of interest and the control. The residual of each regression is orthogonal with respect to the hyperplane generated by the independent variables. Using this fact, it is possible to use residuals as an orthogonal version of each variable of interest. This process is known as residual regression and will be described in the next section from a geometric point of view.
Residual regression
Each of the variables of interest (x1, x3, x4 and x5) was regressed on the control variable (x2) to obtain a new version of each of them in relation to the control. This new version is orthogonal to the control variable x2. We note these new variables as

Residual of x5 with respect to x2.
The dependent variable (y) was regressed on the orthogonal variables (
Even though the coefficients in models A and B coincide with the coefficients in models A⊥ and B⊥ (refer to Table 2), this equivalence is not an identity. The residual variables have a smaller standard deviation given that they correspond to orthogonal projections. To make adequate comparisons in terms of the size of the associations between the control and the variables of interest, the latter must be re-parametrized based on the variables’ standard deviations.
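A minimal numerical sketch of residual regression with one control and one variable of interest is given below. It assumes ordinary least squares on standardized variables and uses synthetic data (not the Cukierman et al. data); the variable names are illustrative. It shows the coincidence of the coefficient for the variable of interest, the orthogonality of the residual to the control, and the reduced standard deviation of the residual variable that motivates the re-parametrization.

```python
import numpy as np

def standardize(v):
    """Center and scale to unit (population) standard deviation."""
    return (v - v.mean()) / v.std()

def ols(y, *columns):
    """Least-squares coefficients of y on the given columns.

    No intercept is fitted: all inputs are assumed to be centered.
    """
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Synthetic, hypothetical data: a control and a correlated variable of interest.
rng = np.random.default_rng(0)
n = 200
control = rng.normal(size=n)
interest = -0.6 * control + rng.normal(size=n)
outcome = 0.5 * control - 0.4 * interest + rng.normal(size=n)
y, x_c, x_i = standardize(outcome), standardize(control), standardize(interest)

# Residual of the variable of interest with respect to the control.
x_i_perp = x_i - ols(x_i, x_c)[0] * x_c

b_full = ols(y, x_c, x_i)       # model with the original (correlated) predictors
b_perp = ols(y, x_c, x_i_perp)  # model with orthogonal predictors
b_biv = ols(y, x_c)             # bivariate association with the control

# Coefficient re-expressed per standard deviation of the residual variable.
b_perp_std = b_perp[1] * x_i_perp.std()
print(np.round(b_full, 3), np.round(b_perp, 3), round(float(b_perp_std), 3))
```

In this two-predictor set-up the coefficient on the residual variable equals the coefficient of the original variable in the full model, while the coefficient on the control equals its bivariate association with the outcome. Multiplying by the standard deviation of the residual, which is below one, puts the two coefficients on a comparable scale.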
In model B⊥ we have that:
Solving for
Now the coefficients are comparable. The ratio −0.504 / 0.425 = −1.19 implies that the residual association between inflation and the multiplicative independence of the central bank (
The same procedure can be applied to model A. Using the expressions for the orthogonal decomposition and standardizing the coefficients based on the variable’s standard deviations, we present in equation 9 the final version of model A.
Where
Conclusions and discussion
The stability of coefficients across bivariate and multivariate models is crucial for rigorous research in the social sciences given the growing availability of large and diverse data sets. Evidence presented here suggests that both the number of predictors and the correlation among them have a direct impact on the stability of the estimates. The more variables are included, the higher the risk of having unstable estimates. The same can be said about the correlation among predictors: the higher the correlation among independent variables, the more erratic the coefficients will be. Parsimonious models should be preferred over saturated ones. Even though we only presented one application, we believe our results can be generalized to contexts where analyses are conducted on observational data, to the extent that the variables of interest and the control variables may be correlated.
In that scenario, several correlated predictors may lead to ambiguous results, i.e. results where the direction of the relationship between the predictors and the outcome can be reversed when comparing bivariate and conditional associations. It is unlikely that a single theory could account for all potential changes across a large set of predictors (say, more than three). The ratio between the conditional association and the bivariate one can be used to measure the stability of the coefficients; we termed this ratio the structural effect. Although there is no clear threshold for this ratio, either a reversal (change in the sign) or a substantial change in the estimates can be taken as a signal of redundant information in the model. Starting from bivariate models and analyzing the structural effects is a good methodological practice in the process of variable selection. Moreover, reporting the bivariate associations alongside the conditional ones is key to assessing results from multivariate regression analysis. There may also be paradoxical situations regarding the statistical significance of the coefficients, that is, substantial changes in the p-value. However, we leave that discussion for another work.
Geometric representation served the purpose of illustrating analytically the causes of these paradoxical results. Bivariate and multivariate regression can be represented as orthogonal and oblique projections. These representations help us in the variable selection process insofar as they display both the regression outputs and the multiple correlations among predictors simultaneously. We showed that areas of ambiguous results can be relevant if the correlation among the independent variables is large. Furthermore, in a model with several predictors these areas become difficult to assess visually, and the evaluation of structural effects is crucial. In sum, reporting bivariate and conditional associations along with structural effects can help us to assess our conclusions when using multivariate linear modeling techniques.
These suggestions are relevant for any research that uses linear models on observational data, broadly defined as data sets in which the marginal distribution of the independent variables is not controlled beforehand, i.e. when the correlation among predictors is a feature of the data on its own. Overlooking this aspect may lead to ambiguous results as the inclusion of an additional control (or its omission) can produce two things: (1) a substantial change in the magnitude of the associations, (2) a change in their sign. If both associations are reported, a direct comparison can be made between them. Moreover, by comparing structural effects one can assess the extent to which the association between the variables of interest and the outcome is driven by the structure of the data, rather than by a true link between the outcome and the predictors. We believe these recommendations speak in particular to quantitative social scientists since most of the data we use are survey data, where several control variables are available and pertinent. A similar reflection is needed for event-history and multilevel models (Courgeau and Lelièvre, 1997; Courgeau, 2003). These two types of models improve on ordinary linear models as defined here, since they account for features such as right and left censoring and interdependence among levels; yet they also share important characteristics with the former (Dobson and Barnett, 2008).
Residual regression is an alternative to avoid areas of ambiguous results. Geometric representations show that results from this approach need to be re-parametrized due to the reduction of the variance in the orthogonal version of the predictors. Taking this into account can either yield more convincing evidence on the question at hand, as in the case of our example, or point out the necessity of more parsimonious models. This result suggests that caution ought to be exercised when using multivariate regression models.
All in all, we feel confident claiming that linear models should be used carefully when applied to research within the social sciences. Despite the attractiveness of including several controls, this practice should be, in principle, avoided or at least be accompanied by a systematic analysis of the structural effects. From this perspective, linear modeling appears as a tool to explore data structure (i.e. multivariate associations among outcomes and predictors) rather than a mechanism that explains social phenomena by itself. Our analysis confirms the two-way relationship between theory and methodology, which are often presented as separate matters. No theory can exist without rigorous data analysis that supports it, and all empirical analyses need to be theoretically informed. Consequently, explanation in the social sciences is not a matter of the statistical tool at hand, but a matter of making the appropriate connections between theory and methods.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Financial support provided for Andrés’ doctoral studies by the Fulbright Commission and the Population Studies Center at the University of Pennsylvania is gratefully acknowledged.
