Lasso Regularization for Selection of Log-linear Models: An Application to Educational Assortative Mating

Abstract

Log-linear models for contingency tables are a key tool for the study of categorical inequalities in sociology. However, the conventional approach to model selection and specification suffers from at least two limitations: reliance on oftentimes equivocal diagnostics yielded by fit statistics, and the inability to identify patterns of association not covered by model candidates. In this article, we propose an application of Lasso regularization that addresses the aforementioned limitations. We evaluate our method through a Monte Carlo experiment and an empirical study of educational assortative mating in Chile, 1990–2015. Results demonstrate that our approach has the virtue, relative to ad hoc specification searches, of offering a principled statistical criterion to inductively select a model. Importantly, we show that in situations where conventional fit statistics provide conflicting diagnostics, our Lasso-based approach is consistent in its model choice, yielding solutions that are both predictive and parsimonious.

Keywords

log-linear models, Lasso regularization, assortative mating, stratification

Introduction

Sociology has had a long-standing interest in categorical inequalities, ranging from black/white and male/female dichotomies to more complex classifications. Underlying this interest lies the assumption that some categories, such as race, class, and education, cannot be fully captured by gradational or scalar forms (Tilly 1999). Methodologically, log-linear models have played a central role in the quantitative study of these categories and their relationship to inequality and population change. This is because they allow researchers to explore patterns of association between categorical variables, instead of reducing them to aggregate-level measures (e.g., correlation coefficients). Indeed, most seminal studies on interracial marriage (e.g., Gullickson 2005; Qian 1997), assortative mating (e.g., Mare 1991; Schwartz and Mare 2005), intergenerational mobility (e.g., Duncan 1979; Hout 1984; Mare 1991), and migration flows (e.g., Little and Raymer 2013; Raymer and Rogers 2007; Willekens 1983) have used log-linear models for contingency tables as their main analytic strategy. These models are still the standard tool in the analysis of assortative mating (Schwartz 2013) as shown by recent publications in the discipline’s leading journals (e.g., Gullickson and Torche 2014; Schwartz 2010; Schwartz and Mare 2012; Schwartz, Zeng, and Xie 2016; Torche 2010).

More specifically, the main goal of log-linear models is to achieve an accurate description of the patterns of association between categorical variables while avoiding overfitting¹ the data. To resolve this trade-off between descriptive accuracy and parsimony, researchers typically rely on a combination of theory-based model specifications and goodness-of-fit statistics. Because log-linear models typically involve a large number of parameters, often statistics that penalize model complexity—such as the Bayesian information criterion (BIC) and the Akaike information criterion (AIC)—are preferred.

Despite its wide use, under some circumstances, this approach to model selection is problematic. First, it has been shown that goodness-of-fit statistics yield equivocal diagnostics regarding the “best” model when (i) the sample size is too small, (ii) the number of parameters is large, and/or (iii) tables are sparse (Clogg 1982; Dziak et al. 2015; Fitzmaurice and Goldthorpe 1997; Weakliem 1999). Unfortunately, combinations of these issues are common in empirical research. In such situations, the researcher faces the necessity of choosing a model specification based on limited statistical information, having to rely on prior knowledge on the subject. As a consequence, methodological decisions tend to be ad hoc, not always tractable and prone to subjective biases. Second, while this approach leads to selecting the best model among candidates, it is, by definition, insensitive to patterns of association that are not covered by the models under comparison. Thus, if a set of parameters are consequential for the descriptive capacity of the model but are not incorporated by any candidate specification, the researcher risks offering an overly simplistic representation of the phenomenon under investigation (see Hauser [1980] for an early attempt to address this issue).

To deal with these particular limitations, we introduce an innovative approach to specification and selection of log-linear models based on an application of the Least Absolute Shrinkage and Selection Operator (Lasso) regression (Tibshirani 1996). The Lasso is an established regularization tool with properties that are desirable for the aforementioned problems. Under conditions of sparse data and/or small sample size, this tool yields solutions that increase out-of-sample predictive power by preventing overfitting (Hastie, Tibshirani, and Friedman 2009; James et al. 2013). In addition, taking advantage of Lasso’s data-driven approach to variable selection, our application serves as a “discovery tool” that allows for the emergence of patterns potentially masked by solely theory-driven models.

In this article, we illustrate our proposed method by applying it to the study of educational assortative mating. Our analysis proceeds in two steps. First, we implement a Monte Carlo simulation to compare the performance of Lasso and conventional fit statistics. Findings show that our method recovers the simulated pattern of assortative mating, which was not the case with the conventional goodness-of-fit statistics. Second, we apply our method to an empirical case where conventional goodness-of-fit statistics yield inconsistent diagnostics. We demonstrate how our application of Lasso provides a systematic procedure to inductively decide on a model specification under these circumstances. Different Lasso solutions lead to consistent model specifications, which were both predictive—according to out-of-sample cross-validation—and parsimonious.

Together, the main contribution of this article is to provide researchers using log-linear models an alternative approach to select a model specification that achieves descriptive accuracy without sacrificing parsimony. Importantly, in situations where conventional fit statistics provide equivocal diagnostics, our approach has the virtue, relative to ad hoc specification searches, of offering a principled statistical criterion to inductively select a model. As mentioned, log-linear models have been the preferred strategy for the study of social mobility, racial intermarriage, and assortative mating in sociology. As data in these fields have increasingly gained complexity through the use of administrative records and big data (e.g., social media), we think an inductive and data-driven approach to model selection will be an especially useful tool for researchers. Finally, to facilitate other scholars’ use of Lasso regularization for selection of log-linear models, we provide detailed code with all the necessary steps to implement our method in R (see Section A.5 in the Online Appendix).

Log-linear Analysis and Model Selection

Log-linear Models for Contingency Tables

Log-linear models for contingency tables have been the standard approach to characterize patterns of social mobility, migration flows, and assortative mating. The basic goal of these models is to describe the patterns of association and interaction between the different levels of categorical variables (Agresti 2002). In particular, these models provide estimates of the association between categorical variables while controlling for changes in their marginal distribution. This is an important trait as researchers often want to disentangle the net association between categorical variables from compositional changes in the population. For example, in the case of assortative mating, it is important to distinguish the net increase in educational homogamy from the overall increase in educational attainment in the population due to educational expansion.

In contrast to linear regression, in log-linear models the outcome variable is the frequency of a phenomenon of interest, and both the outcome and the explanatory variable appear symmetrically rather than causally related in the model (Powers and Xie 2000). In other words, the estimation of the causal effect of a treatment variable on a particular outcome is not the purpose of log-linear analysis. Indeed, the central goal is to describe the patterns of association between two categorical variables, not focusing on their causal relationship.

Equation (1) describes a saturated log-linear model for the contingency table resulting from the cross-tabulation of three categorical variables, here denoted as row (R), column (C), and layer variable (L):

log F_{i j k} = λ_{0} + λ_{i}^{R} + λ_{j}^{C} + λ_{k}^{L} + λ_{i j}^{R C} + λ_{i k}^{R L} + λ_{j k}^{C L} + λ_{i j k}^{R C L},

where $F_{i j k}$ denotes the expected frequency in the $ijk$th cell of the table, ${i = 1, \dots, I}$ indexes the row variable, ${j = 1, \dots, J}$ indexes the column variable, and ${k = 1, \dots, K}$ indexes the layer variable. Main effects are given by $λ_{i}^{R}$, $λ_{j}^{C}$, and $λ_{k}^{L}$; differences in the marginal distribution of rows and columns by the layer variable are denoted by $λ_{i k}^{R L}$ and $λ_{j k}^{C L}$, respectively; and the partial association between rows and columns is denoted by $λ_{i j}^{R C}$. Finally, $λ_{i j k}^{R C L}$ indicates the three-way interaction between the row, column, and layer variables.

In log-linear modeling, the saturated model is the starting point of model selection. As observed in equation (1), this model incorporates all information existing in the data (i.e., one parameter for each data point), thus leading to a perfect fit. Evidently, such a model is of little use for the researcher because it does not provide a parsimonious account of the association between variables; it merely reproduces the observed patterns in the data.
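The statement that the saturated model spends one parameter per cell can be checked by counting terms. The following is a minimal sketch, assuming dummy coding with one reference level per variable; the function name and the 4 x 4 x 12 table size are hypothetical choices for illustration, not values taken from the analysis.

```python
def saturated_parameter_counts(I, J, K):
    """Count free parameters per term of the saturated model in equation (1),
    assuming dummy coding with one reference level per variable."""
    return {
        "intercept": 1,
        "R": I - 1,
        "C": J - 1,
        "L": K - 1,
        "RC": (I - 1) * (J - 1),
        "RL": (I - 1) * (K - 1),
        "CL": (J - 1) * (K - 1),
        "RCL": (I - 1) * (J - 1) * (K - 1),
    }

# Hypothetical table size: 4 row levels, 4 column levels, 12 layers.
counts = saturated_parameter_counts(4, 4, 12)
total = sum(counts.values())
print(total, 4 * 4 * 12)  # number of parameters vs. number of cells
```

Under these assumptions the two numbers coincide for any table size, which is why the saturated model fits the table perfectly.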

Typically, parameters are estimated by fitting a Poisson model predicting the vector of counts derived from the contingency table, where the levels of each variable operate as predictors. Consequently, the log of the expected value of counts can be expressed as:

log E [F | X] = X β,

where $β$ are obtained by maximizing $ℓ (β | X, Y)$ , the log-likelihood function of the data.
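For reference, the Poisson log-likelihood being maximized has the standard textbook form shown below. This expression is not quoted from the article; the notation is assumed here, with $y_{i}$ the observed count in cell $i$, $x_{i}$ the corresponding row of $X$, and $N$ the number of cells.

```latex
\ell(\beta \mid X, Y) = \sum_{i=1}^{N} \left[ y_i \, x_i^{\top} \beta - \exp\left( x_i^{\top} \beta \right) - \log(y_i!) \right]
```

The last term does not depend on $β$ and can therefore be ignored during optimization.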

Conventional Approach to Model Specification and Selection

To characterize the social or demographic patterns of interest, researchers typically test several theory-driven model specifications against the saturated model. These specifications are parsimonious constructs, each representing a theory about the processes that lead to the observed patterns in the data. In the particular case of assortative mating (our empirical application), most theory-driven specifications are variants of two model families: topological models, such as homogamy and crossing specifications, and ordinal models (Powers and Xie 2000). Importantly, the goal of these theory-driven specifications is to characterize marriage patterns by educational attainment over time, not to explain changes in educational assortative mating using particular predictors.

Homogamy models test whether individuals are more likely to marry partners with their same level of educational attainment. For instance, the constrained homogamy model assumes assortative mating patterns are captured by the marginal distribution of husbands’ and wives’ educational attainment, plus one three-way interaction that models educational homogamy across the main diagonal of the contingency table for each year (Powers and Xie 2000). Equation (3) formalizes the homogamy model. In contrast with the saturated model described in equation (1), in this model, changes in assortative mating are captured by a single term $γ_{k}^{Y} d_{i j}$, which represents the changes in the log odds of educational homogamy in year $k$ relative to a baseline year, where $d_{i j} = 1$ if husband’s education $i$ is equal to his wife’s education $j$ and $d_{i j} = 0$ otherwise:

log F_{i j k} = λ_{0} + λ_{i}^{H} + λ_{j}^{W} + λ_{k}^{Y} + λ_{i j}^{H W} + λ_{i k}^{H Y} + λ_{j k}^{W Y} + γ_{k}^{Y} d_{i j},

where $d_{i j} = I [i = j]$ .
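The indicator $d_{i j}$ is simply a matrix with ones on the main diagonal. A minimal sketch follows; the function name and the choice of four education levels are hypothetical.

```python
def homogamy_indicator(n_levels):
    """Return d[i][j] = 1 when husband's level i equals wife's level j, else 0."""
    return [[1 if i == j else 0 for j in range(n_levels)] for i in range(n_levels)]

# Hypothetical example with four education levels.
d = homogamy_indicator(4)
for row in d:
    print(row)
```

In the constrained homogamy model, every diagonal cell in year $k$ shares the same parameter $γ_{k}^{Y}$, so this single indicator, interacted with year, is all the model adds to the baseline terms.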

Different versions of this model relax some of the assumptions described above, allowing a more complex pattern of educational homogamy. For example, related models allow the strength of homogamy to vary across levels of educational attainment or incorporate parameters to capture symmetric or asymmetric patterns of intermarriage along the minor diagonals of the contingency table.

In the case of crossing models, the association between spouses’ education is represented as a series of barriers to intermarriage between education groups. The hypothesis implied by this model is that different categories of education present varying degrees of difficulty for crossing (Powers and Xie 2000; Schwartz and Mare 2005). Formally,

log F_{i j k} = λ_{0} + λ_{i}^{H} + λ_{j}^{W} + λ_{k}^{Y} + λ_{i j}^{H W} + λ_{i k}^{H Y} + λ_{j k}^{W Y} + γ_{i j k}^{H W Y},

where

γ_{i j k}^{H W Y} = \begin{cases} \sum_{q = j}^{i - 1} γ_{q k} & \text{if } i > j \\ \sum_{q = i}^{j - 1} γ_{q k} & \text{if } i < j \\ 0 & \text{otherwise} \end{cases}.

Here, the $γ_{q k}$ parameter represents the variation in the difficulty of crossing educational barrier $q$ in year $k$ relative to a baseline year. The crossing parameters capture the log odds of marriage between individuals in adjacent schooling categories relative to the log odds of homogamy, net of the marginal distributions of spouses’ education (Schwartz and Mare 2005). Thus, as expressed in equation (5), for couples that cross more than one barrier, the log odds of intermarriage are calculated by adding the parameters of each barrier crossed. As with homogamy models, crossing models permit relaxing some of their assumptions, as well as implementing hybrid specifications that combine different families of models.
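The additive rule in equation (5) can be illustrated for a single year. The sketch below assumes education levels numbered 1 to $I$, with barrier $q$ lying between levels $q$ and $q + 1$; the function name and the numeric barrier values are made up for illustration.

```python
def crossing_term(i, j, gamma):
    """Sum the barrier parameters crossed by a couple with education levels i and j.

    Levels are numbered 1..I; gamma[q] is the parameter for the barrier between
    levels q and q + 1 in one fixed year. Homogamous couples cross no barrier.
    """
    low, high = min(i, j), max(i, j)
    return sum(gamma[q] for q in range(low, high))

# Illustrative (made-up) barrier parameters for four education levels.
gamma = {1: -0.5, 2: -1.0, 3: -1.5}

print(crossing_term(1, 3, gamma))  # crosses barriers 1 and 2
print(crossing_term(4, 1, gamma))  # crosses all three barriers
print(crossing_term(2, 2, gamma))  # homogamy: no barrier crossed
```

Because only the lower and upper levels matter, the term is symmetric in husband's and wife's education, which matches the two symmetric cases in equation (5).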

Finally, another class of models incorporates the ordinal nature of the variables in the contingency table. Ordinal models assume these categories are ranked on either an observed or a latent scale and use this information to obtain parsimonious model specifications (Powers and Xie 2000). For example, in the linear-by-linear model, the association between row and column variables is scaled as a linear-by-linear interaction and expressed by a single term $β_{k}^{Y} u_{i} v_{j}$ (Hout 1983):

log F_{i j k} = λ_{0} + λ_{i}^{H} + λ_{j}^{W} + λ_{k}^{Y} + λ_{i j}^{H W} + λ_{i k}^{H Y} + λ_{j k}^{W Y} + β_{k}^{Y} u_{i} v_{j},

where $u_{i}$ and $v_{j}$ denote the measured attributes of husbands’ and wives’ educational attainment and $β_{k}^{Y}$ is the association parameter in year $k$. These models assume that scores $u_{i}$ and $v_{j}$ are known prior to modeling; in practice, this means researchers impose an external score structure on these terms (Powers and Xie 2000). In the case of the log-multiplicative layer effect model (also known as Unidiff), the scaling scores for spouses’ educational attainment are empirically estimated from the data² (Erikson and Goldthorpe 1992; Xie 1992). Formally,

log F_{i j k} = λ_{0} + λ_{i}^{H} + λ_{j}^{W} + λ_{k}^{Y} + λ_{i k}^{H Y} + λ_{j k}^{W Y} + ψ_{i j} ϕ_{k},

where $ψ_{i j}$ describes the association between spouses’ education and $ϕ_{k}$ indicates the year-specific deviations in this association. This term depicts how the pattern of assortative mating differs across years (Raymo and Xie 2000).

Adjudication across different model specifications is one of the key steps in log-linear analysis. Because models typically involve a large number of parameters, direct examination of parameter estimates across different models is often unfeasible. For this reason, it is customary to first select a model among several specifications and then examine the parameters of the chosen model to draw substantive conclusions. In this vein, the development of goodness-of-fit statistics to guide scholars in the process of model selection became a prolific field of academic discussion (Burnham and Anderson 2004; Grusky and Hauser 1984; Kuha 2004; Raftery 1995). These statistics include the likelihood ratio ($G^{2}$), the dissimilarity index ($D$), the AIC, and the BIC,³ among others.

The main goal of this process is to select a model specification that accurately fits the data while preserving parsimony. Given the high dimensionality of the parameter space, researchers commonly rely on statistics that penalize model complexity, in combination with their own previous knowledge on the case of study. In particular, the BIC is often the preferred statistic because it maximizes out-of-sample prediction and incorporates a penalty to the number of parameters—scaled by the log of the sample size—thus preserving model parsimony.⁴
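The fit statistics named above can be computed directly from observed and fitted counts. The sketch below uses the forms that are conventional in the log-linear literature ($G^{2} = 2 \sum f \log(f / F)$, BIC $= G^{2} - df \cdot \log n$, AIC $= G^{2} - 2 \cdot df$, and $D$ as the share of cases misclassified); the article's own footnoted definitions may differ in detail, and the function name and the 2 x 2 example counts are hypothetical.

```python
import math

def fit_statistics(observed, fitted, n_model_params):
    """Conventional fit statistics for a log-linear model (assumed definitions)."""
    n = sum(observed)
    df = len(observed) - n_model_params  # residual degrees of freedom
    # Likelihood-ratio statistic; cells with zero observed count contribute 0.
    g2 = 2.0 * sum(f * math.log(f / F) for f, F in zip(observed, fitted) if f > 0)
    # Dissimilarity index: proportion of cases that would have to change cells.
    dissimilarity = sum(abs(f - F) for f, F in zip(observed, fitted)) / (2.0 * n)
    return {
        "G2": g2,
        "df": df,
        "D": dissimilarity,
        "AIC": g2 - 2.0 * df,
        "BIC": g2 - df * math.log(n),
    }

# Hypothetical 2 x 2 table (flattened) and its fitted counts under independence.
observed = [10, 20, 30, 40]
fitted = [12.0, 18.0, 28.0, 42.0]
stats = fit_statistics(observed, fitted, n_model_params=3)
print(stats)
```

With these definitions, negative values of BIC or AIC favor the candidate model over the saturated model, and the BIC penalty grows with $\log n$ while the AIC penalty does not.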

Although widely used in the discipline, this approach to model selection can sometimes be problematic. First, under specific conditions, goodness-of-fit statistics yield equivocal diagnostics regarding which is the best model. Indeed, it is well known that different statistics tend to prefer ill-fitted models when the sample size is small (Clogg 1982; Weeden and Grusky 2005). For instance, it has been shown that if the null hypothesis is false and sample size is small, the BIC still tends to favor the null hypothesis, leading to underfitting (Atkinson 1978; Weakliem 2004). Yet a less recognized issue is that a large sample size, when combined with a large number of parameters, can induce the BIC to impose severe penalties on additional parameters. In these scenarios, (1) the prior distribution will tend to favor the null hypothesis as the marginal proportions become more unequal as $n$ increases and (2) cells that are irrelevant for a specific alternative hypothesis will be counted as evidence in favor of the null (Weakliem 1999). This favors simplistic models over alternatives that may have a better fit but are less parsimonious (Dziak et al. 2015; Weakliem 1999). To further complicate matters, whether $n$ is considered too small or too large highly depends on the interaction of different aspects in the empirical application (for details, check Dziak et al. 2015; Weakliem 1999).

In addition, scholars have shown the AIC is not a consistent criterion as it always contains some probability of selecting models that are too large, leading to overfitting (Hurvich and Tsai 1989; Kuha 2004). In particular, in the AIC, Type II error rates (retaining a false null hypothesis) decrease as $n$ increases, but the probability of Type I error (rejecting a true null hypothesis) remains constant and never approaches zero (Dziak et al. 2015). In other words, in contrast to the BIC, the AIC does not match the decrease in the probability of retaining a false simpler model (Type II error) with a decrease in the probability of rejecting a true simpler model (Type I error) as $n$ gets larger. Finally, goodness-of-fit statistics can yield misleading diagnostics when matrices are sparse (Chen and Chen 2012; Fitzmaurice and Goldthorpe 1997; Grusky and Hauser 1984). Indeed, in sociological research, “sampling zeros” are often encountered due to the nature of the phenomenon itself (i.e., strong association between variables induces small counts in some cells) coupled with finite sample sizes. Combinations of these circumstances are common in empirical studies, leaving the researcher with the necessity of choosing a model specification based on limited statistical information (Dziak et al. 2015). As a consequence, methodological decisions are not always tractable and are more prone to subjective biases.

Second, while this approach leads to selecting the best model among candidates, it is, by definition, insensitive to patterns of association that are not covered by the specifications under comparison. Thus, if a set of parameters are consequential for the descriptive capacity of the model but are not incorporated by any candidate specification, the researcher risks offering an overly simplistic representation of the phenomenon under investigation. In recognition of this limitation, Hauser (1980) proposed a data-driven method for specification of log-linear models, based on iterative fitting and examination of residuals to detect empirical patterns that are missed by theory-driven model specifications. In practice, this approach has not been widely used, presumably because it is labor intensive and runs the risk of overfitting the data.

An Alternative Approach: Model Selection via the Lasso

Overview of Lasso Regularization

In order to cope with some of the limitations explained above, we propose the use of the Lasso (Tibshirani 1996) as an alternative approach to model specification and selection. Lasso is a regularization technique for regression in which a penalty is introduced to constrain coefficient estimates by shrinking them toward zero (Hastie et al. 2009; James et al. 2013). The purpose of regularization is to increase out-of-sample predictive power, prevent overfitting, and facilitate interpretation when models involve a large number of parameters (potentially more parameters than observations). As with all regularization methods, the Lasso trades a reduction in variance for an increase in the bias of the estimates. Currently, the Lasso is one of the most established regularization methods in the statistical literature, being successfully applied to various fields⁵ (Jaggi 2014).

In its canonical form, the Lasso is a penalized version of ordinary least squares (OLS) where parameter estimates are obtained by minimizing the sum of squared errors, subject to a constraint on the sum of the absolute values of coefficients.⁶ In its Lagrangian form, we can express the Lasso problem as an optimization of the following objective function:

{\hat{β}}_{λ} = \underset{β}{argmin} \left\{ \frac{1}{N} \| Y - X β \|_{2}^{2} + λ \| β \|_{1} \right\},

where $λ$ is an externally defined tuning parameter that controls the strength of the Lasso penalty and, consequently, the level of sparseness of the solution. It can be easily observed that when $λ = 0$ the Lasso is equivalent to OLS, but as $λ$ increases, the Lasso penalty will cause some coefficients to shrink and, importantly, set some coefficient estimates exactly equal to zero. Because the Lasso does not have a closed form solution, the entire Lasso path (solutions for a range of $λ$ values) is computed through a modification of the least angle regression algorithm, and the value of $λ$ is chosen through cross-validation (see Online Appendix A2 for an overview of cross-validation), with the goal of minimizing the expected out-of-sample prediction error while preventing overfitting (Hastie et al. 2009; James et al. 2013).
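How the penalty sets coefficients exactly to zero is easiest to see in a special case. The sketch below assumes orthonormal predictors ($X^{\top} X / N = I$), which is an illustrative assumption and not the setting of the analysis; in that case the minimizer of the objective above is the OLS estimate passed through a soft-thresholding operator with threshold $λ / 2$. The function names are hypothetical.

```python
def soft_threshold(z, threshold):
    """Shrink z toward zero by `threshold`; values within the threshold become 0."""
    if z > threshold:
        return z - threshold
    if z < -threshold:
        return z + threshold
    return 0.0

def lasso_orthonormal(ols_estimates, lam):
    """Lasso solution when predictors are orthonormal (X'X / N = I).

    For the objective (1/N) * ||Y - X b||^2 + lam * ||b||_1, each coefficient
    is the OLS estimate soft-thresholded at lam / 2.
    """
    return [soft_threshold(b, lam / 2.0) for b in ols_estimates]

ols = [2.0, -0.3, 0.1]                 # made-up OLS estimates
print(lasso_orthonormal(ols, 0.0))     # lam = 0 reproduces OLS
print(lasso_orthonormal(ols, 1.0))     # small coefficients are set exactly to zero
```

With correlated predictors no such closed form exists, which is why the path of solutions is computed numerically over a grid of $λ$ values.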

Lasso penalties can also be applied to generalized linear models (Zou and Hastie 2005). In the case of Poisson models, the current standard tool for estimation of log-linear models, the penalized negative log-likelihood to be minimized is:

{\hat{β}}_{λ} = \underset{β}{argmin} \left\{ - \frac{1}{N} ℓ (β | Y, X) + λ \| β \|_{1} \right\},

where $ℓ (β | Y, X)$ is the Poisson log-likelihood for observations $\{y_{i}\}_{i = 1}^{N}$. Refer to equation (2) to see the general log-likelihood formulation.

Our Approach: Specification and Selection of Log-linear Models via the Lasso

The reliance on goodness-of-fit statistics for model selection is problematic under some conditions. As mentioned, when researchers are working with small sample sizes, sparse matrices, or large samples with many parameters, these statistics tend to disagree in their diagnostics. This situation is not rare in the context of log-linear modeling. To deal with model selection under these conditions, we propose the use of Lasso regularization for contingency tables. We argue that the Lasso has desirable properties that make it a suitable device for this endeavor.

In what follows, we show that complementing the conventional approach to log-linear models with Lasso regularization may help researchers to circumvent some of the aforementioned obstacles. Moreover, being an inductive data-driven method, it can also assist researchers as a “discovery” tool in the process of deciding which model specifications will be compared via goodness-of-fit statistics.

Our proposed method unfolds in three core steps. First, we create an $n$-dimensional contingency table resulting from the cross-tabulation of all variables of interest (e.g., education of the partners and survey year). This contingency table is then converted into a standard data set format, where the table’s cells become a vector of frequencies that we treat as the outcome variable, while the variables that define the table are treated as predictors.

Second, we fit a saturated model to predict the counts in the contingency table. Such a model specification allows for the full interaction between the discrete variables that generate the contingency table⁷ and thus comprises all the available information in the table. Consequently, when estimated via unpenalized maximum likelihood (the standard estimation approach to generalized linear models), the saturated model dedicates one parameter to each observation (i.e., each cell), fitting the data perfectly and thus yielding a nonparsimonious output. Our approach consists of estimating the parameters in the saturated model for the contingency table using a Poisson model with a Lasso penalty. The key property that makes the Lasso a suitable device for the task at hand is that the $ℓ_{1}$ norm induces sparsity in the estimate and thus it yields a parsimonious solution without having to rely on an a priori reduction of the parameter space. In other words, because the Lasso shrinks coefficients toward zero by a constant amount (a function of $λ$), parameters whose absolute value is less than this threshold are shrunk to exactly zero. This feature makes the Lasso not only a regularization method but also a variable selection tool.

Moreover, because our starting point is the saturated model for the contingency table, the selection of variables performed by Lasso is equivalent to a data-driven specification of a log-linear model. That is, while in principle every cell in the contingency table has a parameter attached, Lasso will set to zero the coefficients for those cells that are irrelevant for improving model fit. Thus, by inducing sparseness, Lasso will transform an initial situation with an equal number of parameters and observations into a classic parsimonious model, one with fewer parameters than observations.
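The first two steps can be sketched as a data transformation: flatten the table into one row per cell, then build the design matrix of the saturated model. The code below is a minimal illustration, assuming dummy coding with level 0 of each variable as the reference category; the function names and the 2 x 2 x 2 counts are hypothetical, and the penalized Poisson fit itself is left to a dedicated routine.

```python
from itertools import product

def table_to_long(table):
    """Flatten a nested list table[i][j][k] of counts into (i, j, k, count) rows."""
    rows = []
    for i, block in enumerate(table):
        for j, cells in enumerate(block):
            for k, count in enumerate(cells):
                rows.append((i, j, k, count))
    return rows

def saturated_design(rows, I, J, K):
    """Dummy-coded saturated design matrix with level 0 as reference.

    Each term is a triple (a, b, c): a value of 0 means the variable does not
    enter the term, a positive value selects that level's dummy. The term
    (0, 0, 0) is the intercept, (a, 0, 0) a main effect, (a, b, 0) a two-way
    interaction, and (a, b, c) with all entries positive a three-way interaction.
    """
    terms = list(product(range(I), range(J), range(K)))
    design = [
        [int((a == 0 or a == i) and (b == 0 or b == j) and (c == 0 or c == k))
         for a, b, c in terms]
        for i, j, k, _ in rows
    ]
    return terms, design

# Hypothetical 2 x 2 x 2 table of counts (e.g., husband x wife x year).
table = [[[30, 25], [10, 12]], [[8, 9], [40, 52]]]
rows = table_to_long(table)
counts = [row[3] for row in rows]            # outcome vector
terms, design = saturated_design(rows, 2, 2, 2)
print(len(design), len(design[0]))           # cells x parameters
```

The design matrix is square, with as many columns as cells, which is the starting point that the Lasso penalty then prunes.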

Third, we select the strength of the penalty (controlled by the parameter $λ$), and the corresponding regression coefficients, using 10-fold cross-validation⁸ and Poisson deviance as the measure of cross-validation error.⁹ In particular, we choose the value of $λ$ that minimizes the overall cross-validation error, thus maximizing the predictive capacity of the model and preventing overfitting. Formally, we can express the fitted model as follows:

log \hat{F} = S {\hat{β}}_{λ},

where $\hat{F}$ is the predicted vector of counts in the contingency table, $S$ is the design matrix for the saturated model, and ${\hat{β}}_{λ}$ is the vector of Lasso regression coefficients for a chosen value of the tuning parameter $λ$ . The result is a sparse vector of coefficients which can be interpreted in the same fashion as traditional log-linear models.¹⁰
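The third step can be sketched in terms of its two ingredients: the Poisson deviance used as the error measure, and the rule of picking the $λ$ with the lowest average held-out error. The code below is a minimal illustration that assumes the per-fold deviances have already been computed by some fitting routine; the function names and all numeric values are made up.

```python
import math

def poisson_deviance(y, mu):
    """Poisson deviance between observed counts y and predicted means mu."""
    total = 0.0
    for y_i, mu_i in zip(y, mu):
        # By convention y * log(y / mu) is 0 when y == 0.
        log_term = y_i * math.log(y_i / mu_i) if y_i > 0 else 0.0
        total += 2.0 * (log_term - (y_i - mu_i))
    return total

def select_lambda(cv_deviances):
    """Pick the lambda with the smallest mean held-out deviance across folds.

    cv_deviances maps each candidate lambda to a list with one held-out
    deviance per fold.
    """
    mean_error = {lam: sum(errs) / len(errs) for lam, errs in cv_deviances.items()}
    best = min(mean_error, key=mean_error.get)
    return best, mean_error

# Made-up held-out deviances for three candidate penalties and three folds.
cv_deviances = {
    0.01: [5.2, 6.1, 5.7],
    0.10: [4.0, 4.4, 4.2],
    1.00: [7.9, 8.3, 8.1],
}
best_lambda, mean_error = select_lambda(cv_deviances)
print(best_lambda)
```

In this toy example the smallest penalty overfits and the largest underfits, so the intermediate value has the lowest average held-out deviance.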

Together, we claim there is an important parallelism between our approach and the conventional approach to log-linear models in social science. In particular, we can think of any topological model specification as a saturated model where some parameters are a priori set to zero to provide a stylized representation of the world. Analogously, our approach accomplishes the same purpose, but the shrinkage decisions are inductive and data driven. The three steps described above constitute the core elements of our application of Lasso regularization to log-linear models. As we will illustrate in “Empirical Application: Educational Assortative Mating in Chile 1990–2015” Section, researchers might face additional methodological choices when applying Lasso regularization in empirical settings. Before delving into these particular decisions, we highlight two general caveats.

First, it is worth noticing that there are many ways to represent a saturated model when predictors are categorical. While the choice of coding scheme does not affect estimation procedures in standard models (Luo et al. 2016), this is not the case in regularized categorical regression (Chiquet, Grandvalet, and Rigaill 2016). In this context, both estimates and the predictive performance yielded by different penalties have been shown to be sensitive to coding schemes. Although there are no preferable penalty–coding combinations, some authors advocate for the use of coding schemes that yield meaningful reference levels and encourage users to check the sensitivity of solutions to different coding schemes (Chiquet et al. 2016; Tutz and Gertheiss 2016). Throughout this article, we use dummy coding for all independent variables, taking the lowest educational level of each partner and the starting year in the time series as reference categories. We choose this representation because this is the convention in the log-linear model literature, but also because it enables a clearer interpretation of parameters, making it possible to disentangle changes due to marginal distributions from transformations in assortative behavior.

Second, it is important to warn the reader that Lasso regularization is not always an appropriate tool for model selection. If the goal of a particular model is to draw inferences on the causal effect of $X$ on $Y$ conditioning on a series of control variables, Lasso regularization will likely generate omitted variable bias (Belloni, Chernozhukov, and Hansen 2013). The latter is because the Lasso is ultimately a predictive tool, and therefore variables that are not highly predictive of $Y$, but that are correlated with $X$, might be set to zero, generating omitted variable bias (Leeb and Pötscher 2008). In contrast, the Lasso is an appropriate tool for model selection in log-linear analysis, precisely because the goal of these models is not causality.

Empirical Evaluation

In this section, we implement and evaluate the performance of our method in two steps. First, we validate our approach in a realistic Monte Carlo simulation regarding educational assortative mating. For illustration purposes and because we control the data generating process, we apply the simplest version of our Lasso-based approach (as described in “Our Approach: Specification and Selection of Log-linear Models via the Lasso” Section) to analyze the simulated data. In a nutshell, this analysis shows that the Lasso recovers the data generating process set in the simulations much more closely than the conventional tools for model selection. Second, we apply our method to the study of assortative mating in Chile and compare the patterns yielded by the conventional approach versus Lasso regularization. For this case, in which we do not know the data generating process a priori, we present the reader with additional methodological tools that might make the use of Lasso regularization more feasible in empirical settings. In particular, we introduce additional ways to control the Lasso penalties, as well as statistical and substantive criteria to evaluate the performance of different models. Findings indicate that our approach provides a systematic procedure to inductively decide on a model specification under circumstances where goodness-of-fit statistics yield equivocal diagnostics. In contrast, different Lasso solutions lead to consistent model specifications, which are both predictive (according to out-of-sample cross-validation) and parsimonious. In addition, we show how insights from our inductive approach can be used in combination with traditional log-linear models.

A Monte Carlo Experiment: Simulated Patterns of Educational Assortative Mating

We evaluate the performance of both the conventional approach to model selection and our Lasso-based method at identifying simulated patterns of educational assortative mating. In this simulation, we create a stylized version of assortative mating where educational homogamy increases over time among college graduates and decreases over time among individuals with only an elementary education. In addition, we allow for heterogamy in the minor diagonals of the contingency table, where educational hypogamy and hypergamy have the same strength and do not change over time. Outside the major and minor diagonals, all assortative mating is driven by the marginal distributions, which are set to be time invariant (see details on the data generating process in Online Appendix A4). To be consistent with our empirical case described in the next section, we simulate the data for the same number of survey years as in the Chilean case, that is, 12 measurements that span a period of 25 years.

We choose this data generating process for two reasons. First, the patterns of association set in our simulation have a clear correspondence to one of the theory-driven model specifications (model 7 below). This ensures that such a specification could, in principle, be recovered with the conventional approach to model selection. Second, our simulated pattern of assortative mating encompasses only a subset of the parameters considered in the aforementioned model specification. This makes it a suitable setting to test whether our Lasso-based method is able to accurately describe the data, while still retaining model parsimony (i.e., not using unnecessary parameters).

We characterize the simulated pattern of assortative mating through two strategies. First, following the conventional approach to model selection, we test several specifications regarding the mating structure and adjudicate across models using goodness-of-fit statistics. More specifically, model 1 (independence) assumes that all association between husband’s education (H), wife’s education (W), and year (Y) is explained by their marginal distribution, and thus the three variables are independent of each other. Model 2 (conditional independence) assumes that the educational similarity of partners is time invariant but allows the distributions of husband’s education and wife’s education to vary over time. Substantively, this means there are no changes in the net association between husband’s and wife’s education over time. Model 3 builds on model 2 but adds a special parameter to capture changes in homogamy across years that is constrained to be the same over educational levels. Model 4 relaxes model 3 by allowing the homogamy parameter to vary by educational category. Model 5 goes back to a constrained parameter for homogamy but allows symmetrical movements across the minor diagonals, where these movements are allowed to vary over time. Model 6 relaxes these movements by allowing hypergamy to be different from hypogamy. Models 7 and 8 are similar to models 5 and 6, respectively, but allow an unconstrained major diagonal. Moreover, model 9 introduces a classical crossing model. Model 10 adds to model 9 a constrained major diagonal. Further, model 11 relaxes the assumption made in model 10, allowing heterogeneous homogamy. Specification 12 is a linear-by-linear model, where a single term¹¹ captures the linear-by-linear interaction between husband’s and wife’s education. This parameter is allowed to vary over time. Lastly, specification 13 is a log-multiplicative layer effect model (Unidiff) where the layer (L) corresponds to year.¹²

As a second strategy, we use Lasso regularization in order to perform data-driven, automatized model selection.

Given that BIC is asymptotically consistent¹³ but performs poorly in small samples and/or on sparse data, we implement these two approaches for three simulated sample sizes: a small one (average of 19 counts per cell and total count of 5,700), a medium one (average of 117 counts per cell and total count of 35,100), and a very large one (average of 833 counts per cell and total count of 249,900). Furthermore, the pattern of assortative mating we imposed implies that some regions of the contingency table are heavily overpopulated, resulting in serious sparseness for the small sample (e.g., 40 percent of the table’s cells present a count lower than 10 and 17 percent of cells have counts lower than 5) but no sparseness for the medium and very large samples (Figure 1 displays the distribution of cell frequencies in all three scenarios). Comparing the performance of the two described approaches across different sample sizes and sparseness levels gives us insights into the conditions, if any, under which the Lasso can effectively complement the diagnostics of goodness-of-fit statistics for selection of log-linear models. We compare the performance of both model selection strategies by evaluating which one most closely captures the patterns set in the simulated data.

Figure 1.

Monte Carlo experiment. Distribution of cell frequencies in simulated contingency tables. (A) Small sample size (19 counts per cell on average), (B) medium sample size (117 counts per cell on average), and (C) very large sample size (833 counts per cell on average).

Findings with conventional approach to selection of log-linear models

Given that we control the data generating process, we know beforehand that among the different log-linear specifications, the simulated pattern of assortative mating is represented by a model with an unconstrained major diagonal and symmetric movements along the minor diagonals (model 7 in Tables 1–3). Nevertheless, Tables 1–3 show that goodness-of-fit statistics suggest other specifications as the preferred model. In the case of the small sparse data, the BIC and AIC show a strong preference for the simplest model specification (model 2 in Table 1), one where assortative mating is entirely driven by the marginal distributions and over time changes in education. When the same analyses are conducted on a medium size sample (Table 2), the BIC and AIC no longer agree on a preferred model: While the AIC chooses a more complex model with unconstrained homogamy (model 4 in Table 2), the BIC still prefers the conditional independence model (model 2 in Table 2). Only when we use very large and nonsparse data do the AIC and BIC coincide on a preferred specification, an unconstrained homogamy model (model 4 in Table 3) that only partially captures the patterns of assortative mating set in the simulation. In this scenario, both AIC and BIC rate model 7, a specification able to capture the underlying true model, as the second best choice.

Table 1.

Log-linear Models of the Association between Husband’s and Wife’s Educational Attainment, Simulated Small Data (Average n per Cell = 19).

Model	df	G ²	100 × D	AIC	BIC
(1) [W][H][Y]	280	2,538.59	26.21	1,978.59	2,702.98
(2) M1 + [HW][WY][HY]	176	215.69	6.31	−136.31	1,279.94
(3) M2 + [D][DY]	165	202.74	5.89	−127.26	1,362.16
(4) M2 + [d][dY]	121	133.55	3.73	−108.45	1,673.68
(5) M3 + [S][SY]	154	187.16	5.48	−120.84	1,441.76
(6) M3 + [A][AY]	143	181.25	5.39	−104.75	1,531.03
(7) M4 + [S][SY]	110	116.20	3.35	−103.80	1,751.51
(8) M4 + [A][AY]	99	110.66	3.18	−87.34	1,841.15
(9) M2 + [C][CY]	132	152.97	4.49	−111.03	1,597.92
(10) M9 + [D][DY]	121	138.97	4.34	−103.03	1,679.10
(11) M9 + [d][dY]	99	106.28	3.13	−91.72	1,836.76
(12) Linear by Linear	66	68.33	2.58	−63.67	2,084.34
(13) Unidiff	165	205.02	6.10	−124.98	1,459.62

Note. W = wife’s education; H = husband’s education; Y = year; Diag = diagonal constrained; d = main diagonal unconstrained; S = symmetric movements minor diagonal; A = asymmetric movements minor diagonal; C = crossing parameters.

Table 2.

Log-linear Models of the Association between Husband’s and Wife’s Educational Attainment, Simulated Medium Data (Average n per Cell = 117).

Model	df	G ²	100 × D	AIC	BIC
(1) [W][H][Y]	280	13,978.78	24.19	13,418.78	14,177.67
(2) M1 + [HW][WY][HY]	176	399.91	4.21	47.91	1,687.46
(3) M2 + [D][DY]	165	386.54	4.08	56.54	1,789.23
(4) M2 + [d][dY]	121	88.38	1.34	−153.62	1,951.65
(5) M3 + [S][SY]	154	340.86	3.77	32.86	1,858.70
(6) M3 + [A][AY]	143	334.50	3.74	48.50	1,967.48
(7) M4 + [S][SY]	110	83.46	1.31	−136.54	2,061.88
(8) M4 + [A][AY]	99	76.70	1.22	−121.30	2,170.26
(9) M2 + [C][CY]	132	165.14	2.09	−98.86	1,913.27
(10) M9 + [D][DY]	121	136.56	1.92	−105.44	1,999.84
(11) M9 + [d][dY]	99	68.24	1.12	−129.76	2,161.80
(12) Linear by Linear	66	47.86	0.93	−84.14	2,486.87
(13) Unidiff	165	346.50	3.76	16.50	1,864.34

Table 3.

Log-linear Models of the Association between Husband’s and Wife’s Educational Attainment, Simulated Very Large Data (Average n per Cell = 833).

Model	df	G ²	100 × D	AIC	BIC
(1) [W][H][Y]	280	102,399.45	24.47	101,839.45	102,635.59
(2) M1 + [HW][WY][HY]	176	2,042.66	3.41	1,690.66	3,571.37
(3) M2 + [D][DY]	165	1,981.82	3.31	1,651.82	3,647.24
(4) M2 + [d][dY]	121	98.45	0.54	−143.55	2,310.72
(5) M3 + [S][SY]	154	1,730.80	2.94	1,422.80	3,532.93
(6) M3 + [A][AY]	143	1,719.23	2.93	1,433.23	3,658.07
(7) M4 + [S][SY]	110	88.83	0.51	−131.17	2,437.82
(8) M4 + [A][AY]	99	77.20	0.45	−120.80	2,562.90
(9) M2 + [C][CY]	132	620.82	1.42	356.82	2,696.38
(10) M9 + [D][DY]	121	410.42	1.24	168.42	2,622.69
(11) M9 + [d][dY]	99	77.95	0.49	−120.05	2,563.65
(12) Linear by Linear	66	58.10	0.43	−73.90	2,953.95
(13) Unidiff	165	1,808.73	3.19	1,478.73	3,610.87

Overall, we can see that different goodness-of-fit statistics yield different diagnostics, leaving the researcher in need of choosing a model specification on the basis of her or his prior knowledge of the case under study. Moreover, while ambiguity in the diagnostics is more severe when using small sparse data, this limitation is not entirely solved by working with a medium-sized sample. By contrast, when the analyses are conducted on very large nonsparse data, AIC and BIC yield comparable diagnostics and are more or less able to identify the true model among the candidates. These results are consistent with the well-studied asymptotic properties of these two goodness-of-fit statistics and their poor performance on small and/or sparse data.
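As a rough illustration of how the statistics in Tables 1–3 relate to one another, the following Python sketch computes the likelihood-ratio statistic $G^{2}$, the dissimilarity index (100 × D), and the AIC from observed and fitted cell counts. The function name and the toy 2 × 2 table are hypothetical. The AIC formula $G^{2} − 2 × df$ is inferred from the tabulated values (e.g., for model 1 in Table 1, 2,538.59 − 2 × 280 = 1,978.59); the BIC is omitted because its exact form is not stated in this section.

```python
import math

def fit_statistics(observed, fitted, df):
    """Goodness-of-fit summaries for a log-linear model (illustrative sketch).

    observed, fitted: flat lists of cell counts in the same order.
    df: residual degrees of freedom of the fitted model.
    """
    n = sum(observed)
    # Likelihood-ratio statistic; cells with zero observed count contribute 0.
    g2 = 2.0 * sum(o * math.log(o / f) for o, f in zip(observed, fitted) if o > 0)
    # Dissimilarity index: percentage of cases misclassified by the model.
    d100 = 100.0 * sum(abs(o - f) for o, f in zip(observed, fitted)) / (2.0 * n)
    # AIC in the form that matches the tabulated values (assumption noted above).
    aic = g2 - 2.0 * df
    return {"G2": g2, "100xD": d100, "AIC": aic}

# Hypothetical 2 x 2 table with strong homogamy, fitted under independence.
toy_observed = [40, 10, 10, 40]
toy_fitted = [25.0, 25.0, 25.0, 25.0]
toy_stats = fit_statistics(toy_observed, toy_fitted, df=1)
print(toy_stats)
```

Because each statistic trades off fit and complexity differently, the same set of fitted counts can rank models in different orders, which is the ambiguity discussed above.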

Findings with selection of log-linear models via the lasso

In this section, we approach model selection via the Lasso. First, we report performance metrics of the Lasso for the three samples. Figure 2 illustrates how coefficients are shrunk under different values of $λ$ . The starting point in each graph is a saturated model with a $λ$ equal to zero that contains as many parameters as cells in the contingency table. At bigger values of $λ$ , the number of coefficients shrunken to zero increases, reaching the null model when $λ$ is sufficiently large.

Figure 2.

Monte Carlo experiment. Path of Lasso regularized coefficients from the saturated model. (A) Small sample, (B) medium size sample, and (C) very large sample.
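The shrinkage behavior described above can be sketched on a toy example. The following Python code fits an L1-penalized Poisson regression to the saturated model of a hypothetical 2 × 2 table using proximal gradient descent with backtracking. This is an illustrative solver written for this sketch, not the software used in our analyses; the data, function names, and the choice to leave the intercept unpenalized are assumptions.

```python
import math

def soft_threshold(z, t):
    """Proximal operator of t * |z|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def poisson_nll(X, y, beta):
    """Mean Poisson negative log-likelihood (constant log(y!) term dropped)."""
    total = 0.0
    for row, count in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, row))
        total += math.exp(eta) - count * eta
    return total / len(y)

def poisson_lasso(X, y, lam, n_iter=5000):
    """Proximal gradient descent; beta[0] (intercept) is left unpenalized."""
    n, p = len(y), len(X[0])
    beta = [math.log(sum(y) / n)] + [0.0] * (p - 1)
    step = 1.0
    for _ in range(n_iter):
        mu = [math.exp(sum(b * x for b, x in zip(beta, row))) for row in X]
        grad = [sum((m - c) * row[j] for m, c, row in zip(mu, y, X)) / n
                for j in range(p)]
        f_old = poisson_nll(X, y, beta)
        while True:  # backtracking: halve the step until sufficient decrease
            cand = [beta[0] - step * grad[0]] + [
                soft_threshold(beta[j] - step * grad[j], step * lam)
                for j in range(1, p)
            ]
            diff = [c - b for c, b in zip(cand, beta)]
            bound = (f_old + sum(g * d for g, d in zip(grad, diff))
                     + sum(d * d for d in diff) / (2.0 * step))
            if poisson_nll(X, y, cand) <= bound + 1e-12:
                break
            step *= 0.5
        beta = cand
    return beta

# Hypothetical 2 x 2 table; columns: intercept, row, column, row x column.
X_toy = [[1, 0, 0, 0], [1, 0, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
y_toy = [40, 10, 10, 40]

for lam in (0.0, 1.0, 1000.0):
    coefs = poisson_lasso(X_toy, y_toy, lam)
    n_nonzero = sum(1 for b in coefs[1:] if b != 0.0)
    print(lam, n_nonzero, [round(b, 3) for b in coefs])
```

With a penalty of zero the fit reproduces the observed counts (the saturated model), whereas a sufficiently large penalty sets every penalized coefficient exactly to zero, mirroring the two ends of the paths in Figure 2.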

Following standard practice, we choose the value of $λ$ through cross-validation (Hastie et al. 2009). As can be seen in Figure A2 in the Online Appendix, we select the tuning parameter for which the cross-validation error rate is the smallest. In particular, we choose a $λ$ penalty based on Friedman, Hastie, and Tibshirani’s (2010) recommended criterion, that is, the largest value of $λ$ such that the mean cross-validated error is within 1 standard error of the minimum.¹⁴ This process yielded a $λ$ value very close to zero for the sparse and small size sample, a small $λ$ value for the medium size sample, and a large $λ$ value for the larger sample. Thus, Lasso regularization is more conservative—performs less shrinkage—when sample size is smaller. In addition, even the most conservative Lasso solution uses approximately 55 of the 300 possible parameters, which is more parsimonious than almost all the conventional log-linear specifications tested in “Findings with Conventional Approach to Selection of Log-Linear Models” Section.
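The one-standard-error criterion can be sketched in a few lines of Python. The function below takes a grid of candidate penalties and the per-fold cross-validation errors for each, and returns both the error-minimizing penalty and the largest penalty whose mean error lies within one standard error of that minimum. The function name and the toy error values are hypothetical and serve only to illustrate the rule.

```python
import math
import statistics

def one_se_lambda(lambdas, fold_errors):
    """Select penalties from cross-validation results (illustrative sketch).

    lambdas: list of candidate penalty values.
    fold_errors: fold_errors[i] holds the per-fold errors for lambdas[i].
    Returns (lambda_min, lambda_1se).
    """
    means = [statistics.mean(e) for e in fold_errors]
    ses = [statistics.stdev(e) / math.sqrt(len(e)) for e in fold_errors]
    i_min = min(range(len(lambdas)), key=lambda i: means[i])
    threshold = means[i_min] + ses[i_min]
    # Largest penalty whose mean CV error does not exceed the threshold.
    eligible = [lam for lam, m in zip(lambdas, means) if m <= threshold]
    return lambdas[i_min], max(eligible)

# Hypothetical cross-validation errors for three candidate penalties.
toy_lambdas = [0.01, 0.1, 1.0]
toy_errors = [[1.0, 1.2, 1.1], [1.05, 1.25, 1.15], [2.0, 2.1, 2.2]]
lam_min, lam_1se = one_se_lambda(toy_lambdas, toy_errors)
print(lam_min, lam_1se)
```

In this toy case the second penalty has a slightly higher mean error than the first but falls within one standard error of it, so the rule prefers the larger, more parsimonious penalty.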

Second, Figure 3 plots model coefficients estimated through Lasso regularization for Poisson regression for all sample sizes. Coefficients represent the log odds for each combination of partners’ educational level relative to the baseline categories (i.e., both partners with “less than elementary” in 1990). At first glance, we can see that despite their difference in the number of parameters, the general patterns of assortative mating are strikingly similar for the three sample sizes. Findings indicate there is a remarkable increase in the log odds of homogamy between individuals with a college degree. In addition, all three panels show a decrease in the log odds of homogamy between husbands and wives with only elementary education, while minor diagonals are symmetric and remain stable over time. Thus, it can be observed that regardless of sample size and data sparseness, our approach effectively recovered the pattern of assortative mating that we set in the simulation. These results stand in stark contrast to those yielded by the AIC and BIC, which only approximated the true data generating process when the data were very large and nonsparse.

Figure 3.

Monte Carlo experiment. Log odds for combinations of partners’ education according to Lasso coefficients. (A) Small sample, (B) medium size sample, and (C) very large sample. Log odds were calculated from fitted frequencies.

Lastly, in order to evaluate the stability of these solutions, we take advantage of the random nature of the cross-validation procedure used to select the value of $λ$ . More specifically, we repeat the 10-fold cross-validation over 600 iterations, obtaining 600 $λ$ values with their corresponding estimates. Tables A2–A4 in the Online Appendix report the Lasso estimates for all sample sizes, plus the average estimates across iterations and the 95 percent intervals for the empirical distribution of parameters.¹⁵ The estimated coefficients are highly stable, showing limited variation at different $λ$ values. Ultimately, this illustrates the stability of the chosen penalty across the random iterations.
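A summary of this kind can be computed directly from the repeated estimates. The Python sketch below takes the values of a single coefficient across iterations and returns their average together with the 2.5th and 97.5th percentiles of the empirical distribution. The function name and the synthetic estimates are hypothetical; the percentile interpolation follows the default of Python's `statistics.quantiles`, which may differ slightly from the convention used for Tables A2–A4.

```python
import statistics

def stability_summary(estimates):
    """Mean and empirical 95 percent interval of repeated estimates (sketch)."""
    # With n=40 the cut points are spaced 2.5 percentage points apart,
    # so the first and last ones are the 2.5th and 97.5th percentiles.
    cuts = statistics.quantiles(estimates, n=40)
    return {
        "mean": statistics.mean(estimates),
        "lower": cuts[0],
        "upper": cuts[-1],
    }

# Synthetic stand-in for 600 estimates of one coefficient (not real results).
toy_estimates = [0.5 + 0.001 * (i % 21 - 10) for i in range(600)]
toy_summary = stability_summary(toy_estimates)
print(toy_summary)
```

A narrow interval around the average indicates that the randomness of the fold assignment has little influence on the selected penalty and the resulting coefficients.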

Empirical Application: Educational Assortative Mating in Chile 1990–2015

In this section, we empirically demonstrate the use of Lasso regularization for selecting log-linear models by applying this technique to the study of assortative mating. More specifically, we assess how educational assortative mating has changed in Chile over the last 25 years. This country’s sustained reduction of income inequality over the last decades together with its rapid growth in educational attainment, especially postsecondary education, provides a unique scenario to test how marriage patterns change under these transformations. Cross-sectional evidence indicates that Chile features very strong barriers to intermarriage at the top of the educational distribution but a more fluid exchange elsewhere (Torche 2010). Evidence regarding the evolution of assortative mating is, however, scant. An exception is a study by Esteve, McCaa, and López (2013) who, using census data, show that educational homogamy in Chile is highest among college graduates¹⁶ and has increased since the 2000s.

We study assortative mating in Chile using data from the National Socio-Economic Characterization Survey (CASEN), the most commonly used data set for social research in this country. This household survey has been conducted every two or three years since 1985 by the Chilean Government, sampling around 70,000 households each time. Data are representative at the national, regional, urban, and rural levels. For this analysis, we pool all CASEN surveys from 1990 to 2015.¹⁷ We restrict this analysis to the subsample of prevailing marriages and cohabiting couples. Ideally, we would focus only on newlyweds, as prevailing unions are subject to different sources of bias, such as educational upgrading after marriage and selective union dissolution (Schwartz and Mare 2005; Torche 2010). Unfortunately, our data did not allow us to identify these newly formed couples. In addition, we only include couples where male partners are between 30 and 35 years old to ensure that most of the cohort that enters a union is observed as such (Torche 2010). Thus, the total sample size is 55,255 couples or 110,510 individuals.

We measure the educational attainment of each spouse using five categories: “Less than Elementary” (E−), “Elementary Completed or Some High School” (E), “High School Completed or Vocational Degree” (H), “Some College or Technical Degree” (C−), and “College Degree or Higher” (C). We analyze the three-way contingency table resulting from the cross-tabulation of these two variables and survey year.¹⁸ This table has an average of 184 counts per cell and is moderately sparse, with 20 percent of cells having fewer than 10 observations and 10 percent having fewer than five.

In order to characterize the Chilean pattern of assortative mating and its evolution over time, we first use log-linear models for contingency tables. We test several well-known model specifications, each corresponding to a different hypothesis regarding assortative mating. For each model, we report conventional goodness-of-fit statistics. Additionally, we apply Lasso regularization as a data-driven approach to specification and selection of log-linear models. For this, we implement a Poisson regression with Lasso penalties over a saturated model for the contingency table. More specifically, we estimate four versions of this general model. These versions differ in the weights that we apply to the Lasso penalties for each coefficient (see Section A3 in Online Appendix for more details on weighted Lasso penalties). These weights reflect our previous knowledge regarding assortative mating and allow us to impose restrictions over the Lasso penalties on a substantive basis. We examine the robustness of these different shrinkage regimes because in this case, unlike in our Monte Carlo experiment, we do not have a priori knowledge on the data generating process.

Lasso free: Equally weighted Lasso penalties are imposed on all parameters of the saturated model. This is the same procedure we use in our Monte Carlo experiment.

Adaptive Lasso: The adaptive Lasso is an extension of the Lasso. In particular, it applies a weighted penalty that is inversely proportional to the absolute value of an initial estimate of the parameter. The main goal is to favor predictors with previously known importance to avoid spurious selection (Zou 2006; Huang et al. 2008). We use as initial estimates the parameters yielded by the log-linear saturated model, so that parameters with large coefficients are mildly penalized, while parameters with small coefficients are more heavily penalized.¹⁹

Lasso independence: We leave unpenalized all parameters corresponding to marginal distributions, while the remaining parameters are subject to Lasso penalties. We call this model “Lasso independence” because its unpenalized part is equivalent to the log-linear model of independence.

Lasso independence + [WY][HY]: This model extends the restriction imposed in the Lasso independence approach. Here, all parameters corresponding to the marginal distributions, as well as those capturing changes over time in the educational distribution of partners, are not penalized. All remaining parameters are subject to Lasso penalties. The idea is to ensure that parameters capturing educational assortative mating are not biased due to the shrinkage of coefficients that describe the marginal distribution of variables and their changes over time.
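The four shrinkage regimes differ only in the weight attached to each coefficient's penalty, which can be sketched as follows. In this Python illustration, each term of the saturated model is tagged by type, a weight of zero leaves a term unpenalized, and the adaptive weights are the inverse of the absolute initial estimate (an exponent of one is assumed here). The term labels, type tags, initial estimates, and function name are all hypothetical. In software that supports weighted penalties (for example, the `penalty.factor` argument of the R package glmnet), a vector of this kind would be supplied alongside the design matrix.

```python
def penalty_weights(terms, scheme, initial=None, eps=1e-8):
    """Build one Lasso penalty weight per model term (illustrative sketch).

    terms: list of (name, kind), with kind in
           {"marginal", "margin_by_year", "association"}.
    scheme: "free", "adaptive", "independence", or "independence_wy_hy".
    initial: dict of initial estimates by term name (adaptive scheme only).
    """
    weights = []
    for name, kind in terms:
        if scheme == "free":
            w = 1.0  # every term penalized equally
        elif scheme == "adaptive":
            # Large initial estimates receive small penalties, and vice versa.
            w = 1.0 / max(abs(initial[name]), eps)
        elif scheme == "independence":
            w = 0.0 if kind == "marginal" else 1.0
        elif scheme == "independence_wy_hy":
            w = 0.0 if kind in ("marginal", "margin_by_year") else 1.0
        else:
            raise ValueError("unknown scheme: " + scheme)
        weights.append(w)
    return weights

# Hypothetical subset of terms from the saturated model.
toy_terms = [
    ("H=C", "marginal"),
    ("W=C", "marginal"),
    ("Y=2015", "marginal"),
    ("H=C:Y=2015", "margin_by_year"),
    ("H=C:W=C", "association"),
    ("H=C:W=C:Y=2015", "association"),
]
# Hypothetical initial estimates (not taken from our results).
toy_initial = {"H=C": 2.0, "W=C": 1.0, "Y=2015": 0.5,
               "H=C:Y=2015": 0.25, "H=C:W=C": 4.0, "H=C:W=C:Y=2015": 0.1}

for scheme in ("free", "adaptive", "independence", "independence_wy_hy"):
    print(scheme, penalty_weights(toy_terms, scheme, toy_initial))
```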

Unlike in the simulation study presented in “A Monte Carlo Experiment: Simulated Patterns of Educational Assortative Mating” Section, in the empirical study of assortative mating, we do not know the underlying data generating process. Thus, in order to evaluate the descriptive capacity of all models according to a common metric, we implement a cross-validation procedure. Because cross-validation serves the purpose of evaluating the predictive capacity of a model out of sample, we believe that this tool can complement the diagnostic yielded by traditional goodness-of-fit statistics that penalize model complexity (i.e., BIC and AIC).

Findings with conventional approach to selection of log-linear models

We fit different model specifications to describe over time trends of assortative mating. As mentioned, each of these specifications depicts a hypothesis pertaining to marriage patterns across time. Table 4 shows the goodness-of-fit statistics corresponding to each model specification.

Table 4.

Log-linear Models of the Association between Husband’s and Wife’s Educational Attainment (Husbands Aged 30–35): Chile, 1990–2015.

Model	df	G ²	100 × D	AIC	BIC
(1) [W][H][Y]	280	43,984.64	33.66	43,424.64	44,192.12
(2) M1 + [HW][WY][HY]	176	321.81	2.39	−30.19	1,664.96
(3) M2 + [D][DY]	165	273.20	1.75	−56.80	1,736.46
(4) M2 + [d][dY]	121	156.12	0.92	−85.88	2,099.86
(5) M3 + [S][SY]	154	237.30	1.57	−70.70	1,820.68
(6) M3 + [A][AY]	143	215.69	1.49	−70.31	1,919.19
(7) M4 + [S][SY]	110	134.16	0.83	−85.84	2,198.02
(8) M4 + [A][AY]	99	112.27	0.65	−85.73	2,296.24
(9) M2 + [C][CY]	132	180.75	1.30	−83.25	2,004.37
(10) M9 + [D][DY]	121	145.12	1.03	−96.88	2,088.86
(11) M9 + [d][dY]	99	107.77	0.70	−90.23	2,291.75
(12) Linear by Linear	165	240.27	1.99	−89.73	1,703.54
(13) Unidiff	165	300.11	2.22	−29.89	1,883.49

In particular, we see that these statistics do not agree on which is the best model for the Chilean case. According to the BIC, the conditional independence model (model 2) is the best alternative, followed by the linear-by-linear association model (model 12). These specifications represent very different patterns: While in model 2, educational assortative mating is stable over time, in model 12, the linear association between spouses’ education is allowed to vary across years. Alternatively, the AIC indicates a preference for model 10, a crossing model with a constrained major diagonal. Not only do the BIC and AIC not coincide on which is the best model, but they actually yield contradictory results. Indeed, according to the AIC, BIC’s preferred model (model 2) has almost the worst fit across all specifications. Finally, the $D$ shows a preference for a model with unconstrained homogamy and asymmetric movements along the minor diagonals (model 8), and the $G^{2}$ selects a crossing model with an unconstrained major diagonal (model 11).

As mentioned, in a situation like this, the researcher faces the necessity of selecting a model specification on the basis of previous knowledge about the subject. We suspect this leads to decisions that are not always tractable and more susceptible to subjective biases.

Findings with selection of log-linear models via the lasso

We now address the same problem but using Lasso regularization to characterize the pattern of educational assortative mating in Chile. We present results corresponding to the four variants of Lasso models described above. For each model, we report coefficients corresponding to an adaptively chosen value $λ = λ^{*}$ of the tuning parameter. Again, we choose the largest value of $λ$ such that the mean cross-validated error is within 1 standard error of the minimum. Figures A3 and A4 in Online Appendix show the path of regularized coefficients and cross-validation error for each model’s chosen $λ$ .

As can be observed in Tables A5 to A8 in the Online Appendix, all four Lasso solutions yield an estimated vector of coefficients that is highly sparse, using at most 150 of the 300 parameters. This makes Lasso solutions more parsimonious than almost all the conventional model specifications tested earlier. Figure 4 plots the log odds of partners’ education yielded by each model. Higher log odds imply a stronger association between the particular level of education of each spouse conditional on survey year. In general, it can be seen that all four models depict a scenario where assortative mating is mostly explained by educational homogamy and educational heterogamy along the minor diagonals, that is, couples where one member has one extra level of education. As can be seen in the graphs, these associations vary in strength depending on the educational level of the partners. For instance, educational homogamy is consistently stronger among individuals with a college degree, as shown by the purple line in column C in each panel. This is followed by unions where both members have some college (red line, column C− in each panel) and high school degrees (green line, column H in each panel). Similarly, across Lasso models, heterogamy along the minor diagonals is the strongest for couples where husbands have a college degree and wives have some college education (purple line, column C− in each panel). Interestingly, this reveals the presence of educational hypergamy, which is stable across the observed period.

Figure 4.

Log odds for combinations of partners’ education with respect to reference categories, Chilean data. (A) Lasso free, (B) adaptive Lasso, (C) Lasso independence, and (D) Lasso independence + [WY][HY]. Log odds were calculated from fitted frequencies.

Despite these similarities, the main source of discrepancy between different Lasso models has to do with over time changes in these associations, especially the evolution of homogamy among couples with a college degree. In this regard, the first three models (“Lasso free,” “adaptive Lasso,” and “Lasso independence”) indicate that homogamy among college graduates was relatively high and stable between 1990 and the mid-2000s, followed by a more or less sharp increase in the next decade. Some of these specifications also indicate a similar increase among homogamous couples with some college (Lasso free, panel A), and couples where the wife is a college graduate and the husband has some college education (Lasso free, panel A and Lasso Independence, panel C). In contrast, the model in which the marginal distributions and the variables capturing over time educational expansion are not penalized (“Lasso independence + [WY][HY]”) indicates that the levels of homogamy and heterogamy along the minor diagonals remain stable over the entire analyzed period.

Overall, these results depict two different representations of the evolution of educational assortative mating in Chile: one where there are high and rising levels of homogamy and heterogamy among the highly educated, and another where homogamy and heterogamy among the highly educated is also high but stable over time. However, because we do not know the data generating process, we need additional information to decide which model provides the best representation of the underlying mating process. To accomplish this, in the next section, we evaluate the four Lasso models and the conventional log-linear specifications according to their out-of-sample predictive capacity using a repeated $k$ -fold cross-validation procedure.

Our cross-validation procedure

In order to evaluate the performance of the models introduced above, we use a repeated k-fold cross-validation with 10 folds and 10 repetitions. That is, we divide the data into 10 random partitions, fit each model using a training set (9/10 of the data randomly selected), and create predictions in a “testing set” (the unused 1/10 of the data). We compare these predictions to the observed outcome in the testing set and measure the predictive accuracy of each model using the Poisson deviance, a proper loss function for Poisson distributed outcomes. We iterate this entire process 10 times, so that each partition of the data serves as the testing set once. Furthermore, we repeat the whole procedure 10 times in order to prevent the possibility that the randomness of the data partition might affect the results. At the end of the process, we average the cross-validation error metrics computed at each iteration (100 in total), obtaining an overall cross-validation error for each model.
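The two ingredients of this procedure, the loss function and the repeated partitioning, can be sketched in Python as follows. The deviance is written in its standard form, $2\sum_i [y_i \log(y_i/\mu_i) − (y_i − \mu_i)]$, with the first term set to zero when $y_i = 0$. The function names, the seed, and the abstraction of the data into generic "units" are assumptions made for illustration; the scaling of the deviance reported in Table 5 may differ.

```python
import math
import random

def poisson_deviance(observed, predicted):
    """Poisson deviance between observed counts and predicted means."""
    total = 0.0
    for y, mu in zip(observed, predicted):
        term = y * math.log(y / mu) if y > 0 else 0.0
        total += term - (y - mu)
    return 2.0 * total

def repeated_kfold(n_units, k=10, repeats=10, seed=0):
    """Yield (train, test) index lists: k folds, repeated with new shuffles."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_units))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, test

# 10 folds x 10 repetitions give 100 train/test splits to average over.
splits = list(repeated_kfold(n_units=50, k=10, repeats=10))
print(len(splits))
print(poisson_deviance([3, 0], [3.0, 1.0]))
```

In an actual application, each model would be refit on every training set and its deviance on the corresponding testing set stored, with the overall error obtained as the mean of the 100 stored values.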

It can be observed that all models, with the exception of the independence model, perform relatively similarly in terms of predictive accuracy (see Table 5). Among all models, the “Lasso independence + [WY][HY]” has the lowest deviance, closely followed by the “Lasso independence” model. As shown in Figure 4, the representations of the Chilean pattern of educational assortative mating yielded by these two models are remarkably alike (panels C and D). However, they differ regarding the evolution of educational homogamy among college educated couples. While the “Lasso independence” model predicts a smooth increase in homogamy starting in the mid-2000s, the “Lasso independence + [WY][HY]” suggests that such a rise is entirely explained by the population expansion of college graduates (i.e., change in the marginal distribution). This discrepancy is expected since, by construction, the “Lasso independence + [WY][HY]” model does not penalize changes in the marginal distribution of spouses’ education over time, while the “Lasso independence” model is able to shrink these terms. In the particular case of Chile, this model discrepancy might be of special relevance. Indeed, it has been documented that the pattern of educational assortative mating is isomorphic to the pattern of income inequality (Torche 2010) and that income inequality and social immobility are mostly driven by concentration and closure at the very top of the social ladder (Torche 2005). Thus, accurately describing trends in assortative mating among the highly educated population can be crucial for understanding changes in the distribution of resources and opportunities.

Table 5.

Error Metrics for Competing Models from 10 Iterations of 10-fold Cross-Validation.

Model	Poisson Deviance	Average df
(1) [W][H][Y]	68,120.96	280.00
(2) M1 + [HW][WY][HY]	64,229.96	176.00
(3) M2 + [D][DY]	64,227.85	165.00
(4) M2 + [d][dY]	64,222.27	121.00
(5) M3 + [S][SY]	64,223.74	154.00
(6) M3 + [A][AY]	64,224.50	143.00
(7) M4 + [S][SY]	64,224.21	110.00
(8) M4 + [A][AY]	64,224.71	99.00
(9) M2 + [C][CY]	64,223.68	132.00
(10) M9 + [D][DY]	64,226.51	121.00
(11) M9 + [d][dY]	64,225.86	99.00
(12) Linear by Linear	64,410.71	66.00
(13) Unidiff	64,228.61	165.00
(14) Lasso free	64,789.09	183.23
(15) Lasso adaptive	64,733.52	230.14
(16) Lasso indep	64,219.88	163.66
(17) Lasso indep + [WY][HY]	64,167.56	159.02

Note. W = wife's education; H = husband's education; Y = year; Diag = diagonal constrained; d = main diagonal unconstrained; S = symmetric movements minor diagonal; A = asymmetric movements minor diagonal; C = crossing parameters.

In order to resolve this discrepancy, we examine a traditional log-linear model specification able to capture the main features detected by the Lasso, with the advantage of yielding unbiased estimates. Such a specification combines the inductive knowledge gained by the Lasso with the unbiased estimates provided by conventional Poisson models. This strategy has also been successfully applied for the case of OLS models by Belloni and Chernozhukov (2013), where OLS post-Lasso estimates have a smaller bias. In our application, the two preferred Lasso models yield a common pattern of assortative mating: different levels of homogamy by educational level and heterogamy along the minor diagonals. Yet, as mentioned, they differ in whether there are changes over time among college educated couples. In this case, a model with an unconstrained major diagonal and asymmetric movements along the minor diagonals (model 8) theoretically captures the main commonalities of the two preferred Lasso models.

Figure 5 plots the patterns of assortative mating generated by this model. These trends confirm that a higher extent of educational homogamy exists among couples with high levels of education, as well as heterogamy along the minor diagonals.²⁰ Importantly, it suggests a slight increase in homogamy among college graduates, which leads us to favor the results yielded by the “Lasso independence” model over the “Lasso independence + [WY][HY].” We arrive at this conclusion by combining the insights of both traditional log-linear specifications and Lasso models. It is especially reassuring that these findings regarding educational homogamy at the top, and its evolution over time, are consistent with those reported by previous research using Chilean census data (Esteve et al. 2013). Lastly, it is important to note that the sole inspection of conventional goodness-of-fit statistics does not provide clear evidence to select model 8 on a statistical basis. The only statistic that preferred this specification was the Dissimilarity Index ( $D$ ), which is rarely used as the main selection criterion. Furthermore, if we had followed the often used BIC, we would have chosen model 2, leading us to characterize educational assortative mating trends as stable over time. This illustrates the substantive implications of solely relying on conventional goodness-of-fit statistics for model selection.

Figure 5.

Log odds for combinations of partners’ education with respect to reference categories, Chilean data. Parameters from log-linear model CI + [Diag][DiagY][Asym][AsymY]. Log odds were calculated from fitted frequencies.

Thus, our empirical analysis demonstrates that researchers could apply the insights from the Lasso approach to inductively set a log-linear model specification. In particular, given that the Lasso adds bias to the estimates in order to reduce variance, the patterns of assortative mating shown in the regularized regression coefficients can be used to inform the specification of a regular Poisson model. The latter will yield unbiased estimates while reducing the risk of misspecification.
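The two-step logic described here can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used in the article: the 4 × 4 table is made up, the penalty value `lam=0.02` is arbitrary rather than chosen by cross-validation, only the association terms are penalized, and the Lasso step is a simple proximal gradient loop written for this example rather than an established solver such as the coordinate descent algorithm of Friedman, Hastie, and Tibshirani (2010). All function and variable names are hypothetical.

```python
import numpy as np

# Hypothetical 4x4 husband (rows) by wife (columns) education table.
counts = np.array([[300,  80,  30,  10],
                   [ 90, 250,  90,  25],
                   [ 30, 100, 280,  80],
                   [  8,  30,  90, 260]], dtype=float)
k = counts.shape[0]
rows, cols = np.indices((k, k))
rows, cols, y = rows.ravel(), cols.ravel(), counts.ravel()

# Unpenalized block: intercept plus row and column margins (independence).
base = [np.ones(k * k)]
base += [(rows == r).astype(float) for r in range(1, k)]
base += [(cols == c).astype(float) for c in range(1, k)]
# Penalized block: one term per main-diagonal cell and one per minor diagonal.
assoc_names = [f"diag{r + 1}" for r in range(k)] + ["upper", "lower"]
assoc = [((rows == r) & (cols == r)).astype(float) for r in range(k)]
assoc += [(cols == rows + 1).astype(float), (cols == rows - 1).astype(float)]
X = np.column_stack(base + assoc)
n_base = len(base)

def soft_threshold(v, t):
    """Shrink toward zero by t and set values smaller than t to exactly zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_poisson(X, y, lam, n_free, step=0.1, n_iter=20000):
    """Proximal gradient on (1/N) * Poisson negative log-likelihood
    plus lam * L1 norm of the coefficients after the first n_free columns.
    Fixed iteration budget and no convergence check: a sketch, not a solver."""
    n = y.sum()
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())
    for _ in range(n_iter):
        grad = X.T @ (np.exp(X @ beta) - y) / n
        beta = beta - step * grad
        beta[n_free:] = soft_threshold(beta[n_free:], step * lam)
    return beta

def fit_poisson(X, y, n_iter=50):
    """Unpenalized Poisson maximum likelihood via IRLS."""
    mu = y + 0.5
    eta = np.log(mu)
    for _ in range(n_iter):
        z = eta + (y - mu) / mu          # working response
        XtW = X.T * mu                   # X' W with weights W = diag(mu)
        beta = np.linalg.solve(XtW @ X, XtW @ z)
        eta = X @ beta
        mu = np.exp(eta)
    return beta

def deviance(y, mu):
    return 2.0 * np.sum(y * np.log(y / mu) - (y - mu))

# Step 1: the Lasso selects association terms (with shrunken coefficients).
beta_lasso = lasso_poisson(X, y, lam=0.02, n_free=n_base)
selected = np.flatnonzero(beta_lasso[n_base:] != 0.0)
# Step 2: refit an ordinary Poisson model on the margins and selected terms.
keep = np.r_[np.arange(n_base), n_base + selected]
beta_post = fit_poisson(X[:, keep], y)

mu_lasso = np.exp(X @ beta_lasso)
mu_post = np.exp(X[:, keep] @ beta_post)
print("selected terms:", [assoc_names[j] for j in selected])
print("Lasso coefficients:", np.round(beta_lasso[keep][n_base:], 3))
print("refit coefficients:", np.round(beta_post[n_base:], 3))
print("deviance (Lasso, refit):", deviance(y, mu_lasso), deviance(y, mu_post))
```

Comparing the two printed coefficient vectors shows how the penalty shrinks the selected terms and how the unpenalized refit removes that shrinkage. Because the refit maximizes the likelihood over the same set of terms, its deviance cannot exceed that of the Lasso fit.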

Final Remarks

In this article, we introduce an innovative approach to model specification and selection based on Lasso regularization. Importantly, in situations where conventional fit statistics provide equivocal diagnostics, our approach has the virtue, relative to ad hoc specification searches, of offering a principled statistical criterion to inductively select an appropriate model. In addition, this approach can assist researchers as a discovery tool in the process of deciding which model specifications will be compared via goodness-of-fit statistics.

We illustrate our proposed approach in two steps. First, we implement a Monte Carlo simulation to compare the performance of Lasso and conventional fit statistics. Findings show that our method recovers the simulated pattern of assortative mating, which was not the case with the conventional goodness-of-fit statistics. Second, we apply our method to an empirical case where conventional goodness-of-fit statistics yield inconsistent diagnostics. We demonstrate how our application of Lasso provides a systematic procedure to inductively decide on a model specification under these circumstances. Different Lasso solutions led to consistent model specifications, which were both predictive—according to cross-validation—and parsimonious.

In addition, we demonstrate how our proposed method could complement conventional approaches to log-linear models for contingency tables. Such an approach embraces McFarland, Lewis, and Goldberg’s (2016) idea of “forensic social science,” combining both inductive and theory-driven methods to gain insights about complex assortative mating patterns in a statistically informed fashion.

Nevertheless, it is important to note some caveats regarding this approach. First, like all regularization methods, the Lasso introduces bias into the estimates in exchange for a reduction in variance. Second, Lasso solutions run the risk of overlooking specific effects, especially if their size is small. This can be problematic when researchers are interested in the value of particular coefficients. Third, if misused, it can lead to excessive reliance on automated model selection without paying proper attention to theory. For these reasons, we recommend evaluating the results of our Lasso-based method in combination with those yielded by conventional model selection approaches (e.g., BIC). Lastly, this approach is best suited to discovering log-linear specifications of the topological family. Ordinal models, such as the linear-by-linear or log-multiplicative layer effect model, are not included in the model space of our Lasso-based approach, and thus they cannot be directly discovered. Nevertheless, researchers can use cross-validation to assess whether the Lasso-based solution performs better than other types of models (including ordinal models).
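A cross-validated comparison of this kind can be sketched as follows. The sketch rests on several assumptions that are not taken from the article: the 4 × 4 table is made up, couples are assigned to five folds at random, the held-out loss is the Poisson deviance, and the three candidate specifications (independence, a topological model with a free main diagonal, and a linear-by-linear model with integer scores) merely stand in for a Lasso-based solution and an ordinal competitor. All names are hypothetical.

```python
import numpy as np

# Hypothetical 4x4 husband (rows) by wife (columns) education table.
counts = np.array([[300,  80,  30,  10],
                   [ 90, 250,  90,  25],
                   [ 30, 100, 280,  80],
                   [  8,  30,  90, 260]])
k = counts.shape[0]
rows, cols = np.indices((k, k))
rows, cols, y = rows.ravel(), cols.ravel(), counts.ravel()

def design(kind):
    """Design matrix: margins plus the association terms of each candidate."""
    base = [np.ones(k * k)]
    base += [(rows == r).astype(float) for r in range(1, k)]
    base += [(cols == c).astype(float) for c in range(1, k)]
    if kind == "diagonal":            # topological: free main diagonal
        extra = [((rows == r) & (cols == r)).astype(float) for r in range(k)]
    elif kind == "linear_by_linear":  # ordinal: product of integer scores
        extra = [(rows * cols).astype(float)]
    else:                             # independence
        extra = []
    return np.column_stack(base + extra)

def fit_poisson(X, y, n_iter=50):
    """Unpenalized Poisson maximum likelihood via IRLS."""
    mu = y + 0.5
    eta = np.log(mu)
    for _ in range(n_iter):
        z = eta + (y - mu) / mu
        XtW = X.T * mu
        beta = np.linalg.solve(XtW @ X, XtW @ z)
        eta = X @ beta
        mu = np.exp(eta)
    return beta

def heldout_deviance(y_test, mu):
    """Poisson deviance of held-out counts; empty cells contribute only mu."""
    pos = y_test > 0
    log_part = np.sum(y_test[pos] * np.log(y_test[pos] / mu[pos]))
    return 2.0 * (log_part - np.sum(y_test - mu))

def cross_validate(X, y, n_folds=5, seed=0):
    """Average held-out deviance over folds of individual couples."""
    rng = np.random.default_rng(seed)
    cell_of_couple = np.repeat(np.arange(y.size), y)  # one entry per couple
    fold = rng.permutation(cell_of_couple.size) % n_folds
    total = 0.0
    for f in range(n_folds):
        train = np.bincount(cell_of_couple[fold != f], minlength=y.size)
        test = np.bincount(cell_of_couple[fold == f], minlength=y.size)
        train, test = train.astype(float), test.astype(float)
        beta = fit_poisson(X, train)
        # Rescale training-based predictions to the size of the test fold.
        mu_test = np.exp(X @ beta) * test.sum() / train.sum()
        total += heldout_deviance(test, mu_test)
    return total / n_folds

# The same seed gives every candidate the same folds.
cv = {name: cross_validate(design(name), y)
      for name in ("independence", "diagonal", "linear_by_linear")}
for name, value in cv.items():
    print(f"{name}: mean held-out deviance = {value:.1f}")
```

A lower held-out deviance indicates better out-of-sample prediction. Because every candidate here is an ordinary Poisson model evaluated on the same folds, the comparison does not depend on whether a specification belongs to the topological or the ordinal family.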

We conclude by underscoring some research areas, outside assortative mating, for which Lasso regularization could be a useful method for sociologists. A natural application of our method would be to the study of social and occupational mobility. While log-linear models have been the preferred analytical strategy in this area, in recent years, the complexity of new occupational classifications (Jonsson et al. 2009; Weeden and Grusky 2012) has generated contingency tables that are typically high dimensional and sparse, compromising the performance of conventional goodness-of-fit statistics. We think that our proposed method would provide a direct solution to these issues, as Lasso regularization has especially desirable properties as a model selection tool under conditions of sparse data. In addition, scholars have used log-linear models to analyze migration patterns across regions (Little and Raymer 2013; Raymer and Rogers 2007; Willekens 2016). For this purpose, a parsimonious representation of the migration structure is chosen using conventional goodness-of-fit statistics. Thus, if one model specification fits the data well, this model is used to indirectly estimate migration flows. As in the case of assortative mating, under some circumstances, fit statistics might not agree in their diagnostics, and thus implementing our approach as a principled statistical criterion to inductively specify and select a model would be particularly helpful. More broadly, Lasso regularization offers a potential solution to concerns about information asymmetry and uncertainty in model selection decisions in sociology (Young 2009; Young and Holsteen 2017). As theories can be tested in a myriad of ways, model decisions may have paramount implications for the researcher’s results. In this vein, the implementation of a principled statistical criterion to inductively select an appropriate model via the Lasso could help increase the tractability and transparency of model selection decisions. Future work could further elaborate on the details of such a procedure for a broader set of modeling techniques.

Supplemental Material

Supplemental Material, Appendix for Lasso Regularization for Selection of Log-linear Models: An Application to Educational Assortative Mating by Mauricio Bucca, and Daniela R. Urbina in Sociological Methods & Research

Footnotes

Acknowledgments

We are grateful for helpful feedback on this work from Jeremy Cohen, Dan J. DellaPosta, Fedor A. Dokshin, Adeline Lo, Ian Lundberg, Mario Molina, Radu Pârvulescu, Brandon Stewart, and Martin T. Wells. This paper also benefited from feedback received at the PAA Annual Meeting 2017, the RC28 Summer Meeting 2017, and the Annual Popfest Conference 2017. Finally, we want to thank three anonymous reviewers for their valuable comments.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Research reported in this publication was supported by The Eunice Kennedy Shriver National Institute of Child Health & Human Development of the National Institutes of Health under Award Number P2CHD047879. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Supplemental Material

Supplemental material for this article is available online.

Notes

References

Agresti. 2002. Categorical Data Analysis. New Jersey: Wiley-Interscience.

Atkinson, A. C. 1978. “Posterior Probabilities for Choosing a Regression Model.” Biometrika 65:39–48.

Belloni and Chernozhukov. 2013. “Least Squares after Model Selection in High-dimensional Sparse Models.” Bernoulli 19:521–47.

Belloni, Chernozhukov, and Hansen. 2013. “Inference on Treatment Effects after Selection among High-dimensional Controls.” Review of Economic Studies 81:608–50.

Burnham, K. P., and Anderson, D. R. 2004. “Multimodel Inference: Understanding AIC and BIC in Model Selection.” Sociological Methods & Research 33:261–304.

Chen. 2012. “Extended BIC for Small-n-large-p Sparse GLM.” Statistica Sinica 22:555–74.

Chiquet, Grandvalet, and Rigaill. 2016. “On Coding Effects in Regularized Categorical Regression.” Statistical Modelling 16:228–37.

Clogg, C. C. 1982. “Some Models for the Analysis of Association in Multiway Cross-classifications Having Ordered Categories.” Journal of the American Statistical Association 77:803.

Duncan, O. D. 1979. “How Destination Depends on Origin in the Occupational Mobility Table.” American Journal of Sociology 84:793.

Dziak, J. J., Coffman, D. L., and Lanza, S. T. 2015. “Sensitivity and Specificity of Information Criteria.” PeerJ Preprints 1:1–20.

Erikson and Goldthorpe, J. H. 1992. The Constant Flux: A Study of Class Mobility in Industrial Societies. Oxford, United Kingdom: Clarendon Press.

Esteve, McCaa, and López. 2013. “The Educational Homogamy Gap between Married and Cohabiting Couples in Latin America.” Population Research and Policy Review 32:81–102.

Fitzmaurice and Goldthorpe, J. H. 1997. “Adjusting for Overdispersion in an Analysis of Comparative Social Mobility.” Sociological Methods & Research 25:267–83.

Friedman, Hastie, and Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33:1–22.

Grusky, D. B., and Hauser, R. M. 1984. “Comparative Social Mobility Revisited: Models of Convergence and Divergence in 16 Countries.” American Sociological Review 49:19–38.

Gullickson. 2005. “The Significance of Color Declines: A Re-analysis of Skin Tone Differentials in Post-civil Rights America.” Social Forces 84:157–80.

Gullickson and Torche. 2014. “Patterns of Racial and Educational Assortative Mating in Brazil.” Demography 51:835–56.

Hastie, Tibshirani, and Friedman. 2009. The Elements of Statistical Learning. New York: Springer Series in Statistics.

Hauser, R. M. 1980. “Some Exploratory Methods for Modeling Mobility Tables and Other Cross-classified Data.” Sociological Methodology 11:413–58.

Hout. 1983. Analyzing Mobility Tables. Beverly Hills, CA: Sage.

Hout. 1984. “Occupational Mobility of Black Men: 1962 to 1973.” American Sociological Review 49:308–22.

Huang and Zhang, C.-H. 2008. “Adaptive Lasso for Sparse High-dimensional Regression Models.” Statistica Sinica 18:1603–18.

Hurvich, C. M., and Tsai, C. L. 1989. “Regression and Time Series Model Selection in Small Samples.” Biometrika 76:297–307.

Jaggi. 2014. “An Equivalence between the Lasso and Support Vector Machines.” Pp. 1–26 in Regularization, Optimization, Kernels, and Support Vector Machines, 1st ed., edited by Suykens, J. A., Signoretto, and Argyriou. New York: Chapman and Hall/CRC.

James, Witten, Hastie, and Tibshirani. 2013. An Introduction to Statistical Learning. New York: Springer Texts in Statistics.

Jonsson, Grusky, Di Carlo, Pollak, and Brinton. 2009. “Microclass Mobility: Social Reproduction in Four Countries.” American Journal of Sociology 114:977–1036.

Kuha. 2004. “AIC and BIC Comparisons of Assumptions and Performance.” Sociological Methods & Research 33:188–229.

Leeb and Pötscher, B. M. 2008. “Sparse Estimators and the Oracle Property, or the Return of Hodges’ Estimator.” Journal of Econometrics 142:201–11.

Little and Raymer. 2013. “Log-linear Models of Migration Flows.” Pp. 403–19 in Tools for Demographic Estimation, edited by Moultrie, Dorrington, Hill, Timæus, and Zaba. Paris, France: International Union for the Scientific Study of Population.

Luo, Hodges, Winship, and Powers. 2016. “The Sensitivity of the Intrinsic Estimator to Coding Schemes: Comment on Yang, Schulhofer-Wohl, Fu, and Land.” American Journal of Sociology 112:881–94.

Mare, R. D. 1991. “Five Decades of Educational Assortative Mating.” American Sociological Review 56:15–32.

Mare, R. D., and Schwartz, C. R. 2006. “Educational Assortative Mating and the Family Background of the Next Generation.” Sociological Theory and Methods 21:253–78.

McFarland, D. A., Lewis, and Goldberg. 2016. “Sociology in the Era of Big Data: The Ascent of Forensic Social Science.” The American Sociologist 47:12–35.

Meier, Van De Geer, and Bühlmann. 2008. “The Group Lasso for Logistic Regression.” Journal of the Royal Statistical Society. Series B: Statistical Methodology 70:53–71.

Powers, D. A., and Xie. 2000. Statistical Methods for Categorical Data Analysis. San Diego, CA: Academic Press.

Qian. 1997. “Breaking the Racial Barriers: Variations in Interracial Marriage between 1980 and 1990.” Demography 34:263–76.

Raftery. 1995. “Bayesian Model Selection in Social Research.” Sociological Methodology 25:111–69.

Raymer and Rogers. 2007. “Using Age and Spatial Flow Structures in the Indirect Estimation of Migration Streams.” Demography 44:199–223.

Raymo, J. M., and Xie. 2000. “Temporal and Regional Variation in the Strength of Educational Homogamy.” American Sociological Review 65:773–81.

Schwartz, C. R. 2010. “Earnings Inequality and the Changing Association between Spouses’ Earnings.” American Journal of Sociology 115:1524–57.

Schwartz, C. R. 2013. “Trends and Variation in Assortative Mating: Causes and Consequences.” Annual Review of Sociology 39:451–70.

Schwartz, C. R., and Mare, R. D. 2005. “Trends in Educational Assortative Marriage from 1940 to 2003.” Demography 42:621–46.

Schwartz, C. R., and Mare, R. D. 2012. “The Proximate Determinants of Educational Homogamy: The Effects of First Marriage, Marital Dissolution, Remarriage, and Educational Upgrading.” Demography 49:629–50.

Schwartz, C. R., Zeng, and Xie. 2016. “Marrying up by Marrying Down: Status Exchange between Social Origin and Education in the United States.” Sociological Science 3:1003–27.

Tibshirani. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society 58:267–88.

Tibshirani. 1997. “The Lasso Method for Variable Selection in the Cox Model.” Statistics in Medicine 16:385–95.

Tilly. 1999. Durable Inequality. Berkeley: University of California Press.

Torche. 2005. “Unequal but Fluid: Social Mobility in Chile in Comparative Perspective.” American Sociological Review 70:422–50.

Torche. 2010. “Educational Assortative Mating and Economic Inequality: A Comparative Analysis of Three Latin American Countries.” Demography 47:481–502.

Tutz and Gertheiss. 2016. “Regularized Regression for Categorical Data.” Statistical Modelling 16:161–200.

Weakliem, D. L. 1999. “A Critique of the Bayesian Information Criterion for Model Selection.” Sociological Methods & Research 27:359–97.

Weakliem, D. L. 2004. “Introduction to the Special Issue on Model Selection.” Sociological Methods & Research 33:167–87.

Weeden, K. A., and Grusky, D. B. 2005. “The Case for a New Class Map.” American Journal of Sociology 111:141–212.

Weeden, K. A., and Grusky, D. B. 2012. “The Three Worlds of Inequality.” American Journal of Sociology 117:1723–85.

Willekens. 1983. “Log-linear Modelling of Spatial Interaction.” Papers in Regional Science 52:187–205.

Willekens. 2016. “Migration Flows: Measurement, Analysis and Modeling.” Pp. 225–41 in International Handbook of Migration and Population Distribution, edited by White, M. J. Dordrecht, the Netherlands: Springer.

Xie. 1992. “The Log-multiplicative Layer Effect Model for Comparing Mobility Tables.” American Sociological Review 57:380–95.

Young. 2009. “Model Uncertainty in Sociological Research: An Application to Religion and Economic Growth.” American Sociological Review 74:380–97.

Young and Holsteen. 2017. “Model Uncertainty and Robustness: A Computational Framework for Multimodel Analysis.” Sociological Methods & Research 46:3–40.

Yuan and Lin. 2006. “Model Selection and Estimation in Regression with Grouped Variables.” Journal of the Royal Statistical Society 68:49–67.

Zou. 2006. “The Adaptive Lasso and Its Oracle Properties.” Journal of the American Statistical Association 101:1418–29.

Zou and Hastie. 2005. “Regularization and Variable Selection via the Elastic-net.” Journal of the Royal Statistical Society 67:301–20.
