Abstract
Notwithstanding a large body of literature on log-linear models and odds ratios, no general marginal-free index of the association in a contingency table has gained wide acceptance. Building on a framework developed by L. A. Goodman, we bring to light the direct links between odds ratios, the Altham index, the intrinsic association coefficient, and coefficients in log-multiplicative models including Unidiff and row-column association models. We devise a normalized version of the intrinsic association coefficient varying between 0 and 1, which offers a simpler interpretation than existing indices similar to the correlation coefficient. Using the case of educational and socioeconomic homogamy among 149 European regions, we illustrate how this index can be used either alone in a non- or semiparametric approach or combined with models, and how it can protect against incorrect conclusions based on models which rely on strong assumptions to summarize the strength of association as a single parameter.
Keywords
Despite the existence of a substantial and long-standing literature on odds ratios and log-linear models, it is surprising that no general marginal-free index of the association between categorical variables has become standard. While a number of indices based on Pearson contingencies (such as the mean square contingency coefficient
This lack is particularly striking in studies of intergenerational social mobility or homogamy, which rely heavily on contingency table analysis. Indeed, in recent years the economic literature on intergenerational mobility has grown considerably, focusing on income rather than on categorical measures such as social class. One of the strengths of this literature lies in its methodological unification, with the use of intergenerational income elasticities or correlations as standard tools. We suggest that the lack of such a standard index is a comparative disadvantage for sociology in this field. As Blanden (2013) notes in her comparison of the two approaches: “It is one of the disadvantages of the social class literature that there is not a more intuitive summary measure of mobility; for the purpose of this summary we would benefit greatly from a single mobility parameter for each nation and point in time, which could be easily compared with the measures for income and education mobility.” (p. 44)
The objective of the present article is two-fold. First, we would like to highlight the value of several related association indices based on the odds ratio. To this end, we will mobilize the framework established in a major article by Goodman (1996) that sought to reconcile two opposed traditions: on the one hand, Pearson contingencies
As a second goal, we will try to help the adoption of odds ratio-based indices by making them easier to use. Indeed, one possible reason for the lack of success of the Altham index may be difficulty in interpreting it. In an attempt to alleviate this issue, we will first show that the intrinsic association coefficient and the Altham index are equal (up to a multiplicative factor) to the logarithm of the (geometric) standard deviation of all the odds ratios that can be computed from a table. Then, we will propose a new normalization of the intrinsic association coefficient varying between 0 and 1, similar to the well-known Pearson correlation coefficient, mean square contingency coefficient
We begin by presenting the general framework as established by Goodman, unifying Pearson’s approach and the odds ratio approach, so as to derive the intrinsic association coefficient, an index of the general intensity of association insensitive to the table’s margins and dimensions. Then, we show that this index is directly related to the standard deviation of all log-odds ratios in a table and therefore to the Altham index. We then develop the strong relation between these indices derived from the odds ratio and standard log-multiplicative models: Unidiff, row-column association model (RC(M)), and row-column association model with layer effect (RC(M)-L). Finally, we illustrate the interest of these association indices for the analysis of the determinants of socioeconomic homogamy in 149 regions of the European Union, using the R package logmult (Bouchet-Valat 2018b) for the estimations.
General Framework Unifying the Pearson and Odds Ratios Traditions
We begin by presenting the general framework established by Goodman (1996) for the analysis of the association in a contingency table, using that article’s terminology and notation. The article systematizes results the author gradually developed in a series of papers (in particular, Goodman 1986, 1991). We then briefly show how this framework can be used to revisit the Pearson approach from a new perspective, and finally we derive the intrinsic association coefficient as a general measure of the association in a contingency table according to the odds ratio tradition.
Preliminary Definitions
Let
Using this notation, let us define the Pearson ratio, also known as the “independence ratio,” “mobility ratio,” or “homogamy ratio” in the sociological literature:
with
We may then define the unweighted interaction corresponding to a given cell as
Similarly, we may define the weighted interaction as
Although for simplicity’s sake we use weighting equal to the table’s marginal proportions, any strictly positive set of weights summing to unity may be used. This applies in particular to uniform weights equal, respectively, to
In order to obtain a general measure of the intensity of association within a table, Goodman (1996:7) proposed a generalized index of nonindependence. In its unweighted version, denoted
And in its weighted version, denoted
Instead of marginal weighting, the uniform weighting already mentioned above provides a third often-used version of the coefficient, denoted
An interesting property of the various versions of the nonindependence index is that the square of the index, defined as a sum of contributions per cell, can also be broken down into contributions per row and column.
We first show that the weighted version of the index, denoted
Pearson’s Mean Square Contingency Coefficient
If we define R as the identity function, that is,
In this case, we find that the values, using Goodman’s terminology, are the Pearson contingencies, or the square roots of the Pearson residuals. The general index of nonindependence defined in equation 6 as
where N is the sum of counts in the table.
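As a numerical illustration of this special case, the following minimal sketch computes the marginal-weighted sum of squared Pearson contingencies and compares it with the chi-square statistic divided by N. The function names and the example counts are hypothetical, and the sketch assumes the weighted index is the sum over cells of the product of the marginal proportions times the squared contingency, which is the standard definition of the mean square contingency.

```python
import numpy as np


def mean_square_contingency(counts):
    """Sum over cells of P_i+ * P_+j * (R_ij - 1)^2, with R_ij the Pearson ratio."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    row = p.sum(axis=1, keepdims=True)   # row marginal proportions P_i+
    col = p.sum(axis=0, keepdims=True)   # column marginal proportions P_+j
    ratio = p / (row * col)              # Pearson ratios R_ij
    return float((row * col * (ratio - 1.0) ** 2).sum())


def chi_square(counts):
    """Usual Pearson chi-square statistic of independence."""
    counts = np.asarray(counts, dtype=float)
    expected = (counts.sum(axis=1, keepdims=True)
                * counts.sum(axis=0, keepdims=True) / counts.sum())
    return float(((counts - expected) ** 2 / expected).sum())


# Hypothetical 3 x 3 table of counts, for illustration only.
table = np.array([[30, 10, 5], [12, 25, 8], [4, 9, 27]])
phi2 = mean_square_contingency(table)
chi2_over_n = chi_square(table) / table.sum()
```

The two quantities `phi2` and `chi2_over_n` agree up to floating-point error, which is the identity stated in the text.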
Intrinsic Association Coefficient
If we instead define R as the natural logarithm, that is,
It can be shown that the
Therefore, the odds ratio contrasting rows
Like the odds ratio, the
We obtain a similar equation for the weighted interaction:
However, in that case, the
It is also interesting to note that the marginal-weighted version of the index does not change if two rows (respectively, columns) with the same conditional distributions are combined (Goodman 1996:425). This property is also possessed by Pearson’s mean square contingency coefficient
We call the index derived in this section the intrinsic association coefficient, following the terminology introduced by Goodman (1981a, 1985, 1986, 1991) for association models. The link between the definition of the index presented above and log-linear and log-multiplicative models, justifying this terminology, will be developed below. Let us note, however, that an application of this index to logistic regression has previously been called
Normalized Intrinsic Association Coefficient
The intrinsic association coefficient is expressed on the scale of the logarithm of odds ratios: It equals zero when independence holds and has no upper limit. Therefore, although it possesses the desired marginal-independence property and is thus a useful tool, it does not make it easy to assess the strength of the association, given that the log-odds ratio scale is not familiar to a wide audience. We propose normalizing the coefficient so that it follows the well-known scale from 0 to 1 used by the Pearson correlation coefficient, the mean square contingency coefficient
This transformation of the intrinsic association coefficient from a scale from 0 to infinity to a scale from 0 to 1 is not just an artificial device to make the index appear more familiar. Indeed, it has been shown that there exists a direct relationship between the marginal-weighted intrinsic association coefficient and the correlation coefficient in the special case when frequencies in a contingency table are distributed according to a discretized bivariate normal distribution (Becker 1989; Becker and Clogg 1988; Goodman 1981b, 1985). We can therefore define a normalized version of the intrinsic association coefficient with uniform weighting
and with arbitrary weighting
The normalized and nonnormalized versions of the intrinsic association coefficient are very close for values below 0.3: A normalized intrinsic association coefficient of 0.3 corresponds to a (nonnormalized) intrinsic association coefficient of 0.33. The difference then increases quickly beyond that limit: 0.5 corresponds to 0.67, 0.7 to 1.37, and 0.9 to 4.74. For the special case of the complete absence of association, we define in accordance with the limit that
In practical use, the normalized index can be preferred when reporting results to ease interpretation. However, the nonnormalized index is more appropriate in contexts where the absence of an upper bound is an advantage, notably when the strength of the association is used as the dependent variable in a linear model, as we will illustrate below.
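The correspondence between the two scales can be sketched numerically. The normalization formula itself is not reproduced in this text, so the code below assumes the bivariate normal relation in which a normalized value ρ corresponds to a nonnormalized value ρ / (1 − ρ²); this assumption reproduces the pairs quoted above (0.3 and 0.33, 0.5 and 0.67, 0.7 and 1.37, 0.9 and 4.74). The function names are hypothetical.

```python
import math


def denormalize_iac(rho):
    """Nonnormalized coefficient for a normalized value rho in [0, 1).

    Assumed mapping: phi = rho / (1 - rho^2).
    """
    return rho / (1.0 - rho * rho)


def normalize_iac(phi):
    """Inverse mapping: positive root of phi * rho^2 + rho - phi = 0.

    Returns 0 for phi = 0, in accordance with the limit.
    """
    if phi == 0:
        return 0.0
    return (math.sqrt(1.0 + 4.0 * phi * phi) - 1.0) / (2.0 * phi)


# Pairs (normalized, nonnormalized) recomputed from the assumed mapping.
pairs = [(rho, round(denormalize_iac(rho), 2)) for rho in (0.3, 0.5, 0.7, 0.9)]
```

Under this assumption the mapping is close to the identity for small values and diverges as the normalized coefficient approaches 1, which is why the nonnormalized scale suits linear models better.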
A Derivation of the Intrinsic Association Coefficient from Odds Ratios
We have shown in the previous section that the intrinsic association coefficient could be considered as the odds ratio-based equivalent of the Pearson mean square contingency coefficient
For the purposes of the demonstration, let us define the standard odds ratio (SOR) as the geometric standard deviation of all the odds ratios that can be calculated for a table. Just as the intrinsic association coefficient defined at equations 6 and 7 equals the standard deviation of interaction coefficients
One way of enumerating all the odds ratios that can be constructed from an
Using Uniform Weighting
We can now define the SOR with uniform weighting, and then in the next section generalize it for arbitrary weights:
It appears that the quadruple sum in equation 16 is actually equal to the square of the Altham index measuring the distance of a two-way table from independence (Altham 1970; Altham and Ferrie 2007). We will return to this below.
The uniform-weighted SOR can also be expressed in terms of the intrinsic association coefficient. Indeed, since the
So we find that the intrinsic association coefficient with uniform weighting, defined in equation 7, is equal to half the logarithm of the geometric standard deviation of all the table’s odds ratios (here expressed as the standard deviation of the log-odds ratios):
This result also allows deriving the very direct relation between the Altham index
While being very close to the Altham index, the intrinsic association coefficient offers a significant advantage over its competitor: It is insensitive to the dimension of the table, that is, using a larger number of categories does not mechanically increase the value of the index. In that regard, the intrinsic association coefficient has a similar relationship to the Altham index as Cramér’s V to the mean square contingency coefficient
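The relations described in this section can be checked by brute force on a small table. The sketch below rests on assumed definitions, since the equations are not reproduced here: the uniform-weighted coefficient is taken as the root mean square of the double-centered log proportions, and the Altham index as the square root of the sum of squared log-odds ratios over all quadruples of rows and columns. All names and the example table are hypothetical.

```python
import itertools
import math

import numpy as np


def double_center(log_table):
    """Remove row means, column means, and add back the grand mean."""
    return (log_table
            - log_table.mean(axis=1, keepdims=True)
            - log_table.mean(axis=0, keepdims=True)
            + log_table.mean())


def iac_uniform(table):
    """Uniform-weighted intrinsic association coefficient (assumed definition)."""
    lam = double_center(np.log(np.asarray(table, dtype=float)))
    return math.sqrt((lam ** 2).mean())


def all_log_odds_ratios(table):
    """Log-odds ratios for every pair of rows (i, k) and pair of columns (j, l)."""
    logp = np.log(np.asarray(table, dtype=float))
    rows, cols = range(logp.shape[0]), range(logp.shape[1])
    return np.array([
        logp[i, j] + logp[k, l] - logp[i, l] - logp[k, j]
        for i, k in itertools.product(rows, repeat=2)
        for j, l in itertools.product(cols, repeat=2)
    ])


# Hypothetical strictly positive table, for illustration only.
table = np.array([[30.0, 10.0, 5.0], [12.0, 25.0, 8.0], [4.0, 9.0, 27.0]])
n_rows, n_cols = table.shape
phi = iac_uniform(table)
log_ors = all_log_odds_ratios(table)
sd_log_or = math.sqrt((log_ors ** 2).mean())   # the mean log-odds ratio is zero
altham = math.sqrt((log_ors ** 2).sum())
# Multiplying rows and columns by arbitrary positive factors changes the margins only.
rescaled = table * np.array([[2.0], [0.5], [3.0]]) * np.array([[1.0, 4.0, 0.25]])
```

With these definitions, `sd_log_or` equals `2 * phi`, `altham` equals `2 * n_rows * n_cols * phi`, and `iac_uniform(rescaled)` equals `phi`, illustrating both the link with the odds ratios and the insensitivity to margins.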
Using Arbitrary Weighting
Following the approach used above, we now define the SOR with arbitrary weighting, which generalizes the results of the previous section. For simplicity’s sake, as before, we present the specific case of marginal weighting but the demonstrations hold, unless otherwise indicated, for any set of strictly positive weights that sum to unity (on condition that the
This second version of the SOR, denoted
Similar to the previous section, we observe that the quadruple sum is a weighted generalization of the square of the Altham index. By replacing in equation 20 all the
Using the same procedure as for the uniform-weighted index in equation 17, we can establish the link between the weighted geometric standard deviation of odds ratios and the weighted intrinsic association coefficient. Indeed, since the weighted row and column sums of the
Again, we find that the weighted intrinsic association coefficient, defined in equation 6, is equal to half the logarithm of the weighted geometric standard deviation of all odds ratios that can be constructed from a table:
As already indicated, we can see that by replacing, in equations 21 and 22,
Relation to the Unidiff Model
Placing the Altham index and the intrinsic association coefficient in a common framework is particularly useful, as it unifies the descriptive approach of the Altham index and the parametric approach of log-linear and log-multiplicative modeling. One area where the similarity is striking is the analysis of variations in the overall strength of the association over the last dimension (layer) of a three-way table.
Indeed, the indices presented above are directly related to the association represented by the log-multiplicative layer effect model (Xie 1992), better known as the Unidiff model (Erikson and Goldthorpe 1992).
The proportions expected under this model follow the equation, with
For a given layer
It follows that the ratio between the intensities of the associations relating to layers
Therefore, if we denote by index
The same properties hold for the Altham index (Zhou 2015) because of its direct relation with the intrinsic association coefficient shown in equation 19.
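The proportionality between layer coefficients and layer-specific indices can be illustrated on simulated tables. The sketch below assumes the usual Unidiff structure, in which the log proportions of layer k are the sum of layer-specific row and column effects and a common interaction pattern multiplied by a positive layer coefficient. The interaction values, marginal effects and coefficients are hypothetical, and the coefficient is computed with uniform weighting under the assumed definition (root mean square of the double-centered log proportions).

```python
import math

import numpy as np


def iac_uniform(table):
    """Uniform-weighted intrinsic association coefficient (assumed definition)."""
    logp = np.log(np.asarray(table, dtype=float))
    lam = (logp - logp.mean(axis=1, keepdims=True)
           - logp.mean(axis=0, keepdims=True) + logp.mean())
    return math.sqrt((lam ** 2).mean())


# Hypothetical common interaction pattern (3 rows, 4 columns).
psi = np.array([[0.8, -0.2, -0.6, 0.1],
                [-0.3, 0.5, 0.1, -0.4],
                [-0.5, -0.3, 0.5, 0.3]])
# Hypothetical positive layer coefficients.
beta = np.array([1.0, 0.6, 1.5])

layers = []
for k, b in enumerate(beta):
    # Layer-specific marginal effects: they differ across layers on purpose.
    row_eff = np.array([[0.2], [-0.1], [0.4]]) * (k + 1)
    col_eff = np.array([[0.3, -0.2, 0.0, 0.5]]) / (k + 1)
    expected = np.exp(row_eff + col_eff + b * psi)
    layers.append(expected / expected.sum())

phis = np.array([iac_uniform(t) for t in layers])
ratios = phis / phis[0]
```

Here `ratios` coincides with `beta / beta[0]`, whatever the layer-specific margins, because the marginal effects are removed by the double centering.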
It is easy to verify with the same procedure that this property also holds when arbitrary weighting is used, as long as the weights are independent of the layer under consideration. This holds in particular for the weighting by margins of the whole table (average-marginal weighting, see Becker and Clogg 1989), which is an interesting alternative to uniform weighting when one seeks to examine the variations between layers in the intensity of the association independent of the table margins. Let us note that, extending a result highlighted in the first section regarding two-dimensional tables, the values of the index computed with average-marginal weights do not change when combining rows (respectively, columns) with identical conditional distributions. This makes this weighting system particularly appealing.
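The invariance to combining rows with identical conditional distributions can be checked in the two-way case. The sketch below assumes a weighted coefficient defined as the weighted root mean square of the log proportions after weighted double centering, with the table's own marginal proportions as default weights; this definition, the function name and the example counts are assumptions made for illustration. Layer-independent weights such as average-marginal weights would be passed through the optional arguments.

```python
import math

import numpy as np


def iac_weighted(table, row_w=None, col_w=None):
    """Weighted intrinsic association coefficient (assumed definition).

    Weights must be strictly positive and sum to one; by default the
    marginal proportions of the table itself are used.
    """
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    rw = (p.sum(axis=1) if row_w is None else np.asarray(row_w, dtype=float))[:, None]
    cw = (p.sum(axis=0) if col_w is None else np.asarray(col_w, dtype=float))[None, :]
    logp = np.log(p)
    row_mean = (logp * cw).sum(axis=1, keepdims=True)   # weighted mean over columns
    col_mean = (logp * rw).sum(axis=0, keepdims=True)   # weighted mean over rows
    grand_mean = (logp * rw * cw).sum()
    lam = logp - row_mean - col_mean + grand_mean
    return math.sqrt((rw * cw * lam ** 2).sum())


# Hypothetical table whose first two rows are proportional
# (identical conditional distributions across columns).
full = np.array([[20.0, 10.0, 4.0],
                 [50.0, 25.0, 10.0],
                 [6.0, 18.0, 30.0]])
# Same table after combining the first two rows.
merged = np.array([[70.0, 35.0, 14.0],
                   [6.0, 18.0, 30.0]])

phi_full = iac_weighted(full)
phi_merged = iac_weighted(merged)
```

With marginal weights, `phi_full` and `phi_merged` coincide up to floating-point error.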
Using intrinsic association coefficients or Altham indices corresponding to layers therefore allows comparing them in the same way as by using the layer effect coefficients
Finally, when the Unidiff model does not accurately fit the data, these three indices can be used to measure the intensity of the association relating to the various layers without assuming that the structure of this association is homogeneous between layers. In this sense, they are generalizations of the measure provided by the layer effect coefficient of the Unidiff model. This approach can be carried out either by calculating the value of these indices directly from the observed data or by combining them with models more complex than Unidiff, such as the regression-type model (Goodman and Hout 1998) or the RC(M)-L that we describe in the next section.
Relation to RC(M) Association Models
RC(M) Association Model
The intrinsic association coefficient was devised by Goodman for association models (Becker and Clogg 1989; Clogg and Shihadeh 1994; Goodman 1981a, 1985, 1986; Wong 2010): It is thus directly related to these models, and the Altham index inherits this close relation. With the log-multiplicative row-column association model (also known as RC(M) or Goodman’s RC type II model), the expected proportions follow the equation, with
In this equation,
In the weighted version, the equation of the model is:
In an association model, the significance of a dimension is measured by the corresponding intrinsic association coefficient, generally denoted
So the contribution of each dimension to the interaction is
And in the weighted version, using equations 6, 29, and 30 this time:
It can be seen that the intrinsic association coefficient
Similarly, in the weighted version, by equations 6 and 30:
So the overall intrinsic association coefficient (weighted or otherwise) equals the Euclidean norm of the intrinsic association coefficients corresponding to each dimension of the model. An association model is thus a way of decomposing the total association in the table into a series of dimensions of diminishing significance. This decomposition is valid whether the model is saturated or not, as long as it fits the data properly. Association models therefore stand in the same relation to the odds ratio tradition as correlation or correspondence analysis do to the Pearson tradition (Gilula and Haberman 1986; Goodman 1985, 1986, 1991, 1996). Once again, the Altham index is also tightly linked to this approach, though this relation is less direct than for the intrinsic association coefficient.
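The Euclidean-norm relation can be illustrated in the saturated, uniform-weighted case, where the decomposition into dimensions is available in closed form through a singular value decomposition of the double-centered log proportions. This is a sketch under that assumption, not a maximum likelihood fit of an unsaturated RC(M) model such as the logmult package would perform; the function name and the example table are hypothetical.

```python
import math

import numpy as np


def rc_decomposition(table):
    """Saturated uniform-weighted decomposition of the log-table interaction.

    Returns per-dimension coefficients phi_m, row scores mu, column scores nu,
    and the double-centered log proportions lam, such that
    lam[i, j] = sum_m phi_m[m] * mu[i, m] * nu[j, m].
    """
    logp = np.log(np.asarray(table, dtype=float))
    n_rows, n_cols = logp.shape
    lam = (logp - logp.mean(axis=1, keepdims=True)
           - logp.mean(axis=0, keepdims=True) + logp.mean())
    u, s, vt = np.linalg.svd(lam / math.sqrt(n_rows * n_cols), full_matrices=False)
    mu = u * math.sqrt(n_rows)     # row scores with unit uniform-weighted variance
    nu = vt.T * math.sqrt(n_cols)  # column scores with unit uniform-weighted variance
    return s, mu, nu, lam


# Hypothetical strictly positive table, for illustration only.
table = np.array([[30.0, 10.0, 5.0], [12.0, 25.0, 8.0], [4.0, 9.0, 27.0]])
phi_m, mu, nu, lam = rc_decomposition(table)
phi_total = math.sqrt((lam ** 2).mean())        # overall coefficient
phi_from_dims = math.sqrt((phi_m ** 2).sum())   # Euclidean norm of the dimensions
```

The values in `phi_m` are sorted in decreasing order, so the dimensions appear in order of diminishing significance, and the last one is numerically zero because a double-centered matrix has rank at most min(I, J) − 1.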
Extension to RC(M)-L Association Model
RC(M)-L models (Clogg 1982; Wong 2010) are an extension of RC(M) models to three-dimensional tables: The intrinsic association coefficient and/or scores can vary from one layer to another. One version of this model postulates that the association is identical for all layers (homogeneous scores and intrinsic association coefficients); another, that it differs entirely between layers (heterogeneous scores and intrinsic association coefficients). These two versions reduce, in the first case, to a single RC(M) model (but with layer-specific marginal parameters) and, in the second, to as many RC(M) models as there are layers.
Only the third version of the RC(M)-L model requires an extension of the approach presented so far. This version of the model assumes that the scores are homogeneous between layers but that the intrinsic association coefficients are heterogeneous. Its equation, with
or, in its weighted version,
Only the first two constraints of equations 28 and 30, applying to the scores, are required: Cross-dimensional constraints can no longer be applied and there is generally a nonzero correlation between the scores in different dimensions. Consequently, the reasoning followed in equations 34 and 35 cannot be used. The relation between the intrinsic association coefficients for each dimension and the overall intrinsic association coefficient is not a simple summation: It must take into account the correlation between dimensions. In equations 34 and 35, the respective terms:
and
corresponding to the sum of the products of the intrinsic association coefficients and the correlations between (respectively) the row and column scores of the dimensions taken two at a time do not generally equal zero. For example, the intensity of association on a given layer depends on the positive, negative, or zero correlation between the scores and the intensities of the different dimensions. Intuitively, one dimension may offset the association represented by another if it is strong enough and the two dimensions have sufficiently different scores. Note too that the intrinsic association coefficients here may be negative, which amounts in practice to inverting the sign of the row or column scores and thus inverting the direction of the link compared to layers where the coefficient was positive.
Despite this greater complexity, which is due to the richness of the RC(M)-L model, both the intrinsic association coefficient and the Altham index can always be calculated separately, for the overall association and for each dimension. The analysis of the correlation between dimensions can also be of interest to better understand the variations of the overall association.
In conclusion, note that, as with the Unidiff model, analysis of the differences in association independently of marginal variations between layers can be achieved by adopting either uniform weighting or weighting by the average row and column margins of the table (rather than by the margins of each layer).
Application: Educational and Socioeconomic Homogamy among European Regions
This section illustrates the usefulness of the association indices presented above for analyzing the spatial variations of educational and socioeconomic homogamy among European regions (see Bouchet-Valat 2018a, for a more complete analysis). Like intergenerational social mobility, homogamy has typically been studied in the literature using marginal-free methods such as log-linear models and other odds ratio-based techniques (for international comparisons, see Domański and Przybysz 2007; Katrňák, Fučík, and Luijkx 2012; Katrňák, Kreidl, and Fónadová 2006; Park and Smits 2005; Raymo and Xie 2000; Smits 2003; Smits, Ultee, and Lammers 1998, 1999, 2000). Multiple families of log-linear and log-multiplicative models have been used by different authors, so that no straightforward comparison of the results is possible. The association indices presented in this article would allow summarizing the model results in a single figure giving the overall strength of homogamy in each studied society, despite the variety of the chosen modeling strategies.
The example presented here will also highlight a risk which researchers may run when trying to use models to obtain a single measure of the strength of the association. Often, only relatively restrictive models will provide such a summary parameter: The log-multiplicative layer effect (Unidiff) estimates a layer coefficient, the log-multiplicative row-column association model (RC-L) estimates an intrinsic association coefficient, and the distance log-linear model estimates a step parameter. When these simple models do not fit the data adequately, more complex models may be more appropriate, like the regression-type log-multiplicative model (Goodman and Hout 1998), or multidimensional association models (like RC(M)-L models). Even more frequently, cell-specific parameters will have to be introduced for the main diagonal of the homogamy table in order to account for the varying intensity of homogamy between groups; these parameters may also be country specific. In these cases, no single measure of the strength of homogamy in a given country can be obtained. Researchers may then be tempted either to analyze the determinants of one of the components of the association, and ignore the others (as did Domański and Przybysz [2007] by regressing step parameters and leaving aside diagonal parameters); or to use simpler models which may not give a completely accurate description of the data. In what follows, we illustrate this risk using the Unidiff model.
The illustration is based on the analysis of educational and socioeconomic homogamy tables for 149 subnational regions of the European Union (NUTS1 and NUTS2 levels, each comprising between 800,000 and 7 million people) for the years 2014–2016. These tables have been computed from the corresponding waves of the European Union Labour Force Survey, covering 26 European Union member States: Austria (AT), Belgium (BE), Bulgaria (BG), Croatia (HR), Czech Republic (CZ), Cyprus (CY), Estonia (EE), France (FR), Germany (DE), Greece (GR), Hungary (HU), Ireland (IE), Italy (IT), Latvia (LV), Lithuania (LT), Luxembourg (LU), the Netherlands (NL), Norway (NO), Poland (PL), Portugal (PT), Romania (RO), Slovakia (SK), Slovenia (SI), Spain (ES), Sweden (SE), and the United Kingdom (UK). Cohabiting couples (both married and unmarried) have been identified within each household using partner identifiers. To ensure the reliability of the information on occupations and stability of the rate of individuals in a relationship, only couples in which both partners are aged 30–59 years are considered. The sample comprises 1,400,000 couples for educational homogamy and 1,100,000 couples for socioeconomic homogamy (with regional samples ranging from 1,000 to 55,000).
The educational levels of the partners are measured in the International Standard Classification of Education (ISCED) 2011, in four categories: lower secondary or less (ISCED 0–2, including short vocational education); upper secondary (ISCED 3); lower tertiary (ISCED 4–6: up to and including bachelor’s); and upper tertiary (ISCED 7–8: master’s and beyond). The socioeconomic groups of the partners are measured using the European Socio-economic Groups classification (ESeG, see Meron and Amar 2014) in seven categories: managers; professionals; technicians and associated professionals; small entrepreneurs; clerks and skilled service employees; industrial skilled employees; and less skilled employees. For individuals not employed at the time of the survey, we use information on the last occupation (available only for those who worked within eight years before the survey).
One major question in the comparative literature on homogamy concerns its relationship with the level of economic development. Studies have sought to test the empirical validity of the inverted U curve hypothesis (Smits et al. 1998), according to which educational homogamy would increase in the first stage of development, but then decrease in a later stage. A variation of this hypothesis posits that a stabilization will be observed at the highest levels of development (saturation hypothesis). We will only deal with the part of the curve concerning advanced economies, to which European Union countries belong. To this end, two independent variables will be used. First, the average disposable income per inhabitant (in purchasing power parity) as computed by Eurostat for NUTS1 and NUTS2 regions in 2006 is used to measure economic development at the regional level. Second, we classify regions according to whether they contain the capital city of their country or a large metropolis. This second variable will allow us to distinguish the role of economic development per se and that of the peculiarities of very dense regions which are generally richer but also present higher inequality levels and are large enough so that intergroup contacts may not be as developed as in less populated regions (meeting opportunity effect).
Measuring Homogamy
Multiple approaches can be used to measure the strength of relative homogamy (i.e., controlling for the population structure in each region). We may opt for a fully nonparametric approach by computing the intrinsic association coefficient (or equivalently the Altham index) directly on the observed data. We may also retain a semiparametric estimator of the intrinsic association coefficient, like the Bayesian shrinkage method proposed by Zhou (2015) for the Altham index, whose principle is to estimate log-odds ratios for a given region more accurately by “borrowing strength” from the tables for other regions. Finally, we may also fit several models to the data and choose the one which provides the most accurate description according to classical criteria; the indices can then be computed on the fitted tables. We illustrate all three approaches in order to compare their results below. In all cases, we use average-marginal weighting.
As usual, we start with the conditional independence model, which only controls for the marginal distribution of men and women among the seven socioeconomic groups or the four educational categories (respectively) in each region but does not allow for any tendency to relative homogamy. The equation of this model is:
Fit statistics confirm that this model does not accurately describe the data (Table 1), with, respectively, 23 percent and 21 percent of misclassified couples (dissimilarity index) for education and socioeconomic group. The second model, called the stability model, extends the first one by allowing for an association common to all regions once margins have been controlled. Its equation is:
Table 1. Fit Statistics for Log-linear and Log-multiplicative Models.
Note. df = degrees of freedom;
This model improves the fit significantly, with only 5.3 percent and 6.2 percent of misclassified couples and a clear reduction in both the Bayesian information criterion (BIC) and the Akaike information criterion (AIC).
To measure geographic variations in the strength of relative homogamy, we have to find models which fit the data better than the stability baseline. Since the socioeconomic groups cannot be unequivocally ordered, the log-multiplicative layer effect model or Unidiff (Erikson and Goldthorpe 1992; Xie 1992) is a natural choice. As mentioned above, this model is a good candidate for our purpose since it provides a single coefficient for each region, measuring the intensity of relative homogamy assuming that the structure of the row-column interaction is the same in all regions. This model follows the equation:
The Unidiff model reveals significant variations of relative homogamy between regions, as both the AIC and the BIC decrease very clearly. However, the improvement to the description of the data is modest: The proportion of misclassified couples goes down by only two percentage points for education and by less than one percentage point for socioeconomic group.
Does the assumption on which the Unidiff model rests, namely that the pattern of the association is the same in all regions, hold? Clearly not, as a fourth model including one additional parameter for each cell on the main diagonal of the table for each region reduces the share of misclassified couples by about two percentage points and improves the fit according to both the BIC and the AIC for education, and according to the AIC for socioeconomic homogamy. The equation of this model is:
Even this more complex model fails to fit the data according to the AIC (as the strongly positive value shows that the saturated model should be preferred), which indicates that more statistically significant deviations remain. We will not try to find better models here since our goal is precisely to evaluate how close relatively classic models are to the nonparametric and semiparametric estimators.
The association indices presented in this article make the comparison of the estimates of the strength of the association provided by the above models very straightforward. Indeed, despite the different specifications, we can simply compute the values of the indices based on the fitted counts of the models (i.e., the
The comparison shows that the results obtained using the different methods are very similar overall (Table 2). For both types of homogamy, the weakest correlation across regions is observed between the nonparametric estimator and the Unidiff model: .82 for education and .87 for socioeconomic group. This is expected since these are, respectively, the least and the most restrictive estimation methods. This correlation is already quite high, indicating that using the Unidiff model as an approximation would globally yield correct results. The strongest correlations are observed between the nonparametric and the shrinkage estimators, at .97 for both education and socioeconomic group. The Unidiff model with diagonal parameters also agrees very closely with the nonparametric and shrinkage estimators (from .93 to .97).
Table 2. Correlations between the Indices Estimated via Four Different Methods.
Note. Correlations are weighted using the population size of each region.
For both types of homogamy, the average of the coefficients across all regions (Table 3) is the highest with the nonparametric approach (.72 for education and .57 for socioeconomic group) and the lowest with the standard Unidiff (respectively, .66 and .51). This is consistent with the correlation between these two methods being the lowest. The Bayes shrinkage estimator is midway between the two extremes in both cases, and the Unidiff with diagonal parameters is closer to one or the other indicator depending on the type of homogamy considered.
Table 3. Average, Dispersion, and Range of the Indices Estimated via Four Different Methods.
Note. Averages are weighted using the population size of each region.
One advantage of the intrinsic association coefficient is that the number of categories used to measure each type of homogamy does not mechanically affect the level of the index, allowing for (cautious) comparisons between different types of homogamy. Here, educational homogamy consistently appears stronger than socioeconomic homogamy, by 25–35 percent depending on the estimator. 13 This also holds when considering each region separately, in 120 to 129 of the 149 regions.
The standard deviation of the association across regions and the difference between the maximum and minimum associations are the lowest for the shrinkage estimator. This is expected, since by definition this estimator brings log-odds ratios closer to their European average. Both are considerably higher for the nonparametric estimator, which is again expected. More surprisingly, the standard Unidiff model estimator has the largest standard deviation and range for socioeconomic group. Unlike the Bayesian shrinkage estimator, the Unidiff model therefore does not always provide conservative estimates of deviations from the average.
The normalized variant of the index varying between 0 and 1 presented in equation 15 can be used to ease the interpretation of these results. For the nonparametric approach, the average association of .72 for education gives a normalized coefficient of .52; the average association of .57 for socioeconomic group gives a normalized coefficient of .45. Regional normalized coefficients range from .37 to .72 for education and from .33 to .58 for socioeconomic group. According to standard effect strength conventions for the Pearson correlation and contingency coefficients, these associations would range from medium to very large (Cohen 1988, ch. 7).
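The conversion between the two scales can be sketched as follows. The sketch assumes that the normalization inverts the bivariate normal relation φ = ρ / (1 − ρ²) between the intrinsic association coefficient φ and the correlation coefficient ρ; this is an assumption made for illustration (equation 15 is the authoritative definition), although it does reproduce the rounded averages quoted above. Function names are hypothetical.

```python
import math

def normalized_from_intrinsic(phi):
    """Normalized coefficient in [0, 1) from the intrinsic association
    coefficient, ASSUMING the relation phi = rho / (1 - rho**2)."""
    if phi == 0:
        return 0.0
    # Positive root of phi * rho**2 + rho - phi = 0
    return math.sqrt(1 + 1 / (4 * phi ** 2)) - 1 / (2 * phi)

def intrinsic_from_normalized(rho):
    """Inverse conversion under the same assumption (requires rho < 1)."""
    return rho / (1 - rho ** 2)

education = normalized_from_intrinsic(0.72)       # average for education
socioeconomic = normalized_from_intrinsic(0.57)   # average for socioeconomic group
```

Because the mapping is one-to-one, the two scales rank regions identically; only the spacing between values changes.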
It is interesting to note that even though differences between estimation methods are limited, the standard Unidiff model, which is the method most commonly used in the literature to estimate the overall level of association, tends to slightly underestimate the mean association, even if maximum values are in some cases higher than with other methods. We can therefore conclude that while Unidiff remains a useful tool, non- and semiparametric estimators or more complex models should be preferred.
Accounting for Variations in Relative Socioeconomic Homogamy
In order to illustrate the fact that the relatively limited differences between estimators of the association observed above can lead to substantively different conclusions, we now turn to the analysis of the macro determinants of the intensity of relative socioeconomic homogamy. For simplicity, we will not cover the case of educational homogamy, since for this dimension differences between estimators are less marked. The full analysis is available in a separate article (Bouchet-Valat 2018a).
We take the logarithm of the intrinsic association coefficient obtained by each of the four estimation methods as the dependent variable in an ordinary least squares regression model. This is appropriate since the index cannot take negative values; independent variables therefore have a multiplicative effect on the association level. As discussed above, the model includes as independent variables the disposable income per inhabitant and its square (the variable was standardized so that its mean is zero and its standard deviation is one) and whether the region includes a capital city or a second-tier metropolitan area. Finally, we introduce country fixed effects (i.e., one dummy variable for each country) so that the coefficient estimates reflect the deviation in the strength of relative homogamy with reference to the country average.
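The structure of this regression can be sketched as follows. This is a minimal illustration rather than the code used for the article: country fixed effects are handled by subtracting country means (which gives the same slope estimates as including one dummy per country), the data are made up and noise-free, and all names (`fit_log_index`, `demean_by_group`, `solve`) are hypothetical. Bootstrap confidence intervals are not covered.

```python
import math
from collections import defaultdict

def demean_by_group(values, groups):
    """Subtract each group's mean (the 'within' transformation)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]

def solve(a, b):
    """Solve a small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_log_index(index, regressors, groups):
    """OLS of log(index) on the regressors with group fixed effects.
    Returns exponentiated coefficients, i.e., multiplicative effects."""
    y = demean_by_group([math.log(v) for v in index], groups)
    cols = [demean_by_group(c, groups) for c in regressors]
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    return [math.exp(b) for b in solve(xtx, xty)]

# Made-up, noise-free data: two countries with four regions each
groups = ["A"] * 4 + ["B"] * 4
income = [-1.5, -0.5, 0.5, 1.5, -1.0, 0.0, 1.0, 2.0]   # standardized income
income_sq = [x * x for x in income]
capital = [0, 0, 0, 1, 0, 0, 1, 0]                     # capital-region dummy
country_level = {"A": 0.5, "B": 0.7}
index = [country_level[g] * 0.94 ** x * 1.05 ** (x * x) * 1.10 ** c
         for g, x, c in zip(groups, income, capital)]

effects = fit_log_index(index, [income, income_sq, capital], groups)
```

An exponentiated coefficient of 1 indicates no effect; a value of 1.10 for the capital dummy, for instance, would correspond to an index 10 percent higher than in the reference regions of the same country.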
The comparison of the R² for the four models (Table 4) shows that proportions of explained variance are similar, from .82 to .88. This very high figure is due to the inclusion of country fixed effects. The within-country R² (i.e., the share of the variance not explained by country fixed effects which is explained by the full model) varies more significantly, from .15 for the nonparametric estimator down to .10–.11 for the other three estimators.
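The difference between the two measures of fit can be illustrated with a small sketch, following the definition given in parentheses above. It uses a single regressor and made-up values for simplicity; the function names are hypothetical.

```python
from collections import defaultdict

def demean_by_group(values, groups):
    """Subtract each group's mean from its members."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]

def r2_overall_and_within(y, x, groups):
    """Return (overall R2, within-group R2) for a model with group fixed
    effects and one regressor x."""
    yd = demean_by_group(y, groups)
    xd = demean_by_group(x, groups)
    slope = sum(a * b for a, b in zip(xd, yd)) / sum(a * a for a in xd)
    # Residual sum of squares of the full model (fixed effects + x)
    ss_full = sum((a - slope * b) ** 2 for a, b in zip(yd, xd))
    # Residual sum of squares of the model with fixed effects only
    ss_fixed_effects = sum(a * a for a in yd)
    mean_y = sum(y) / len(y)
    ss_total = sum((v - mean_y) ** 2 for v in y)
    return 1 - ss_full / ss_total, 1 - ss_full / ss_fixed_effects

# Made-up values: two countries whose average levels differ strongly
groups = ["A", "A", "A", "B", "B", "B"]
x = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]
y = [1.0, 1.2, 1.1, 3.0, 3.1, 3.3]
overall, within = r2_overall_and_within(y, x, groups)
```

When between-country differences dominate, the overall figure is close to 1 regardless of the regressors, which is why the within-country figure is the more informative one for comparing estimators.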
Table 4. Linear Regression Results for Relative Socioeconomic Homogamy.
Note. 95 percent normal bootstrap confidence intervals in parentheses. The model is fitted on 146 regions of the 149 due to the nonavailability of independent variables.
aStandardized variable (zero mean and unit standard deviation).
This result is consistent with the fact that estimated coefficients are generally farther away from 1 (indicating no effect) for the nonparametric estimator. The effect of level of metropolization varies in a nonnegligible way across estimating methods. Capital regions are characterized by a higher socioeconomic homogamy than regions with no metropolis by 7 percent according to the shrinkage and standard Unidiff estimators, by 10 percent according to the nonparametric estimator, and by 13 percent according to the Unidiff with diagonal parameters estimator. This effect is statistically significant at the 5 percent level for all estimators. Regions with a second-tier metropolitan area also show a higher homogamy by 4–5 percent.
Differences between estimators are even more visible regarding the effect of the level of development. Using the nonparametric estimator, the coefficients for disposable income (.94, significant at the 5 percent level) and its square (1.05, also significant) imply that moving from a disposable income two standard deviations below the average to a value equal to the average decreases relative homogamy by 27 percent, and that moving from the average to two standard deviations above average slightly increases relative homogamy (by 7 percent). 14 Socioeconomic homogamy therefore tends to decline with economic development but stabilizes above the average level of development. A similar, though weaker, effect (borderline significant at the 5 percent level) is observed using the shrinkage (respectively, −20 percent and +10 percent) and Unidiff with diagonal parameters (−24 percent and +12 percent) estimators. By contrast, when the association is measured using the standard Unidiff model, the effect of disposable income is so small (.98) that it is no longer statistically significant at the 10 percent level. 15 Only its square has a significant positive effect, implying that socioeconomic homogamy is the highest for both the least and the most developed regions within a country, but that no decreasing trend is detected.
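The percentages quoted for the nonparametric estimator follow directly from the two exponentiated coefficients, as the short sketch below shows. It assumes only that the reported coefficients are multiplicative effects per unit of standardized income and of its square; the function name is hypothetical.

```python
def relative_change(coef_income, coef_income_sq, x_from, x_to):
    """Relative change in the association index when standardized income
    moves from x_from to x_to, given exponentiated coefficients."""
    def level(x):
        # Multiplicative contribution of income and its square at value x
        return coef_income ** x * coef_income_sq ** (x * x)
    return level(x_to) / level(x_from) - 1

# Coefficients reported for the nonparametric estimator
below_to_average = relative_change(0.94, 1.05, -2, 0)
average_to_above = relative_change(0.94, 1.05, 0, 2)
```

The first value is about −0.27 and the second about +0.07, matching the 27 percent decrease and 7 percent increase reported in the text.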
These results illustrate the impact that small inaccuracies in the model-based measurement of the association can have on the results of subsequent analyses. Even if the overall correlations between the association measures obtained using the four different methods are quite high (over .8), the assumption of a common pattern of the association across all European regions does not really hold, which becomes more visible when considering within-country differences. This problem is likely to be more severe in cases where the standard Unidiff model does not fit the data as accurately as it does in the present study (dissimilarity index of 5.6 percent). Using the standard Unidiff model, we would have (incorrectly) concluded that a U-shaped relationship exists between level of development and relative socioeconomic homogamy, while the comparison with other estimators of the association shows that homogamy actually stabilizes rather than increases at higher levels of development. This result is consistent with that obtained for educational homogamy with all four estimating methods (Bouchet-Valat 2018a).
We have shown how a general-purpose marginal-free index of the association like the intrinsic association coefficient could be used to compare the results obtained using various model specifications, as well as using a semi- or nonparametric approach. This can be particularly useful to carry out a sensitivity analysis testing multiple modeling assumptions. Depending on the researcher’s needs, this index can be used as a way of summarizing the strength of the association described by a chosen model whose coefficients are discussed in detail, or as a way to measure the strength of the association without fitting any model to the data. Association indices offer a summary measure of the level of the association, while models are most useful to describe its patterns in a more fine-grained way or to test specific hypotheses.
Conclusion
We have presented three closely related marginal-free indices of the association in a contingency table: the Altham index, the intrinsic association coefficient, and a normalized variant of the latter index. The Altham index has been used several times in recent empirical works, but its relationship to odds ratios and to log-linear and log-multiplicative models (in particular, association models) had not been developed systematically until now. By contrast, the intrinsic association coefficient was originally proposed by Goodman in the context of row-column (RC) association models and later identified by him as a fundamental quantity, equivalent in the odds ratio framework to the Pearson mean square contingency coefficient φ² or to Cramér’s V. Yet it has not been used in empirical applications. We have shown that this index is actually equal to the standard deviation of all the log-odds ratios that can be constructed from a table and that it is directly related to the layer coefficient estimated by the Unidiff model. Finally, we have proposed a normalized variant of the intrinsic association coefficient varying between 0 and 1, which is equivalent to the correlation coefficient under a bivariate normal distribution, in order to make the interpretation and the presentation of results more intuitive.
Despite the very strong links between the three indices, it seems that the intrinsic association coefficient should be preferred to the Altham index. Indeed, the Altham index mechanically increases with the number of rows and columns of the table, making its scale somewhat arbitrary and its interpretation difficult. This is not the case for the two variants of the intrinsic association coefficient, which on the contrary allow for (careful) comparisons across different classifications, for example, between different socioeconomic classifications, or even between socioeconomic and educational dimensions of homogamy or social mobility. Other advantages can be highlighted: The intrinsic association coefficient fits very well in the framework of association models, since it appears directly in their equations; the normalized version varying between 0 and 1 is measured on a more easily interpretable scale.
We hope that these indices can be useful for empirical research regarding at least three aspects. First, they allow comparing the overall strength of the association as predicted by several, possibly very different log-linear or log-multiplicative models. This is particularly useful for models which do not provide a single parameter summarizing the strength of the association. As we have shown above regarding educational and socioeconomic homogamy among European regions, these indices therefore make it easy to test multiple specifications and check whether results are robust, which can in some cases prevent drawing incorrect conclusions.
Second, using one of the indices proposed in the present article will make it possible to compare results of several studies after the fact (as in a meta-analysis). This is currently hindered by the diversity of models used in the literature, even when the research questions and methods are very similar (as in the case of homogamy and intergenerational mobility). To this end, the insensitivity of the intrinsic association coefficient to the dimensions of the table is essential.
Third, the standardization of the measurement of the association on a single quantity should help establish the credibility of the sociological approach to phenomena such as intergenerational mobility, notably in comparison with economic approaches based on the intergenerational elasticity or correlation coefficient. Any of the indices described here can be used for this purpose, since one index can easily be translated into the other just from published tables.
Finally, let us note that extensions of the intrinsic association coefficient can be devised to decompose the overall association into a symmetric component and a skew-symmetric component. The index can very naturally be combined with various quasi-symmetric specifications and with the skew-symmetric log-multiplicative association model proposed by van der Heijden and Mooijaart (1995). Further work would also be in order to derive confidence intervals for the nonparametric and semiparametric estimators of the indices.
Supplemental Material
Supplemental Material, SMR-2019 for General Marginal-free Association Indices for Contingency Tables: From the Altham Index to the Intrinsic Association Coefficient by Milan Bouchet-Valat in Sociological Methods & Research
Footnotes
Author’s Note
A previous version of this work has been presented at the 2015 Spring Meeting of the Research Committee on Social Stratification and Mobility (ISA RC28) in Tilburg (Netherlands). The author would like to thank Louis-André Vallet and Richard Breen for their comments.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
Notes
References