We would first like to thank Gerhard Tutz and Jan Gertheiss for their profound review of regularization methods for categorical variables, either as covariates or as response variables in regression models. Categorical variables are rather the rule than the exception in regression analyses, particularly in the medical, social and economic sciences. It is amazing that it took quite a long time from ridge regression (Hoerl and Kennard, 1970) to versions of regularization methods, which take into account the specific structure of categorical covariates. As Tutz and Gertheiss predominantly discuss penalized estimation, we will focus on the Bayesian perspective to regularization and effect fusion for categorical covariates.
Regularization and sparsity in regression type models have been addressed from a Bayesian point of view in the literature, see for example, Fahrmeir et al. (2010) for an overview on Bayesian regularization and (George and McCulloch 1995), (Ishwaran and Rao 2005) and (O'Hara and Sillanpäa 2009) for approaches to Bayesian variable selection. However, the specific issues arising for categorical covariates have not yet received much attention, with the exception of Chipman (1996), who considers a Bayesian approach for groupwise selection of level effects of a categorical covariate (the Bayesian pendant to Section 3.2.1 of Tutz and Gertheiss 2016) and the recently proposed Bayesian pendant to smoothing predictors and effect fusion based on spike and slab priors in Pauger and Wagner (2016).
Here, we will discuss the relation between regularization penalties and prior distributions and then describe a Bayesian approach for smoothing of level effects of categorical covariates. Finally, we will briefly describe spike and slab prior distributions which encourage a sparse representation of their effects.
Bayesian regularization of effects of categorical covariates
Specification of a Bayesian linear regression model requires not only a model for the data, for example, the linear regression model (3) of Tutz and Gertheiss (2016),
but also a prior distribution on the model parameters , that is, . The joint posterior distribution is then given as
respectively taking logarithms as
Obviously, the log-posterior corresponds to the penalized log-likelihood with
Consider, for example, a prior distribution of the structure with a Normal distribution for the regression coefficients, . Then the log-posterior is given as
and the Bayesian posterior mode estimate is equivalent to the regularized ML-estimate with .
Bayesian regularization is based on specifying a prior distribution where the hyperparameters include parameters controlling complexity, shrinkage or smoothing, see Fahrmeir et al. (2010). An advantage of Bayesian regularization is that no cross-validation is needed, as these hyperparameters are estimated jointly with the parameters of interest and, hence, uncertainty on the regularization parameters can also be assessed.
Therefore, to achieve Bayesian regularization for the effects of categorical covariates, appropriate prior distributions for the coefficients are needed. As pointed out in Tutz and Gertheiss (2016), to smooth effects of categorical covariates, it is essential to take into account their scale: For ordinal covariates, coefficient smoothing is equivalent to shrinking effect differences of adjacent categories to zero. In contrast, there is no natural order of the categories for a nominal covariate and, therefore, smoothing of coefficients implies to shrink all pairwise effect differences to zero. We discuss prior distributions for Bayesian regularization based on the appropriate effect differences for ordinal covariates in Section 2.1 and for nominal covariates in Section 2.2. To specify these prior distributions, following Tutz and Gertheiss (2016), we assume that the effect of the reference category is defined as .
Ordinal covariates
According to Tutz and Gertheiss (2016), Section 3.1.1., if all covariates are ordinal, smoothing of the effects of adjacent categories can be enforced by a quadratic penalty on the effect differences ,
Here, is the vector of all effect differences and denotes the effect differences of the -th covariate. The quadratic penalty (2.2) corresponds to the specification of independent Normal priors on all which share the scale parameter , i.e.
To perform Bayesian effect smoothing, it is natural to allow for covariate specific scale parameters. We assume prior independence across covariates and specify the prior on hierarchically as
where is a unit matrix of dimension and denotes the inverse Gamma distribution.
The reparameterized effects are the regression effects for split-coded levels of the -th covariate. Both and the regression effects for dummy-coded levels are of the same dimension and uniquely related by
where is a lower triangular matrix of dimension with elements . Hence, the resulting prior distribution on is , where
Nominal covariates
As noted above, for nominal covariates, no natural order of the categories exists and, hence, smoothing of category effects implies shrinking of all effect differences
to zero. To construct such a prior, we denote by the vector of all effect contrasts. The first elements of are the effect contrasts with respect to the reference category and correspond to and hence , where is the vector of all effect contrasts not involving the reference category. Elements of are linearly dependent, as
Hence, the smoothing prior for can be constructed from a prior distribution on all effect differences by taking into account the linear restrictions
where is the appropriate coefficient matrix and is the corresponding vector of zeros.
Slightly deviating from the corresponding prior for an ordinal covariate, the prior on the effect differences is specified as
where takes into account the number of levels. As shown in Pauger and Wagner (2016), the resulting prior on under the linear restrictions (2.4) is where
Posterior inference
In a linear regression model with different types of covariates (metric, ordinal, nominal) assuming prior independence across covariates, the resulting prior distribution for the vector of all regression coefficients is a Normal distribution , where is block-diagonal with elements . With standard priors for the intercept and the error variance , posterior inference using Markov chain Monte Carlo (MCMC) methods is straightforward and requires only standard Gibbs sampling steps:
Sample from the full conditional multivariate Normal distribution.
Sample the error variance from the full conditional inverse Gamma distribution.
For , sample the scale parameter from its full conditional inverse Gamma distribution.
Application
To illustrate Bayesian effect smoothing, we reanalyze the Munich rent standard data used in Gertheiss and Tutz (2010). The response variable monthly rent per square meter is modelled as a function of five binary (V1–V5), four ordinal (V6: ‘year of construction’, V7: ‘number of rooms’, V8: ‘quality of residential area’ and V9: ‘floor space’ in ordered classes) and one nominal covariate (V10: ’urban district’ with 25 categories), see Gertheiss and Tutz (2010) for a more detailed description of the data set. To perform Bayesian analysis, we choose priors for the regression coefficients as described in Sections 2.1 and 2.2 and uninformative priors for the intercept and the error variance . We set and vary the scale parameters in a wide range. Figure 1 shows paths of the resulting coefficients as a function of the corresponding scale parameter .
Coefficient paths for the ordinal predictors ’year of construction’, ’number
of rooms’, ’quality of residential area’ and ’floor space’ and the nominal predictor ’urban district’.
As the expected prior variance of the effect differences decreases with , we see increasing shrinkage of effect differences to zero and smoothing of the level effects. However, different from smoothing and fusion penalties, except for very small values of , effect differences are not exactly zero.
Spike and slab priors
Sparsity in the representation of the effects of a categorical covariate can be achieved by replacing the Normal prior on the elements of by spike and slab priors. These priors are a mixture of two components: a spike at zero which shrinks small coefficients aggressively to zero and a slab component with its mass spread over a range of plausible values, which induces only minor shrinkage. Spike and slab priors are specified hierarchically by introducing indicator variables which take the value one, if is assigned to the slab component. The prior on is given as
where is a large value. The inclusion probability can be fixed or assigned a Beta prior . We subsume all indicators associated with the -th covariate in the vector .
For a nominal covariate, we derive in Pauger and Wagner (2016) the structure of the prior variance matrix of and show that implies that the prior partial correlation of and is close to .
Posterior sampling requires to add only the following two simple sampling steps to the scheme described in Section 2.3, if a hyperprior is specified for :
For sample the binary indicators from and update the prior variance matrix
For sample from its full conditional Beta distribution;
otherwise sampling is omitted.
To perform Bayesian effect fusion for the Munich rent data, we set , , and for all covariates. We run MCMC for 20 000 iterations after a burnin of . Level effects are fused based on an estimated posterior inclusion probability and the resulting model (Model 1) is reported in Table 1.
Munich rent data, Model 1: Model selected by Bayesian effect fusion (grouping of level effects is indicated by spaces)
For covariates and , the same levels as in the model selected in Gertheiss and Tutz (2010), which we will denote by Model (2) are fused. However, fusion is more pronounced in the Bayesian version for the ordinal covariates (7 vs. 8 groups) and (6 vs. 7 groups), and particularly for the nominal covariate , with only two groups under Bayesian fusion and 10 groups using regularization, see also Table 2.
Munich rent data, Model 2: Model selected by regularized regression (grouping of level effects is indicated by spaces)
Finally, in Table 3, we compare the full model, that is, the model with separate regression effects for each covariate level to refits of the two models selected under the different approaches.
Both Models 1 and 2 have considerably fewer parameters as well as lower values of the information criteria Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) than the full model.
AIC is smaller for the model obtained by regularization (Model 2), whereas BIC is smaller
for the model selected by the Bayesian approach (Model 1), which illustrates that the Bayesian version of effect fusion implies a higher penalty.
Munich rent data: Model selection criteria for different models
Number of coefficients
AIC
BIC
Full Model
58
8 651.4
8 983.4
Model 1 (Bayesian Fusion)
22
8 616.5
8 745.9
Model 2 (Fusion Penalty)
32
8 604.3
8 789.9
Footnotes
Acknowledgments
This research was supported by the Austrian Science Fund (FWF), P25850.
References
1.
ChipmanH (1996) Bayesian variable selection with related predictors. The Canadian Journal of Statistics, 24, 17–36.
2.
FahrmeirLKneibTKonrathS (2010) Bayesian regularisation in structured additive regression: A unifying perspective on shrinkage, smoothing and predictor selection. Statistics and Computing, 20, 203–19.