Abstract
This discussion continues the paper by Tutz and Gertheiss (2016) and focuses on the importance of the coding of effects in regularized categorical and ordinal regression. We show that, although an appropriate regularization is profitable for any coding, the choice of a relevant coding can matter more than the choice of the regularization term for revealing structures. We focus on predictors, though the issues raised also apply to responses. We illustrate our point on a classic data set.
Introduction
Regularized approaches have been at the forefront of statistical research in the last two decades. In this context, regression with metric features has been particularly studied (Bühlmann and Van De Geer, 2011; Giraud, 2014). As a result, a number of methods are now available to efficiently cope with the curse of dimensionality and high-dimensional data (the
As Tutz and Gertheiss (2016) state, ‘with the right penalty being chosen, [...] structures in the data can be revealed’. In this discussion, we would like to stress that the choice of a specific coding can also help reveal structures in regularized methods. We show that, with an appropriate combination of effect coding and penalty, it is possible to achieve both good generalization performance and interpretable models. Conversely, an ill-suited coupling of coding scheme and penalty has a detrimental effect on prediction and may lead to misinterpreting the type and the importance of the role of categorical variables in the model. This defeats the original goal of sparsity-inducing regularization, which is generally introduced to make the model more interpretable. Note that although we focus on predictors, the issues raised here also apply to categorical responses.
Coding systems and regularization
In the context of standard (non-regularized) linear models, a categorical predictor is usually represented through several auxiliary variables (e.g., dummy variables). Several choices are possible, arguably the most popular being the dummy coding. Importantly, this choice does not affect the prediction, since the vector space spanned by the design matrix does not depend on the coding. However, in regularized regression, this is no longer the case: applying the same penalty to different codings leads to different predictions, as illustrated in the following section.
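The invariance claimed above, and its loss under regularization, can be checked numerically. The following is a minimal sketch on hypothetical toy data (the data, the penalty level `lam = 5.0` and the helper `fit_predict` are assumptions made for illustration, not taken from the paper): two dummy codings that differ only in their reference level give identical least-squares predictions, but different ridge predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: one categorical predictor with 4 levels, 5 observations per level.
levels = np.repeat(np.arange(4), 5)
y = np.array([0.0, 1.0, 3.0, 3.5])[levels] + rng.normal(scale=0.3, size=levels.size)

# Coding A: intercept + dummy variables, first level as reference.
X_first = np.column_stack(
    [np.ones(levels.size)] + [(levels == k).astype(float) for k in (1, 2, 3)]
)
# Coding B: intercept + dummy variables, last level as reference.
X_last = np.column_stack(
    [np.ones(levels.size)] + [(levels == k).astype(float) for k in (0, 1, 2)]
)


def fit_predict(X, y, lam):
    """Ridge fit with an unpenalized intercept (first column); lam = 0 is least squares."""
    P = np.eye(X.shape[1])
    P[0, 0] = 0.0  # the intercept is not penalized
    beta = np.linalg.solve(X.T @ X + lam * P, X.T @ y)
    return X @ beta


# Largest difference between the fitted values obtained with the two codings.
ols_gap = np.max(np.abs(fit_predict(X_first, y, 0.0) - fit_predict(X_last, y, 0.0)))
ridge_gap = np.max(np.abs(fit_predict(X_first, y, 5.0) - fit_predict(X_last, y, 5.0)))
print(f"max prediction gap without penalty:    {ols_gap:.2e}")
print(f"max prediction gap with ridge penalty: {ridge_gap:.2e}")
```

Both design matrices span the same column space, so the unpenalized gap is zero up to rounding. With the ridge penalty, each coding shrinks the level effects towards its own reference level, so the fitted values no longer agree.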
A simple motivating example
Consider the simple example of one-way ANOVA (Analysis Of Variance) regularized by a ridge penalty: we aim at adjusting a linear model for predicting a vector of observations
Now, consider the two following classic coding schemes for the design of the ANOVA:
The first coding
Clearly, the estimation and the predictions
An adequate coding for categorical variables (including the choice of a reference among the levels) should reflect knowledge about the levels and their relationships. We split the problems into three classes: (a) ordinal variables with quantifiable and known differences between levels, (b) categorical variables without any ordering between levels and (c) ordinal variables without quantified differences between levels.
Besides the coding, prior knowledge should also drive the choice of a particular penalty for regularizing the problem. To be more specific, we stick to the framework of generalized linear models (GLMs), in which case the general regularized regression problem is
Contrasts and codings for a predictor with four levels: Usual contrast (dummy coding) and contrast for adjacent levels, with the corresponding backward difference coding
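The link between a contrast and its coding can be sketched numerically: if the parameters are defined as linear combinations of the level effects through a square matrix, the coding matrix is the inverse of that matrix. The snippet below is an illustrative sketch only; the names `H_dummy`, `H_adjacent` and `coding_from_contrasts` are hypothetical, and taking the mean of the levels as the intercept row of the adjacent-level contrast is an assumption of this sketch, not necessarily the convention of the table.

```python
import numpy as np

K = 4  # number of levels, as in the table

# Usual contrast (dummy coding, first level as reference):
# intercept = effect of level 1, parameter j = (level j + 1) - (level 1).
H_dummy = np.array([
    [1, 0, 0, 0],
    [-1, 1, 0, 0],
    [-1, 0, 1, 0],
    [-1, 0, 0, 1],
], dtype=float)

# Contrast for adjacent levels: parameter j = (level j + 1) - (level j).
# Assumption of this sketch: the intercept row is the mean of the level effects.
H_adjacent = np.array([
    [1 / K, 1 / K, 1 / K, 1 / K],
    [-1, 1, 0, 0],
    [0, -1, 1, 0],
    [0, 0, -1, 1],
], dtype=float)


def coding_from_contrasts(H):
    """Coding matrix C (one row per level) such that level effects = C @ parameters,
    where parameters = H @ level effects; hence C is the inverse of H."""
    return np.linalg.inv(H)


C_dummy = coding_from_contrasts(H_dummy)
C_backward = coding_from_contrasts(H_adjacent)

print("dummy coding (first column is the intercept):")
print(np.round(C_dummy, 2))
print("backward difference coding (first column is the intercept):")
print(np.round(C_backward, 2))
```

Under these assumptions, the non-intercept columns of `C_dummy` are the usual indicator variables of levels 2 to 4, and those of `C_backward` are the classic backward difference codes, whose parameters are the differences between adjacent levels.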
We illustrate here the importance of the coding scheme on a classic data set, which is particularly well-suited to illustrate our message. Indeed, we can confidently postulate that the nine ordinal predictors at play have 10 equally spaced levels each. First, we show that using an appropriate coding with its accompanying penalty, it is possible to do about as well as if an oracle had given the order and the correct differences between levels. Second, we show that using a standard but inappropriate coding and/or an inappropriate penalty can severely deteriorate prediction.
Dummy coding with first level as reference and Lasso penalty.
Dummy coding with first level as reference and group-Lasso penalty.
Dummy coding with no reference and Lasso penalty.
Dummy coding with no reference and group-Lasso penalty.
Backward difference coding with coop-Lasso penalty.
Note that we never penalize the (global) intercept term.
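To make the difference between the three penalties concrete, the sketch below evaluates them on a hypothetical coefficient vector split into two groups, one group per predictor. It assumes the common definitions: the Lasso sums absolute values, the group-Lasso sums the Euclidean norms of the groups, and the coop-Lasso sums, for each group, the norms of the positive and of the negative parts. Group weights are omitted, which is a simplification of this sketch, and the function names are hypothetical.

```python
import numpy as np


def lasso_penalty(beta):
    """Sum of absolute values; ignores the group structure."""
    return float(np.sum(np.abs(beta)))


def group_lasso_penalty(beta, groups):
    """Sum over groups of the Euclidean norm of the group coefficients (unweighted)."""
    return float(sum(np.linalg.norm(beta[g]) for g in groups))


def coop_lasso_penalty(beta, groups):
    """Sum over groups of the norms of the positive and negative parts (unweighted)."""
    total = 0.0
    for g in groups:
        positive_part = np.maximum(beta[g], 0.0)
        negative_part = np.maximum(-beta[g], 0.0)
        total += np.linalg.norm(positive_part) + np.linalg.norm(negative_part)
    return float(total)


# Hypothetical coefficients: the first group is sign-coherent, the second is not.
groups = [np.arange(0, 3), np.arange(3, 6)]
beta = np.array([1.0, 2.0, 2.0, 3.0, -4.0, 0.0])

print("Lasso:      ", lasso_penalty(beta))
print("group-Lasso:", group_lasso_penalty(beta, groups))
print("coop-Lasso: ", coop_lasso_penalty(beta, groups))
```

For a sign-coherent group, the coop-Lasso and the group-Lasso coincide; a group mixing signs is penalized more heavily by the coop-Lasso. With backward difference coding, where the coefficients are differences between adjacent levels, sign coherence within a group corresponds to a monotone effect of the ordinal predictor.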
Predictive performance
In the framework of logistic regression, predictive performance is classically measured by the classification error and the binomial deviance of the fit. To estimate these two quantities, we randomly split the original data set (
Estimated performance for the breast cancer data set in terms of binomial deviance (left panel) and classification error (right panel) for numeric coding with maximum-likelihood (the ideal case), backward difference coding with coop-Lasso, dummy coding (without reference and with first level as reference) with Lasso and group-Lasso.
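For reference, the two performance measures used in this comparison can be computed from held-out labels and predicted probabilities as sketched below. The function names, the 0.5 decision threshold, the clipping constant and the use of the summed (rather than averaged) deviance are assumptions of this sketch; the toy labels and probabilities are hypothetical and unrelated to the breast cancer data.

```python
import numpy as np


def binomial_deviance(y, p, eps=1e-12):
    """Binomial deviance, -2 times the Bernoulli log-likelihood, summed over observations."""
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    return float(-2.0 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))


def classification_error(y, p, threshold=0.5):
    """Proportion of observations whose thresholded prediction differs from the label."""
    predicted = (p >= threshold).astype(int)
    return float(np.mean(predicted != y))


# Hypothetical held-out labels and predicted probabilities.
y_test = np.array([0, 0, 1, 1])
p_hat = np.array([0.1, 0.6, 0.8, 0.5])

print("binomial deviance:   ", binomial_deviance(y_test, p_hat))
print("classification error:", classification_error(y_test, p_hat))
```

The deviance rewards well-calibrated probabilities, whereas the classification error only depends on which side of the threshold each prediction falls, which is why the two panels may rank methods slightly differently.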
The combination of backward difference coding with the coop-Lasso penalty does almost as well as the reference MLE model in terms of deviance and classification error, and better than all other regularization alternatives, which do not take into account the order between levels. Note that in a number of resamples, the MLE suffers from a lack of regularization, which explains the large dispersion of the observed MLE deviance. In contrast, thanks to regularization, the dispersion of the coop-Lasso deviance is smaller. We thus observe here that an appropriate combination of the coding system and the regularization penalty is fruitful.
Dummy codings with the first level as an arbitrary reference lead to solutions that are clearly dominated by all the other ones. Interestingly, dummy codings without a reference level, which still do not account for the ordering of levels, lead to a striking 30 to 40 per cent improvement of the classification error and deviance compared to the previous dummy coding option. Note that the use of a reference in dummy coding is something of a legacy from the non-regularized set-up, where it numerically stabilizes the optimization process. In the regularized set-up, we suggest avoiding such a reference: not only may it strongly deteriorate predictions, but it also jeopardizes coefficient interpretation, as seen in the following section.
Finally, we observe that group penalties, which introduce a coupling between the levels of a predictor, compare favourably to the Lasso penalty for the two dummy coding schemes. However, this beneficial effect is rather marginal compared to the large differences observed between coding alternatives: among the combinations explored here, choosing the right coding is more instrumental than fine-tuning the penalty to introduce an appropriate bias.
We have shown that a relevant combination of coding and penalty pays off in terms of predictive performance; we now show that it also benefits the interpretability of the model, which is a key asset of sparsity-inducing regularization. To this end, we display the regularization paths, describing the solutions as the penalty parameter
We adjust all methods on the whole data set. Here also, the reference model is logistic regression with numeric coding. The regression parameters for the logistic regression with numeric codings are reported in Table 2, together with the significance of predictors, as assessed by a
Estimation of the coefficients for the logistic regression with numeric codings
Figure 2 summarizes the overall regularization paths by predictor for the three regularized methods. For groupwise penalties applied to numerical variables, groupwise regularization paths are usually displayed by plotting the
Cancer dataset: Regularization paths of predictor importance versus shrinkage factor.
Figure 2 compares the predictor importances estimated by the models fitted using: dummy coding using the first level as reference with a Lasso penalty (left), the same coding with a group-Lasso penalty (middle) and backward difference coding with a coop-Lasso penalty (right). We note that the picture is clearer and much more consistent across the different values of the penalty parameter for the last option. In particular, for large values of the penalty parameter, dummy coding with Lasso and group-Lasso assigns a large importance to the ‘Single.Epithelial.Cell.Size’ predictor, which then declines for less penalized models. This predictor is not detected as significant by the MLE (Table 2), and a closer analysis (not shown) reveals that one level of ‘Single.Epithelial.Cell.Size’ is considered highly relevant when the order information between levels is not taken into account. The model fitted with backward difference coding combined with the coop-Lasso, which takes into account the order of levels, never assigns a large importance to the ‘Single.Epithelial.Cell.Size’ predictor.
We now show that even predictors that seem to be similarly handled by all methods when looking at the summary provided by Figure 2 may, in fact, be handled quite differently. For this purpose, we display in Figure 3 the parameters attached to each level for the ‘Clump.Thickness’ predictor, which is found significant by MLE, as shown in Table 2. These parameters are the regression coefficients for dummy coding and a reparametrization of the regression coefficients for backward difference coding.
These regularization paths reveal that for the two variants using dummy coding (left and middle), the parameters are not coherent with the numeric levels of the ‘Clump.Thickness’ predictor and not self-consistent across
Cancer dataset: Regularization paths of the coefficients pertaining to the levels of the ‘Clump.Thickness’ predictor versus shrinkage factor.
To summarize, we have shown that choosing an appropriate coding is at least as important as choosing the right penalty. We advocate refraining from using arbitrary reference levels in codings, since most penalties will introduce a bias towards the resemblance of the other levels to the reference, which may have important detrimental effects on predictive performance and interpretation. In the experiments reported here, backward difference coding with the coop-Lasso penalty appears to be the best choice. This combination turned out to be appropriate for the underlying structure of the categorical predictors at hand. Note that we do not state in any way that this combination should usually be preferred to other ones. Our message is rather that, for regularized methods, exploring codings can be at least as sensible as exploring penalty terms.
