Abstract
This is a discussion on the article ‘Regularized Regression for Categorical Data’ by Tutz and Gertheiss.
Summary
I congratulate Tutz and Gertheiss (hereafter TG) on a comprehensive review of regularized regression for modelling categorical data. A categorical variable differs from a metric variable in that at least one dummy variable is needed for each category when it is used in a model. This means that even when the total number of categorical variables is small, the number of parameters needed to represent the categories can be huge. As a consequence, when the structures of these variables are of interest, the notion of sparsity is more complex than that for metric variables, and extra care is needed.
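To make the parameter growth concrete, the following minimal sketch counts coefficients under dummy coding. It assumes reference-category coding (one dummy fewer than the number of categories), and the category counts in `levels` are hypothetical rather than taken from the food spending data.

```python
from itertools import combinations

def count_parameters(levels, interactions=False):
    """Count coefficients (excluding the intercept) under dummy coding.

    levels: number of categories of each categorical predictor.
    Assumes one reference category per predictor, so a predictor
    with k categories contributes k - 1 main-effect coefficients.
    """
    main = sum(k - 1 for k in levels)
    if not interactions:
        return main
    # Each pair of predictors contributes (k_a - 1) * (k_b - 1) interaction terms.
    pairwise = sum((a - 1) * (b - 1) for a, b in combinations(levels, 2))
    return main + pairwise

levels = [10, 8, 6, 5, 4]  # hypothetical category counts for five predictors
n_main = count_parameters(levels)
n_with_interactions = count_parameters(levels, interactions=True)
print(n_main, n_with_interactions)
```

With only five predictors, adding two-way interactions already raises the count from a few dozen to several hundred coefficients.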
Tutz and Gertheiss (2016) have investigated two general classes of models in which categorical variables appear either as predictors or as responses. Since the main aim of their article is to introduce various penalties that can be useful for modelling categorical data, I focus on some of the main issues related to this strategy and on the more general question of modelling, motivated by the food spending data in the article.
Main comments
What model to use?
For the food spending data, presumably, there are infinitely many models one can fit. These models include the linear regression model as in equation (1) of TG with only the main effects, or the model in equation (1) with two-way interactions, or the ‘varying-coefficient’ model as in equation (9) of TG and many more. A data-driven approach to choose among these models is to use cross-validation (CV) and then pick the model that gives the smallest CV error. Chances are, however, that multiple models would fit the same data equally well. My first question is whether general guidelines could be provided as to what model(s) would be suitable for high-dimensional data, particularly for the food spending data. This is an important question for practitioners.
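As an illustration of the CV comparison described above, here is a minimal sketch that contrasts a main-effects model with a two-way interaction model by K-fold CV. The data are simulated, the helper names `one_hot` and `kfold_cv_error` are hypothetical, and unpenalized least squares stands in for whatever fitting method is actually used.

```python
import numpy as np

def one_hot(codes, n_levels):
    """Dummy-code integer categories, dropping level 0 as the reference."""
    return np.eye(n_levels)[codes][:, 1:]

def kfold_cv_error(X, y, n_folds=5, seed=0):
    """Mean squared prediction error of least squares, estimated by K-fold CV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        X_tr = np.column_stack([np.ones(train.sum()), X[train]])
        X_te = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        beta, *_ = np.linalg.lstsq(X_tr, y[train], rcond=None)
        errors.append(np.mean((y[test_idx] - X_te @ beta) ** 2))
    return float(np.mean(errors))

# Simulated data: two categorical predictors with 4 and 3 categories.
rng = np.random.default_rng(1)
n = 300
A = one_hot(rng.integers(0, 4, n), 4)
B = one_hot(rng.integers(0, 3, n), 3)
y = A @ np.array([1.0, 0.0, -1.0]) + B @ np.array([0.5, 0.5]) + rng.normal(0, 1, n)

X_main = np.column_stack([A, B])
# Interaction columns are all products of one dummy from A and one from B.
X_inter = np.column_stack([A, B, np.einsum("ij,ik->ijk", A, B).reshape(n, -1)])

cv_errors = {
    "main effects": kfold_cv_error(X_main, y),
    "two-way interactions": kfold_cv_error(X_inter, y),
}
best = min(cv_errors, key=cv_errors.get)
print(cv_errors, best)
```

When the two CV errors are close, the choice of `best` depends on the random fold assignment, which is one way of seeing why several models can appear to fit equally well.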
What penalty to use?
Having decided on the model or models to use, one is often faced with additional difficulties, one of which is overparametrization: many parameters are needed, and we do not usually have enough data to estimate them. This is where regularization is useful, as it trades unbiasedness for smaller variance. Smoothing or ridge-type regularization serves this goal. Recent years have seen an increasing emphasis on reducing variance by setting coefficients to zero, that is, by selection, as this often produces interpretable models. Nevertheless, as correctly pointed out by TG, a choice must be made as to which penalty to use. For the food spending data in particular, should we use a smoothing penalty, a selection penalty, a combined smoothing and selection penalty, a fusion penalty, or some mixture of these? The answer should presumably be guided by some prior beliefs. A difficulty is that after examining Figures 1 and 2, I have little idea which penalty would make more sense and why.
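To show how these penalty types treat the same coefficients differently, the sketch below evaluates generic versions of each on one hypothetical coefficient vector. The function names and exact forms are illustrative assumptions; TG's definitions may include weights or scaling not shown here.

```python
import numpy as np

def smoothing_penalty(beta):
    """Sum of squared differences of adjacent coefficients (ridge-type, ordinal)."""
    return float(np.sum(np.diff(beta) ** 2))

def selection_penalty(beta):
    """Group-lasso-type penalty: Euclidean norm of the whole coefficient group."""
    return float(np.linalg.norm(beta))

def fusion_penalty(beta, ordinal=True):
    """L1 norm of differences: adjacent pairs if ordinal, all pairs if nominal."""
    if ordinal:
        return float(np.sum(np.abs(np.diff(beta))))
    i, j = np.triu_indices(len(beta), k=1)
    return float(np.sum(np.abs(beta[i] - beta[j])))

# Hypothetical coefficients for one five-category predictor,
# with the first (reference) category fixed at zero.
beta = np.array([0.0, 0.0, 1.0, 1.0, 3.0])

print(smoothing_penalty(beta))              # penalizes large jumps most heavily
print(selection_penalty(beta))              # zero only if the whole group is zero
print(fusion_penalty(beta))                 # counts jumps between adjacent categories
print(fusion_penalty(beta, ordinal=False))  # counts differences between all pairs
```

The smoothing penalty shrinks adjacent coefficients towards each other without making them equal, the selection penalty can remove the predictor as a whole, and the fusion penalties can merge categories.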
What criterion to use?
Having decided on the model(s) and the penalty to use, the next important question is how to regularize. Should one use an unweighted penalty or one of its weighted versions? Should one use CV, as in TG, or an information criterion for choosing the tuning parameters in the regularization? Leng et al. (2006) have highlighted that prediction-optimal criteria do not usually give models that have the same support as that of the true model. The CV approach used in the data analysis in TG is a prediction-based method and is bound, in the asymptotic sense, to select more variables than necessary. An attractive alternative is to use the Bayesian information criterion (BIC), or its variants when the dimension is high, as advocated by Wang et al. (2009). The issue with the BIC is that it may be model dependent, in the sense that it is sensitive to the assumptions made on the model error. In the context of the repeated measurements data discussed in TG, there is no obvious way to define the BIC for the fixed effects model when the within-subject correlation is ignored.
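For reference, a minimal sketch of the BIC under an assumed Gaussian linear model with independent errors is given below. The function name `gaussian_bic` and the simulated data are hypothetical, and the formula is the standard n log(RSS/n) + df log(n) form, not necessarily the variant used by Wang et al. (2009).

```python
import numpy as np

def gaussian_bic(y, fitted, df):
    """BIC for a Gaussian linear model with independent errors.

    y: observed responses; fitted: fitted values; df: number of
    estimated coefficients (the model's degrees of freedom).
    """
    n = len(y)
    rss = float(np.sum((y - fitted) ** 2))
    return n * np.log(rss / n) + df * np.log(n)

# Simulated illustration: the response depends on the first two predictors only.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 6))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

bic_by_size = {}
for size in range(1, 7):
    # Nested candidate models using the first `size` predictors.
    beta, *_ = np.linalg.lstsq(X[:, :size], y, rcond=None)
    bic_by_size[size] = gaussian_bic(y, X[:, :size] @ beta, df=size)

selected_size = min(bic_by_size, key=bic_by_size.get)
print(bic_by_size, selected_size)
```

Each extra coefficient costs log(n) here, so an added variable is kept only if it reduces the residual sum of squares enough. The dependence on the independent Gaussian error assumption is visible in the formula itself, which is why the criterion is not obviously defined when within-subject correlation is ignored.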
Mixed models for smoothing and selection
The connection between a generalized additive model and a mixed model allows one to estimate the tuning parameter in the former via (restricted) maximum likelihood when the effects of categorical predictors are to be smoothed. This connection is briefly surveyed in the article. When the main interest is to select predictors, we can in principle replace the i.i.d. Gaussian assumption on the differences in Section 3.2.1 by an i.i.d. Laplace assumption. A disadvantage of this approach, however, is that it is not immediately clear how one may derive the maximum likelihood estimators of the tuning parameters, as this new formulation does not give a tractable approximation to the marginal likelihood. I wonder whether a principled approach using the mixed model formulation can be developed when the main interest is in selection.
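The correspondence between the distributional assumption and the penalty can be written out as follows. The notation is assumed for illustration and need not match Section 3.2.1 of TG: the differences are written as δ_r = β_r − β_{r−1} for r = 1, ..., k, with variance τ² in the Gaussian case and scale b in the Laplace case.

```latex
% Gaussian differences lead to a quadratic (smoothing) penalty
\[
\delta_r \overset{\text{i.i.d.}}{\sim} N(0, \tau^2)
\;\Longrightarrow\;
-\log p(\delta) = \frac{1}{2\tau^2} \sum_{r=1}^{k} \delta_r^2 + \text{const}.
\]
% Laplace differences lead to an absolute-value (fusion-type) penalty
\[
\delta_r \overset{\text{i.i.d.}}{\sim} \text{Laplace}(0, b)
\;\Longrightarrow\;
-\log p(\delta) = \frac{1}{b} \sum_{r=1}^{k} |\delta_r| + \text{const}.
\]
```

In the Gaussian case with a Gaussian response, integrating the random effects out of the joint density is a Gaussian integral with a closed form, so the variance components, and hence the tuning parameter, can be estimated by (restricted) maximum likelihood. With Laplace differences the corresponding integral has no such closed form, which is the difficulty described above.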
Mixed effects or fixed effects?
In Section 5, TG have differentiated a fixed effects model from a mixed effects model. If one is dealing with repeated measures data, as in the fixed effects model in equation (19), I wonder whether there is a natural way of taking into account the correlations among the repeated measurements. Presumably, the model in equation (21) was fitted using a likelihood that assumes the repeated measures are independent. While this is acceptable as far as the consistency of the estimated parameters is concerned, a loss in estimation efficiency seems inevitable if correlations are present. The model in equation (21) also suggests an interesting way to cluster longitudinal profiles of the subjects by penalizing pairwise differences of
Conclusion
For the food spending data, it would be good to have definite answers as to which model(s) are recommended, which penalty to regularize with and which criterion to use for determining the tuning parameter(s).
Via the method of regularization, we now have an extra set of tools for dealing with categorical variables in a high-dimensional set-up, as clearly demonstrated by TG. Tutz and Gertheiss are to be congratulated again for bringing to our attention the versatility of these powerful tools.
