Oracle inequalities provide probability loss bounds for the lasso estimator at a deterministic choice of the regularization parameter and are commonly cited as theoretical justification for the lasso and its ability to handle high-dimensional settings. Unfortunately, in practice, the regularization parameter is not selected to be a deterministic quantity, but is instead chosen using a random, data-dependent procedure, often making these inequalities misleading in their implications. We discuss general results and demonstrate empirically for data using categorical predictors that the amount of deterioration in performance of the lasso as the number of unnecessary predictors increases can be far worse than the oracle inequalities suggest, but imposing structure on the form of the estimates can reduce this deterioration substantially.
We would like to congratulate Gerhard Tutz and Jan Gertheiss (hereafter TG) on their impressive tour through the world of regularized regression for categorical data. Categorical predictors arise often in practice, bringing with them problems of potentially overspecified models that cannot be well-fit without the imposition of additional structure in the model. Regularization methods provide a natural way to impose such structure, as TG demonstrate.
Deterioration of the lasso
There is no question that regularization methods have become very popular in recent years. This popularity has come from three related factors: the availability of very high-dimensional data, the fact that (unlike previously available methods) regularization approaches can handle such data and the belief that the consequence of using such methods (compared to classically ‘optimal’ ones) is small.
This belief comes from so-called ‘oracle inequalities’, probability bounds on the loss
$$\frac{\|X(\hat\beta_\lambda - \beta)\|_2^2}{n},$$
where $X$ is the $n \times p$ matrix of predictors, $\beta$ is the true vector of regression coefficients, $\hat\beta_\lambda$ is the lasso-estimated vector of coefficients for a specific choice of the regularization parameter $\lambda$ and $\|\cdot\|_2^2$ is the squared Euclidean norm. Roughly, for a particular deterministic choice of $\lambda$, these probability bounds are of the form
$$\frac{\|X(\hat\beta_\lambda - \beta)\|_2^2}{n} \le C\,\frac{\sigma^2 s_0 \log p}{n}$$
(Bühlmann and van de Geer, 2011: 102). Here $\sigma^2$ is the true error variance, $s_0$ is the number of non-zero slope coefficients and $C$ is a constant that does not depend on . Apart from the $\log p$ term and the constant, this equals the loss expected if an oracle told us the true set of predictors and we fit least squares, so in this sense, this factor is the (relatively small) price that is paid to gain the wide applicability of the lasso (and in particular the ability to select out a large number of unneeded predictors). See, for example, Bühlmann (2013), Candès and Plan (2009) and Vidaurre et al. (2013).
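The oracle benchmark mentioned above can be made explicit. As a sketch, assume the linear model $y = X\beta + \varepsilon$ with mean-zero, uncorrelated errors of common variance $\sigma^2$, and that the loss is normalized by $n$. The symbols $S$ (the true set of predictors, of size $s_0$), $X_S$ (the corresponding columns of $X$, assumed to have full column rank) and $P_S$ are introduced here only for illustration:

```latex
\begin{align*}
X\hat\beta^{\mathrm{OLS}}_S - X\beta &= P_S\,\varepsilon,
  \qquad P_S = X_S (X_S^\top X_S)^{-1} X_S^\top, \\
\mathbb{E}\,\frac{\|X\hat\beta^{\mathrm{OLS}}_S - X\beta\|_2^2}{n}
  &= \frac{\sigma^2\,\mathrm{tr}(P_S)}{n}
   = \frac{\sigma^2 s_0}{n}.
\end{align*}
```

Under these assumptions the oracle least squares fit has expected loss $\sigma^2 s_0 / n$, so a bound that exceeds it only by a constant and a $\log p$ factor looks like a modest price for not knowing $S$.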
Unfortunately, these inequalities do not correspond to practical implementation of the lasso. In practice, λ is selected using a data-dependent method, such as using an information measure, generalized cross-validation or k-fold cross-validation (see, for example, Fan and Li, 2001; Wang et al., 2007; Zou et al., 2007; Zhang et al., 2010; and Flynn et al., 2013, for applications of these selectors to penalized regression estimators). Chatterjee (2014) showed that the existing theoretical results do not apply to data-dependent choice of λ, so it is unclear what oracle inequalities say about practical performance of the lasso.
We (Flynn et al., 2016) examined the effect of spurious predictors on the lasso when using a data-dependent choice of $\lambda$. Specifically, in the simple case of orthonormal predictors and a single predictor with non-zero coefficient, the probability that the loss of the lasso using the optimal data-dependent choice of $\lambda$ when including $p$ predictors is larger than that when using the optimal choice based on the one predictor with non-zero slope is arbitrarily close to 1 for an appropriately high signal to noise ratio and large $p$; specifically,
Theorem: For all ,
where is the cumulative distribution function of a standard normal random variable.
That is, the lasso deteriorates when additional superfluous predictors are added to the model being fit even when the amount of regularization is chosen optimally, something that is not true for estimation based on classical variable selection. Further, the expected amount of deterioration is infinite:
Theorem: For all ,
Computing the probability of deterioration in the more general case where the true coefficient vector is $s_0$-sparse (i.e., exactly $s_0$ predictors have non-zero coefficients) is possible with an exact formula for . In Flynn et al. (2016), we demonstrate this calculation for and investigate more realistic situations using simulations. The simulations show that the amount of deterioration can be far greater than oracle inequalities would imply, highlighting the danger in adding predictors with zero coefficients to the model, and the potentially misleading nature of thinking of $\log p$ as the ‘price’ of using the lasso to weed out large numbers of (unnecessary) potential predictors.
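The orthonormal single-signal setting can be explored numerically. The following is a minimal sketch, not the calculation in Flynn et al. (2016): it assumes that with orthonormal predictors the lasso reduces to soft thresholding of $z_j = \beta_j + \text{noise}$, and it replaces the exact optimal $\lambda$ with a grid search. The function names, the grid size and the signal value `beta1=5.0` are illustrative assumptions.

```python
import random


def soft_threshold(z, lam):
    """Lasso estimate of one coefficient when the predictors are orthonormal."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def best_loss(z, beta, lam_grid):
    """Smallest squared-error loss attainable over the supplied lambda grid."""
    return min(
        sum((soft_threshold(v, lam) - b) ** 2 for v, b in zip(z, beta))
        for lam in lam_grid
    )


def deterioration_probability(p, beta1, sigma=1.0, n_rep=300, n_grid=100, seed=0):
    """Monte Carlo estimate of P(best loss with p predictors > best loss with one)."""
    rng = random.Random(seed)
    beta = [beta1] + [0.0] * (p - 1)  # only the first coefficient is non-zero
    worse = 0
    for _ in range(n_rep):
        # Under orthonormality, z_j = beta_j + noise is all the lasso needs.
        z = [b + rng.gauss(0.0, sigma) for b in beta]
        lam_max = max(abs(v) for v in z)  # shrinks every estimate to zero
        lam_grid = [lam_max * g / n_grid for g in range(n_grid + 1)]
        loss_one = best_loss(z[:1], beta[:1], lam_grid)  # true predictor only
        loss_all = best_loss(z, beta, lam_grid)          # all p predictors
        worse += loss_all > loss_one
    return worse / n_rep


if __name__ == "__main__":
    for p in (2, 10, 50):  # hypothetical numbers of candidate predictors
        print(p, deterioration_probability(p, beta1=5.0))
```

Because both fits share one grid and each spurious predictor adds a non-negative term to the loss, the best loss with all $p$ predictors can never fall below the best loss with the true predictor alone. The function therefore estimates how often it is strictly larger.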
Categorical predictors
It is clear that this problem is a potentially serious one for data with categorical predictors, as each such predictor with $k$ categories adds $k - 1$ variables to the design. For example, the data set analyzed by TG in Section 2 has only 22 categorical predictors, but this corresponds to 91 predictors from the point of view of the underlying regression model. The inclusion of interaction effects would obviously further magnify these numbers greatly. On the other hand, the group-related penalties discussed by TG effectively reduce the available degrees of freedom by imposing structure on the allowed slope estimates, potentially reducing the seriousness of this problem. In this section, we explore this question using simulations.
The underlying structure here is fairly simple. There are $p$ potential categorical predictors, $p = 2$, 3 or 4, each having five levels (note that here $p$ is the number of categorical predictors, not the number of underlying variables used to code any effects in fitted regressions). Only the first predictor has any predictive power for the response , with for the five levels. The underlying model is for the th observation, with and . The design is balanced, with observations at each of the design points, making the total sample size . On each simulation run, for a given estimator, first the correct true model is fit (i.e., the one based only on the first predictor) and the minimum average squared error of the fits (over a grid of values of the regularization parameter ranging from zero to the value that shrinks all the estimated slope coefficients to zero) is recorded, and then the model using all $p$ predictors is fit and the minimum average squared error of the fits is determined. The ratio of these two values is the deterioration in best possible performance that comes from adding unnecessary predictors, and those values are averaged over the simulation replications. Categorical variable effects are fit based on indicator variable coding, and models for can also include interaction effects. All runs are based on 1 000 simulation replications.
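The size of the indicator-coded design grows quickly once interactions are allowed. As a small sketch (the function name and arguments are hypothetical), the number of non-intercept columns for $p$ five-level predictors with all effects up to a given interaction order can be counted as follows, assuming each predictor contributes four indicator variables and each interaction contributes the product of its factors' indicator counts:

```python
from math import comb


def n_design_columns(n_predictors, n_levels=5, max_order=1):
    """Count indicator columns (excluding the intercept) for effects up to max_order."""
    per_predictor = n_levels - 1  # indicator variables per categorical predictor
    return sum(
        comb(n_predictors, order) * per_predictor ** order
        for order in range(1, max_order + 1)
    )


if __name__ == "__main__":
    for p in (2, 3, 4):
        main_only = n_design_columns(p, max_order=1)
        full_model = n_design_columns(p, max_order=p)
        print(p, main_only, full_model)
    # prints:
    # 2 8 24
    # 3 12 124
    # 4 16 624
```

The full-interaction counts equal $5^p - 1$, while the true model needs only four columns.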
Table 1 gives the results for different estimators. The first set of numbers refers to the ordinary lasso, which does not take into account any of the categorical (group) structure. It is apparent that even in the case where $p = 2$, there is serious deterioration in the optimal performance of the lasso, particularly when the two-way interaction is included (note that in that case, the number of potential predictors is 24, compared to the need for only four in the true model). As $p$ increases, deterioration even with only main effects goes up, and when including interactions (corresponding to 124 predictors for the highest-order interaction model when $p = 3$ and 624 predictors when $p = 4$, respectively), the optimal average squared error can be 15 to 30 times that when using only the first predictor. These patterns (and those for the other estimators) are relatively insensitive to sample size and the value of , based on simulation results not presented here.
Table 1. Simulated average deterioration for different estimators

| Estimator | Model | p = 2 | p = 3 | p = 4 |
|---|---|---|---|---|
| Lasso | Main effects only | 2.457 | 3.706 | 4.489 |
| Lasso | Two-way interactions | 7.011 | 11.149 | 14.828 |
| Lasso | Three-way interactions | | 15.818 | 23.411 |
| Lasso | Four-way interaction | | | 28.036 |
| Group Lasso | Main effects only | 2.003 | 2.182 | 2.028 |
| Group Lasso | Two-way interactions | 2.789 | 2.850 | 3.070 |
| Group Lasso | Three-way interactions | | 2.959 | 3.163 |
| Group Lasso | Four-way interaction | | | 3.180 |
| Ordinal group Lasso | Main effects only | 1.452 | 1.491 | 1.361 |
Of course, the ordinary lasso does not take into account the group structure implied by the categorical predictors. The picture is very different for the group lasso, the next entries in the table. While there is still substantial deterioration, with the optimal loss when using unnecessary predictors as much as three times that when using only the first predictor, it is much smaller than that for the ordinary lasso. It is also not very sensitive to the complexity of the model being fit, even for the model including the four-way interaction when $p = 4$. The ordinal group lasso can only handle models with main effects, but while deterioration as $p$ increases is still noteworthy, it is more limited, corresponding to squared error less than 50% higher than when modelling based on the first predictor alone. This is consistent, of course, with the notion that putting more structure on the types of models allowed to be fit results in less of a problem from including unnecessary predictors.
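The difference between coordinate-wise and group-wise shrinkage can be illustrated in a stylized setting. The sketch below is not the simulation reported in Table 1: it assumes an orthonormal design, in which the lasso becomes coordinate-wise soft thresholding and the group lasso becomes block soft thresholding of each group of four coefficients. The effect values, function names and grid size are hypothetical choices made for illustration.

```python
import math
import random


def lasso_fit(groups, lam):
    """Coordinate-wise soft thresholding (ordinary lasso, orthonormal design)."""
    return [[math.copysign(max(abs(v) - lam, 0.0), v) for v in g] for g in groups]


def group_lasso_fit(groups, lam):
    """Block soft thresholding: each group is shrunk, or zeroed, as a unit."""
    fitted = []
    for g in groups:
        norm = math.sqrt(sum(v * v for v in g))
        scale = max(1.0 - lam / norm, 0.0) if norm > 0.0 else 0.0
        fitted.append([scale * v for v in g])
    return fitted


def best_loss(fit, z, beta, lam_grid):
    """Smallest squared-error loss of `fit` over the supplied lambda grid."""
    best = math.inf
    for lam in lam_grid:
        est = fit(z, lam)
        loss = sum((e - b) ** 2
                   for eg, bg in zip(est, beta) for e, b in zip(eg, bg))
        best = min(best, loss)
    return best


def mean_deterioration(fit, n_groups, beta1, sigma=1.0, n_rep=200, n_grid=100, seed=0):
    """Average ratio of best loss with all groups to best loss with group 1 only."""
    rng = random.Random(seed)
    # Only the first group carries signal; the remaining groups are pure noise.
    beta = [list(beta1)] + [[0.0] * len(beta1) for _ in range(n_groups - 1)]
    total = 0.0
    for _ in range(n_rep):
        z = [[b + rng.gauss(0.0, sigma) for b in bg] for bg in beta]
        # The largest group norm zeroes every estimate under both penalties.
        lam_max = max(math.sqrt(sum(v * v for v in g)) for g in z)
        lam_grid = [lam_max * k / n_grid for k in range(n_grid + 1)]
        loss_true = best_loss(fit, z[:1], beta[:1], lam_grid)
        loss_all = best_loss(fit, z, beta, lam_grid)
        total += loss_all / loss_true
    return total / n_rep


if __name__ == "__main__":
    effects = [1.0, 2.0, 3.0, 4.0]  # hypothetical effects for a five-level factor
    for name, fit in (("lasso", lasso_fit), ("group lasso", group_lasso_fit)):
        print(name, [round(mean_deterioration(fit, g, effects), 3) for g in (2, 3, 4)])
```

Each ratio is at least one by construction, since both fits share one grid and noise groups can only add to the loss. Averages of ratios can be noisy, so any comparison between the two penalties from such a sketch should be read as indicative only.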
Conclusion
The message from these results is apparent. Even in a world of ‘big data’, statistical models still matter. The less a data analyst is willing to assume, the bigger the potential price that could be paid by appealing only to automatic dimension reduction. The kinds of estimators discussed by TG can provide the important advantages that carefully reasoned appropriate assumptions about statistical structure offer while still being available for use with high-dimensional data. It is reasonable to think that this message will also apply far more widely than only when dealing with categorical data.
References
1. Bühlmann P (2013) Statistical significance in high-dimensional linear models. Bernoulli, 19, 1212–42.
2. Bühlmann P and van de Geer S (2011) Statistics for High-Dimensional Data. Springer: New York.
3. Candès EJ and Plan Y (2009) Near-ideal model selection by ℓ1 minimization. Annals of Statistics, 37, 2145–77.
4. Chatterjee S (2014) A new perspective on least squares under convex constraint. Preprint arXiv: 1402.0830.
5. Fan J and Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–60.
6. Flynn CJ, Hurvich CM and Simonoff JS (2013) Efficiency for regularization parameter selection in penalized likelihood estimation of misspecified models. Journal of the American Statistical Association, 108, 1031–43.
7. Flynn CJ, Hurvich CM and Simonoff JS (2016) On the sensitivity of the lasso to the number of predictor variables. Preprint arXiv: 1403.4544.
8. Vidaurre D, Bielza C and Larrañaga P (2013) A survey of L1 regression. International Statistical Review, 81, 361–87.
9. Wang H, Li R and Tsai C-L (2007) Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94, 553–68.
10. Zhang Y, Li R and Tsai C-L (2010) Regularization parameter selections via generalized information criterion. Journal of the American Statistical Association, 105, 312–23.
11. Zou H, Hastie T and Tibshirani R (2007) On the ‘degrees of freedom’ of the lasso. Annals of Statistics, 35, 2173–92.