Abstract
Outlying observations are often disregarded at the sacrifice of degrees of freedom or downsized via robust loss functions (e.g., Huber's loss) to reduce the undesirable impact on data analysis. In this article, we treat the outlying status of each observation as a parameter and propose a penalization method to automatically adjust the outliers. The proposed method shifts the outliers towards the fitted values, while preserve the non-outlying observations. We also develop a generally applicable algorithm in the iterative fashion to estimate model parameters and demonstrate the connection with the maximum likelihood based estimation procedure in the case of least squares estimation. We establish asymptotic property of the resulting parameter estimators under the condition that the proportion of outliers does not vanish as sample size increases. We apply the proposed outlier adjustment method to ordinary least squares and lasso-type penalization procedure and demonstrate its empirical value via numeric studies. Furthermore, we study applicability of the proposed method to two robust estimators, Huber's robust estimator and Huberized lasso, and demonstrate its noticeable improvement of model fit in the presence of extremely large outliers.
Introduction
Extensive research has been conducted to deal with outliers that can impede proper summarization of the overall tendency in a data set. The existing methods range from two-step approaches of first identifying the outliers and then reducing its effect (Kosinski 1999; Pena and Yohai 1999) to robust regression procedures (Bloomfield and Steiger 1983; Huber 1973; Beaton and Tukey 1974; Huber and Ronchetti 2009; Chambers and Heathcote 1981). Along the direction of adopting an explicit error distribution to account for the outlier data, Lange et al. (1989) used a heavy-tailed t-distribution to accommodate the outliers. Hadi and Simonoff (1993) provide a good literature review on the topic of outlier detection and robust regressions. More recently, Lee et al.(2012) introduced a penalization method with the parameters indicating the outlying status of all the observations, called ‘case-specific parameters’, and a lasso penalty (Tibshirani 1996) for the case-specific parameters to identify and downsize the outliers in an automated fashion. Case-specific parameters are devised to deal with each observation and should be penalized to avoid over-saturation. The method of Lee et al. (2012) with case-specific parameter turned out to be identical to Huber's estimator (Huber 1973) and Huberized lasso (Rosset and Zhu 2007), depending on the incorporation of lasso penalty. (See Section 3.1 of Lee et al. (2012) for the details.)
In this work, we are particularly interested in robust modelling of a data set containing massive outliers. Specifically, we propose a new penalty function for the case-specific parameters which are tailored to identify the influential outliers and to effectively reduce their impact on the fitted values. For the asymptotic behaviour of the proposed method, we consider the non-standard situation where the proportion of outlying samples remains non-zero as sample size increases.
We start with a classical linear regression model,
When there are massive outliers, diminishing the effect of outliers is often not sufficient. In contrast, the suggested methods have an effect of shifting the detected outliers onto the fitted regression line. Thus, most of the undesirable effects can be removed effectively without deleting the observations, which is our major contribution to the literature. Another advantage could be the broad application of this idea to general modelling procedures beyond ordinary least squares (OLS).
In Section 2.1, the estimation procedure is cast into the convex function optimization (Rockafellar 1997). The estimation of β and
As lasso (Tibshirani 1996) can do variable selection, which is a crucial part of data analysis, we extend the proposed method to lasso and Huberized lasso (Rosset and Zhu 2007) in Section 3. In Section 4, simulation studies are conducted to illustrate the suggested methods and to show the improvement compared to the methods by Lee et al. (2012) and other robust methods. Further simulation studies are contained in the supplementary materials. The applications of the proposed methods to real data sets are given in Section 5, where the first data set is a well-known small data set and the second data set is a larger data set with 3,000 samples.
Model estimation
We set X to be a design matrix of size n by
Note that the form of the penalty for
The aforementioned model identifiability constraint for model (1.2) implies that
Note that the degrees of freedom (df) for OLS model have been traditionally defined as the number of independent residuals, that is, when we use p variables with sample size n, df is
Next,
The convergency of the iterative algorithm for convex function minimization is widely studied by Rockafellar (1997). Specifically, the convergency with case-specific parameters is discussed in Lee et al. (2012). Proposition 3 in Lee et al. (2007) can be directly applied to show the convergence of Algorithm 1 to unique minimizer (
Next, we summarize the algorithm that can provide more general application beyond the discussed squared error loss function. Algorithm 1:
Initialize Update Update Update Iterate between Step 2 and Step 4 until
But, there are certain restrictions of the application such as binomial loss where Y is not quantitative. In this case, it is hard to understand and/or interpret the adjustment in Step 3. We set
A toy example in Figure 1 illustrates how the fitted line and
In Figure 1, only
A toy example for the illustration of Algorithm 1 with 10 observations where the first observation
is highly corrupted. ‘1st fit’ is the initial fit with unadjusted observations, and ‘2nd fit’ is the fit after one iteration. The ‘final fit’ is from the estimates when the Algorithm 1 is converged after two iterations. ‘Deleted’ stands for the fitted line by OLS after removing
. The thick solid line (in red) represents the underlying truth.
Because we target detecting highly contaminated data by non-zero
Under the assumption of
We investigate the asymptotic properties of
Conditions 1 and 2 are the standard moment conditions for X. Condition 3 arises from the assumption that the proportion of outliers
Outlier-shifting in robust loss function
A well-known robust approach for improving the least squares method is Huber's loss function:
The Huber's approach shares the idea of M-quantile (Breckling and Chambers 1988) for robust regression. The two methods have the same form of loss function when M-quantile targets conditional median. Thus, we naturally see that M-quantile regression might get benefit by adopting the proposed outlier-shifting penalty for the quantile regression. We may need to adjust the form of the proposed penalty due to the asymmetric loss function in quantile regression except for the median. The details of adopting the proposed penalty in the light of quantile regression will not be considered here; here is the sketch of the idea. Under the linear model of
In this section, we switch to the often encountered high-dimensional problem with a large number of covariates. To address this problem, the lasso-type penalties are widely used. The idea of case-specific parameters and shifting outlier penalty is directly applicable to lasso (Tibshirani 1996) and Huberized lasso (Rosset and Zhu 2007) to improve robustness. Lasso is defined as finding a minimizer of
To estimate
With initial values of
Rosset and Zhu (2007) considered Huberized lasso where the squared error loss in (3.2) is replaced by Huber's loss function. In fact, Huberized lasso is coincidentally the same as what proposed by Lee et al. (2012) with case-specific parameters. Analogous to Section 3.1, we can add the outlier-shifting penalty to Huberized lasso to intensify the robustness. Then, the outlier-shifting Huberized lasso is defined as the minimizer of
Now, treating the negative log likelihood as a loss function, we consider a specific form of Algorithm 1 with normal likelihood. Because Initialize Update Update Update Update Iterate Steps 2 to 5 until
Interestingly, this outlier adjustment method also shares some similarity with the Expectation–Maximization (EM) algorithm (Dempster et al. 1977) in terms of shrinkage towards the fitted value in an iterative manner. A clear distinction between the two is that the outlier adjustment method automatically determines the unidentified outliers via the tuning parameter
Simulation studies
We conduct two sets of simulations, investigating various sizes and proportions of contaminated observations. The second set is in the supplementary materials. In the simulations, we compare the suggested estimators to existing methods. They are (i) ordinary least squares (OLS); (ii) outlier-shifting least squares (
For
In all the simulation studies, we generate
Point estimate of MSE, and its standard error (in parentheses) multiplied by 1000 based on 500 simulated data sets obtained by OLS,
,
,
,
,
,
,
and
under
= 3, 6 and 10 with the contaminated proportions of 10%, 20% and 30%.
Point estimate of MSE, and its standard error (in parentheses) multiplied by 1000 based on 500 simulated data sets obtained by OLS,
,
,
,
,
,
,
and
under
= 3, 6 and 10 with the contaminated proportions of 10%, 20% and 30%.
As expected, we see that OLS is the most vulnerable to outliers. MSE values obtained by the methods of
Distribution of estimates from 500 simulated samples under the simulation set 1. See the text for the details of simulation setting. Horizontal lines represent the true regression coefficient.
As the four regression parameter penalized methods (
From Table 2, the four methods select more variables than they should as the false positive rates are considerably high. As a consequence, the true positive rates are all close to 1. The false discovery rates from the suggested methods are higher than their competitors.
Average false negative rate of three non-zero coefficients and average false positive rate of five zero coefficients (in parenthesis) based on 500 simulated data sets obtained by
,
,
and
with
= 6 and 10 with contaminated proportions of 10%, 20% and 30%
Finally, we examine the performance of
Averaged probability of identifying the outliers with
based on 500 simulated data sets obtained by
,
,
and
with contaminated proportions of 10%, 20% and 30% and magnitude of
= 3, 6 and 10
Analysis of stack loss data
The stack loss data set (Brownlee 1965) has been examined by many researchers (Daniel and Wood 1980; Chambers and Heathcote 1981; Carroll and Ruppert 1985). The details of pre-analysis can be found in Chapter 5.4 of Daniel and Wood (1971), and using non-linear covariates are included in Denby and Mallows (1977).
The data set has 21 observations with 3 explanatory variables. The data describe the operation of a plant. Researchers have sought to explain the percentage of unconverted ammonia that escapes from the plant (Y) by the flow of cooling air (
Estimates of
,
,
,
for the stack loss data. Data 1, data 2 and data 3 stand for the data set with 21, 19 and 17 observations, respectively
To measure the sensitivity of the estimates as we remove two and four observations, the absolute differences between the minimum and the maximum estimates among the three data sets are calculated. For example, sensitivity of OLS for
Sensitivity of five estimators in Table 5.1 at
,
,
with the stack loss data
Among the four aforementioned methods, the proposed method is the most stable; That is, the amount of change in the estimates is the smallest for the suggested method as we change the sample size by removing two and four outliers.
Figure 3 shows the residual plots from the four methods of OLS,
Residual plots from the four methods of OLS,
,
and
with all observations. In the
, the four observations with
are detected as outliers (observations 1, 3, 4 and 21) and shifted by our method.
For analysis of real data, we numerically compare the eight models (OLS, wage: Workers raw wage in 1,000 USD year: 2003 to 2009 age: 18 or above married: A factor with levels (a) Married and (b) The others (composed of never married, widowed, divorced, separated) race: A factor with levels (a) White, (b) African American and (c) The others edu: A factor with levels (a) Less than high school graduate (b) High school graduate, (c) Some college, (d) College graduate and (e) Advanced degree class: A factor with levels (a) Industrial and (b) Information indicating type of job health: A factor indicating health status of (a) Good or less than good and (b) Very good ins: A factor for status of having health insurance. (a) Yes and (b) No
We perform pre-analysis and modify the original variables slightly and do log transformation of the response variable due to heteroscedasticity. For the variable married, we combine the four categories of never married, widowed, divorced and separated into one category since there seems little difference. Similarly, the others category in race includes Asian which was a separate category in the original data in
The mean of 500
As the improvement by our methods is quite minor, the values of the estimates by the above methods are similar in general. When income is treated as a response variable in economic data, extreme outliers are often detected even if the response variable is log transformed. A robust model may improve the prediction as our analysis shows here.
Mean of
based on 500 random split of mid-Atlantic wage data by models of OLS,
,
,
,
,
,
and
. All values are multiplied by 100.
In this work, we propose a method of automatically adjusting severely outlying observations by using case-specific parameters and shifting outlier penalty to reduce the undesirable impact. We provide a default value for the threshold of the outliers to capture huge outliers and to attain consistency. The outlier shifting trace is demonstrated with the example displayed in Figure 1. In terms of model estimation, we demonstrate the equivalence between the maximum likelihood-based and convex minimization approaches in the case of least squares estimation. The convex minimization approach enables applying the proposed method to a wide range of modelling procedures, such as Huber's regression, lasso and Huberized lasso. Our empirical investigation shows that the advantage of the proposed method in accurately capturing the true regression surface increases with the extent of data contamination. Despite of our focus on linear models in this work, the new method can be tailored to non-linear models involving convex minimization problems. We will extensively investigate more complicated cases in the future work.
Appendix
Footnotes
Acknowledgments
Jung's work was supported by University of Waikato Research Trust (#103406), Lee's work was supported by Hankuk University of Foreign Studies Research Fund of 2015 and Hu's work was partially supported by the National Institute of Health Grants R01GM080503 and R01CA158113.
