Stratified sampling in highly polluted data as an effective and reliable alternative to high breakdown point estimators

Abstract

Observations from many real-life settings include units that are inconsistent with the rest of the data. Such extreme values influence the estimates obtained from conventional estimators, so robust estimators are necessary for efficient parameter estimation. This paper uses stratification with simple random sampling without replacement to optimize sample allocation across strata for efficient parameter estimation, as an alternative method of handling highly contaminated samples. Our proposed method stratifies the highly contaminated population into two non-overlapping sub-populations, from which stratified samples of sizes 50, 200, and 500 were drawn. We estimate the model parameters from the contaminated sample data using ordinary least squares under the proposed method, and using two high breakdown point estimators, the Least Median of Squares and the Least Trimmed Squares. Our findings show that the proposed method did not perform well at the low contamination level ( $\leqslant$ 30%) but outperformed Least Median of Squares and Least Trimmed Squares at higher contamination rates ( $\geqslant$ 40%). This indicates that our proposed method compares well and competes favorably with the two high breakdown point estimators.

Keywords

Stratification, contamination, high breakdown point, ordinary least squares

1. Introduction

Studies of many real-life scenarios contain measurements that differ from the rest of the data; these are often treated as measurement errors. Some of these inconsistent measurements, however, are genuinely recorded observations whose characteristics differ from the rest of the data, and we regard such measurements as belonging to another population. The presence of such inconsistent observations, called outliers, affects many statistical procedures and the validity of their results, especially when the underlying distribution is assumed to be normal. There are many definitions of an outlier (Barnett & Lewis, 1995; Todorov et al., 2011; Ramsey & Ramsey, 2007); here, outliers are genuine measurements (not necessarily errors) that are inconsistent with most of the data and do not behave in a similar manner.

Outliers can occur for reasons such as errors in data transmission or transcription, data coding and entry, ineffective survey design and processes, defective equipment, or irregular spikes or dropouts in time-series data. They can also occur through natural causes, as certain individuals may exhibit characteristics well above or below the average. For example, some individuals may die earlier or live longer than most of their contemporaries in a given community (Fisher & Waclawski, 2009). Such extreme natural events are uncommon, but because they are correctly measured they are representative of nature. Non-representative outliers must be detected and rectified during the data editing process. Outliers are more likely to occur in datasets with multiple measurements and/or variables, and a quick visual inspection does not make them easy to detect (Hubert et al., 2008). Outliers introduce masking, which obscures real effects. Even a small cluster of observations may inflate the empirical covariance matrix and introduce a non-existent effect called swamping, which makes some observations look like outliers when the true outliers pull the empirical covariance matrix away from the non-outliers (van der Linde & Houle, 2006; Wilcox, 2011; Elliot & Stettler, 2006).

When the normality assumption is violated by the presence of outliers, the classical multivariate parameter estimation methods do not work well, because they are based on the empirical means, covariance and correlation matrices, and least squares fitting, all of which can be strongly influenced by only a few outliers, making them unstable (Croux et al., 1994). When addressing outlier problems, it is sometimes tempting to replace extreme values with plausible ones assumed by the researcher, or to remove them outright from the data set; however, extreme values should be removed only if there are sufficient technical reasons for their aberrant behavior (Burke, 1988). According to Béguin and Hulliger (2008), statistical criteria alone are not adequate grounds for excluding extreme values from a data set. For example, in fraud detection, network intrusion detection, and clinical disease diagnosis, outliers carry a signal that matters to many organizations and processes, such as financial institutions (Murugavel & Punithavalli, 2011). The rest of the paper is organized as follows: robust estimation is addressed in Section 2, and the breakdown point and high breakdown point estimators in Section 3. The partitioning algorithm is discussed in Section 4, the simulation procedures in Section 5, parameter estimation in Section 6, the results in Section 7, and the findings in Section 8; the paper ends with concluding remarks in Section 9.

2. Robust estimation

Classical parameter estimators rely heavily on assumptions that are difficult to satisfy in practice. For a given multivariate data set $X_{n\times p}$ whose $i^{\text{th}}$ observation is denoted by $x_{i}={(x_{i1},\ldots,x_{ip})}^{\prime}$ , the classical measures of location and scatter are the sample mean $\bar{x}=\frac{1}{n}\sum^{n}_{i=1}{x_{i}}$ and the sample covariance matrix $S^{2}=\frac{1}{(n-1)}\sum^{n}_{i=1}{(x_{i}-\bar{x}){(x_{i}-\bar{x})}^{\prime}}$ .

Such measures are seriously affected by the presence of even a small fraction of outliers in the data set (Franklin et al., 2000), which makes the classical estimators unreliable for most real-life data sets. Given the increasingly complex nature of many real-life situations, particularly in surveys, efficient outlier detection and robust estimation methods are required to handle inconsistent measurements. Ignoring or improperly handling them can yield a fitted model that represents neither the bulk of the data nor the outliers (Hubert et al., 2008; Davey & Savla, 2010; Rousseeuw & Hubert, 2018; Ueda, 2009). An estimator or statistical technique is robust if it provides useful information even when some of the assumptions used to justify the estimation process do not hold (Fox & Weisberg, 2010). Robust statistical methods aim to find a fit close to the one we would have obtained without the outliers (Verdonck et al., 2011), producing efficient estimates by reducing the effect of outliers while minimizing bias. Robust estimators are commonly used to estimate parameters of different distributions from contaminated data; for example, Aydin et al. (2018) studied robust estimation of the location and scale parameters of the shifted Gompertz distribution. Widely used robust estimators include S-estimators (Rousseeuw & Yohai, 1984), MM-estimators (Yohai, 1987), and the least median of squares (LMS) and least trimmed squares (LTS) estimators (Rousseeuw, 1984), which have a breakdown point of up to 50% but low asymptotic efficiency. Other estimators combine a high breakdown point with a Gaussian asymptotic distribution, such as the generalized S-estimators (Rousseeuw & Yohai, 1984), which achieve efficiencies of up to 33%. The MM-estimators introduced by Yohai (1987) and the T-estimators of Yohai and Zamar (1988) can achieve arbitrarily high efficiency without losing their 50% breakdown point, but pay for it with increased bias. Ahmed et al. (2014) compared the performance of several robust methods in terms of efficiency, bias, and ability to accommodate high contamination levels. Many robust methods can accommodate up to 33% outliers and retain their efficiency, while others can accommodate up to 50% outliers but lose efficiency. In practice, only a few estimators can accommodate up to 50% outliers while retaining their efficiency, and in most cases they give highly biased estimates (Croux et al., 1994). High breakdown point estimators are appropriate only when it is clear that the fitted model is correct, since these estimates do not allow the diagnosis of model misspecification (Cook et al., 1992).

3. Breakdown point

The breakdown point (Hampel, 1971) is a widely used quantitative measure of the robustness of an estimator. According to Hubert and Debruyne (2010) and Rousseeuw and Hubert (2018), the breakdown point is the smallest fraction of observations that must be replaced by arbitrary values to carry the estimate beyond all bounds or, equivalently, the maximum fraction of outliers that can be added to a given sample without spoiling the result.
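As an informal illustration of this definition, the following Python sketch (not part of the original analysis, which was carried out in R) replaces an increasing number $m$ of points in a small sample by one very large value and reports the smallest $m$ at which the sample mean and the sample median are carried beyond a chosen bound. The helper name `smallest_breaking_m`, the toy sample, and the constants `huge` and `bound` are assumptions made only for this example, and replacing points by a single fixed large value is a crude stand-in for the supremum over arbitrary replacements.

```python
import statistics

def smallest_breaking_m(estimator, sample, huge=1e12, bound=1e6):
    """Smallest number m of replaced points that moves the estimate by more than `bound`.

    Hypothetical helper: `huge` mimics an arbitrary replacement value and
    `bound` stands in for "beyond all bounds".
    """
    clean_estimate = estimator(sample)
    for m in range(1, len(sample) + 1):
        corrupted = [huge] * m + list(sample[m:])  # replace the first m points
        if abs(estimator(corrupted) - clean_estimate) > bound:
            return m
    return None

# Toy sample (assumed values, for illustration only)
sample = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.0, 1.7, 2.6]
n = len(sample)

m_mean = smallest_breaking_m(statistics.mean, sample)
m_median = smallest_breaking_m(statistics.median, sample)

print("mean breaks down at fraction", m_mean / n)
print("median breaks down at fraction", m_median / n)
```

In this toy example a single replaced point is enough to break the mean, whereas half of the points must be replaced before the median breaks, which is the sense in which the median is a high breakdown point estimator of location.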

Donoho and Huber (1983) propose a finite-sample breakdown point for a multivariate location estimator as follows. Consider a sample of $n$ data points, $S_{n}=(x_{1},x_{2},\ldots,x_{n})$ , also written $X_{n}$ , and let $X_{n,m}$ denote the collection obtained by replacing any $m$ of its data points $x_{i_{1}},\ldots,x_{i_{m}}$ with arbitrary values. Let $T_{n}$ denote the regression estimator, so that $T_{n}(S_{n})$ is the estimate of the regression coefficients. The breakdown point of the estimator at $S_{n}$ is defined by

$\displaystyle\epsilon^{*}_{n}(T_{n};S_{n})=\frac{1}{n}\min\left\{m\in\{1,2,\ldots,n\}:\sup_{X_{n,m}}\left\|T_{n}(X_{n})-T_{n}(X_{n,m})\right\|=+\infty\right\}$ (1)

and for a multivariate estimator of scatter we have

$\displaystyle\epsilon^{*}_{n}(C_{n};X_{n})=\frac{1}{n}\min\left\{m\in\{1,2,\ldots,n\}:\sup_{X_{n,m}}\max_{i}\left|\log({\lambda}_{i}(C_{n}(X_{n})))-\log({\lambda}_{i}(C_{n}(X_{n,m})))\right|=+\infty\right\}$ (2)

where $0<{\lambda}_{p}(C_{n})\leqslant\ldots\leqslant{\lambda}_{1}(C_{n})$ are the eigenvalues of $C_{n}$ ; that is, the scatter estimator breaks down if any of its eigenvalues can become arbitrarily large or arbitrarily close to zero. High breakdown point estimators are robust procedures that can handle highly contaminated data; the breakdown point describes the ability of an estimator to withstand a given degree of data contamination (Hampel, 1971). Two high breakdown point estimators are considered in this analysis. The first is the Least Median of Squares (LMS) estimator (Rousseeuw, 1984), based on an idea expressed by Hampel (1975), which minimizes the median of the squared residuals,

$\displaystyle\operatorname{med}\{[Y_{i}-x^{\prime}_{i}t]^{2}\},\quad t\in{\mathbb{R}}^{p}\text{ and }1\leqslant i\leqslant n$ (3)

Rousseeuw (1984) showed that the estimator has a 50% breakdown point; it estimates $\beta$ consistently, but converges only at the rate $n^{-1/3}$ , which limits its efficiency. In addition, Rousseeuw (1985) suggested the least trimmed squares (LTS) estimator, a robust alternative to the standard least squares estimator, given by

$\displaystyle\min_{t\in\mathbb{R}^{p}}\sum^{h_{n}}_{i=1}{(r^{2})}_{i:n},\quad\text{with }r_{i}=Y_{i}-x^{\prime}_{i}t$ (4)

where ${(r^{2})}_{1:n}\leqslant{(r^{2})}_{2:n}\leqslant\ldots\leqslant{(r^{2})}_{n:n}$ are the ordered squared residuals (the residuals are first squared and then ordered). This criterion does not count the largest squared residuals, thereby allowing the LTS fit to steer clear of outliers. When $h_{n}=[n/2]$ , LTS, which is a $\sqrt{n}$ -consistent estimator, locates the half of the observations that has the smallest estimated variance, and in that case it has a breakdown point of 50%. The LTS is insensitive to corruption by outliers provided the outliers make up less than 50% of the data (Mount et al., 2014). The LMS estimator, on the other hand, minimizes the median squared residual; the advantage of LTS over LMS is that it is statistically more efficient.
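The criteria in Eqs. (3) and (4) can be evaluated directly for any candidate coefficient vector. The Python sketch below does so for a simple one-predictor line and compares the line that generated the clean data with the ordinary least squares line fitted to the contaminated data. It is only an illustration under assumed toy data; the function names (`lms_objective`, `lts_objective`, `ols_line`) are hypothetical, and the sketch evaluates the objectives rather than minimizing them, which in practice requires a search such as the ones implemented in robust regression software.

```python
import statistics

def squared_residuals(xs, ys, intercept, slope):
    """Squared residuals of the candidate line y = intercept + slope * x."""
    return [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]

def lms_objective(xs, ys, intercept, slope):
    """LMS criterion: median of the squared residuals."""
    return statistics.median(squared_residuals(xs, ys, intercept, slope))

def lts_objective(xs, ys, intercept, slope, h):
    """LTS criterion: sum of the h smallest squared residuals."""
    return sum(sorted(squared_residuals(xs, ys, intercept, slope))[:h])

def ols_line(xs, ys):
    """Ordinary least squares intercept and slope for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Toy data (assumed): clean line y = 1 + 2x, then 30% of responses shifted upwards
xs = list(range(10))
ys = [1 + 2 * x for x in xs]
for i in (7, 8, 9):
    ys[i] += 100

h = len(xs) // 2  # h = [n/2], as in the text
ols_intercept, ols_slope = ols_line(xs, ys)

true_lms = lms_objective(xs, ys, 1, 2)
true_lts = lts_objective(xs, ys, 1, 2, h)
ols_lms = lms_objective(xs, ys, ols_intercept, ols_slope)
ols_lts = lts_objective(xs, ys, ols_intercept, ols_slope, h)

print("generating line: LMS =", true_lms, " LTS =", true_lts)
print("OLS line:        LMS =", ols_lms, " LTS =", ols_lts)
```

Both criteria are zero at the generating line, because the contaminated responses fall among the discarded residuals, while the OLS line is pulled towards the shifted points and scores worse on both.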

4. Partitioning algorithms

Our proposed approach assumes that a high concentration of contaminants divides the population into $k$ mutually exclusive strata or classes. We consider three levels of contamination, 30%, 40%, and 50%; the partitions may therefore have the structure $(\gamma\%\leqslant[50\text{ to }70]\%\leqslant\delta\%)$ , in which a proportion $\gamma$ of outliers lies below the minimum and a proportion $\delta$ lies above the maximum of the original data set. The contamination percentages $\gamma$ and $\delta$ satisfy $\gamma+\delta\leqslant 50$ percent. In this work, we assume that the contaminants divide the data into two non-overlapping groups $(k=2)$ . We limit our method to one-directional contamination, which splits the data into two mutually exclusive subpopulations of sizes $N_{1}$ and $N_{2}$ , where $N_{1}+N_{2}=N$ is the size of the population being stratified.

5. Simulation

Simulations and parameter estimation were performed in R (R Core Team, 2018). We simulate a population of 10,000 units of multivariate data (one dependent and four explanatory variables) and draw random samples of sizes $n_{1}=50$ , ${n}_{2}=200$ and $n_{3}=500$ from the population. The response variable was contaminated at the 30%, 40% and 50% levels, yielding 27 scenarios (three sample sizes, three contamination levels and three estimators). A random seed was set for all simulation and sampling scenarios to ensure reproducibility of results, and we flag the outliers for identification and comparison purposes. The contaminated samples were analysed using two high breakdown point estimators, the least median of squares (LMS) and the least trimmed squares (LTS). For the proposed method (Nmeth), we first contaminate the 10,000 simulated units at the 30%, 40%, and 50% levels. We assume that the high level of contamination splits the population into two distinct sub-populations of sizes $N_{1}$ and $N_{2}$ , with stratum population shares of 70/30%, 60/40% or 50/50% respectively. The first stratum contains the original measurements, which are less than or equal to the maximum of the pre-contamination simulated data. The second stratum contains all the contaminated units, which are larger than the maximum of the original data. Stratification is used, subject to certain precision constraints, to minimize cost or sampling variability (Barcaroli, 2014). Khan and Wesołowski (2019) considered the problem of allocating samples in two and three stages with a view to an allocation that is both multi-domain and population efficient. Their approach minimizes the relative variances in all domains (controlled by given priority weights) as well as the overall relative variance under a constraint on the total (expected) cost. Such an approach allows for solutions in multi-stage stratified simple random sampling without replacement (srswor) schemes, which are direct generalizations of the Neyman-type allocation.
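The contamination and stratification step described above can be sketched as follows. This is a minimal Python illustration, not the authors' R code: the normal population, the uniform upward shift used to contaminate responses, the seeds, and the function name `contaminate_and_stratify` are all assumptions, since the text does not specify the simulation distributions. The only features taken from the text are the population size of 10,000, the contamination levels, and the rule that stratum 2 holds the values above the pre-contamination maximum.

```python
import random

def contaminate_and_stratify(y, rate_pct, seed=2018):
    """Shift rate_pct% of the responses above the original maximum, then split.

    Hypothetical helper: the shift mechanism is an assumption made for
    illustration, not the contamination model used in the paper.
    """
    rng = random.Random(seed)
    n_units = len(y)
    limit = max(y)                      # pre-contamination maximum
    spread = limit - min(y)
    n_bad = n_units * rate_pct // 100   # number of contaminated units
    bad = set(rng.sample(range(n_units), n_bad))
    y_cont = [limit + 1.0 + rng.random() * spread if i in bad else v
              for i, v in enumerate(y)]
    stratum1 = [v for v in y_cont if v <= limit]  # original measurements
    stratum2 = [v for v in y_cont if v > limit]   # contaminated units
    return y_cont, stratum1, stratum2, limit

# Assumed population of responses: 10,000 draws from N(100, 15^2)
pop_rng = random.Random(1)
population = [pop_rng.gauss(100.0, 15.0) for _ in range(10_000)]

sizes = {}
for pct in (30, 40, 50):
    _, s1, s2, limit = contaminate_and_stratify(population, pct)
    sizes[pct] = (len(s1), len(s2))

print(sizes)
```

Under one-directional contamination of this kind, the stratum sizes follow the contamination level directly, giving the 70/30, 60/40 and 50/50 splits mentioned in the text.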

We use simple random sampling without replacement (srswor) within each stratum under Neyman allocation to obtain samples of sizes ${n}_{1}$ and ${n}_{2}$ , where ${n}_{1}+{n}_{2}=n$ . Our aim is to optimize the allocation by minimizing the relative variances in each stratum while taking into account the sampling weights. The proposed method is easy to implement and can be used with any sample that can be stratified into two or more non-overlapping groups, each containing at least two measurements. Table 1 shows the distribution of sample sizes drawn from the stratified population. Having selected the samples under the various contamination levels, we use ordinary least squares (OLS) to fit the regression model to the data. Our assumption here is that the sample is optimally constructed to ensure that OLS works effectively.
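For reference, the textbook Neyman allocation assigns stratum $h$ a sample size proportional to $N_{h}S_{h}$ , where $N_{h}$ is the stratum size and $S_{h}$ the stratum standard deviation. The Python sketch below implements that rule with largest-remainder rounding and then draws an srswor sample within each stratum. The function names, the rounding rule, and the standard deviations used in the example are assumptions for illustration; the stratum standard deviations of the simulated population are not reported in the text, so the output is not meant to reproduce Table 1.

```python
import random

def neyman_allocation(n, stratum_sizes, stratum_sds):
    """Allocate n sample units across strata in proportion to N_h * S_h."""
    weights = [N_h * S_h for N_h, S_h in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    exact = [n * w / total for w in weights]
    alloc = [int(e) for e in exact]      # round down first
    remaining = n - sum(alloc)           # units still to hand out
    # give the leftover units to the strata with the largest fractional parts
    order = sorted(range(len(exact)), key=lambda h: exact[h] - alloc[h], reverse=True)
    for h in order[:remaining]:
        alloc[h] += 1
    return alloc

def draw_srswor(strata, allocation, seed=2018):
    """Simple random sample without replacement within each stratum."""
    rng = random.Random(seed)
    return [rng.sample(units, n_h) for units, n_h in zip(strata, allocation)]

# Hypothetical inputs: a 70/30 split of 10,000 units with assumed standard deviations
stratum_sizes = [7000, 3000]
stratum_sds = [2.0, 5.0]

allocations = {n: neyman_allocation(n, stratum_sizes, stratum_sds) for n in (50, 200, 500)}
print(allocations)

# Draw the n = 50 sample from placeholder unit labels
strata = [list(range(7000)), list(range(7000, 10_000))]
samples = draw_srswor(strata, allocations[50])
print([len(s) for s in samples])
```

The more variable stratum receives a larger share of the sample than its population share alone would give, which is the behaviour visible in Table 1, where stratum 2 is sampled more heavily than stratum 1.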

Table 1
Stratum sizes under Neyman allocation with SRSWOR

Contamination level
	30%		40%		50%
Sample	Stratum 1	Stratum 2	Stratum 1	Stratum 2	Stratum 1	Stratum 2
50	19	31	16	34	14	36
200	76	124	64	136	54	146
500	191	309	159	341	136	364

Table 2
Parameter estimates under LMS, LTS and Nmeth

Contamination
		30%			40%			50%
		Estimators			Estimators			Estimators
Sample	Coef	LMS	LTS	Nmeth	LMS	LTS	Nmeth	LMS	LTS	Nmeth
50	$\beta_{0}$	$-$ 7.4673	18.42169	1414.267	169.816	6.1141	$-$ 433.169	21.1107	$-$ 153.412	$-$ 2286.16
	$\beta_{1}$	$-$ 0.2631	0.66011	11.949	$-$ 0.578	0.4982	16.163	0.9784	0.5218	5.483
	$\beta_{2}$	1.3702	0.09753	3.595	1.281	$-$ 0.2996	$-$ 8.474	1.2843	$-$ 1.1307	4.989
	$\beta_{3}$	6.0116	4.16713	3.630	6.046	4.132	25.027	2.9942	4.9366	25.123
	$\beta_{4}$	0.1679	$-$ 0.15708	$-$ 27.756	$-$ 3.231	0.3722	5.774	$-$ 0.3686	4.0705	44.692
	AIC	701.7	702.7	704.7168	717.3	715	714.1782	733.8	730.9	702.4988
200	$\beta_{0}$	$-$ 222.526	$-$ 49.3399	649.08	7.0679	$-$ 33.3302	344.092	$-$ 204.535	5.0837	$-$ 1148.39
	$\beta_{1}$	1.6841	1.1072	5.195	1.1004	1.0933	3.027	1.1987	0.8528	5.7525
	$\beta_{2}$	0.4495	0.9778	3.979	2.0237	0.6415	6.506	$-$ 0.1877	0.6680	3.7458
	$\beta_{3}$	1.9736	2.178	8.166	1.2396	2.367	8.399	1.1767	2.0862	$-$ 0.7235
	$\beta_{4}$	4.4696	1.0821	$-$ 11.873	$-$ 0.2328	0.8982	$-$ 5.930	4.4478	0.3644	24.2377
	AIC	2776.1	2774.4	2830.167	2852.2	2852.9	2803.784	2911.1	2897.8	2815.43
500	$\beta_{0}$	64.0042	$-$ 61.8821	$-$ 131.981	145.0192	$-$ 75.1913	259.555	45.558	$-$ 47.8165	$-$ 522.795
	$\beta_{1}$	0.8308	0.9829	5.033	0.8656	0.9743	5.046	0.7855	0.9772	4.691
	$\beta_{2}$	1.5989	1.0635	4.812	0.9318	0.9718	4.228	1.7017	0.8187	4.959
	$\beta_{3}$	2.2509	2.1239	10.471	0.6643	1.7989	13.093	0.3170	1.5254	2.121
	$\beta_{4}$	$-$ 1.3791	1.3202	3.512	$-$ 2.6713	1.6748	$-$ 4.917	$-$ 0.7942	1.244	11.447
	AIC	6968.8	6963.8	7070.05	7151.8	7138.5	7064.944	7296.5	7284.7	6947.975

6. Linear regression model parameters

To examine the breakdown point, we use a linear regression model with an arbitrary error distribution and estimate its parameters under the LMS and LTS of Rousseeuw and Yohai (1984). The regression model is

$\displaystyle y_{i}=x^{\prime}_{i}\beta+\varepsilon_{i},\text{ for }i=1,\ldots,n$ (5)

where $y_{i}$ is the observed response, $x_{i}$ is a $p\times 1$ vector of explanatory variables and $\beta$ is a $p\times 1$ vector of unknown parameters. Classically, the errors $\varepsilon_{i},i=1,\ldots,n$ are assumed to be independent and identically distributed as $N(0,\sigma^{2})$ for some $\sigma^{2}>0$ . The class of regression estimators is defined as

$\displaystyle\hat{\beta}=\mathop{\operatorname{arg\,min}}\limits_{\beta}\sum^{k}_{i=1}{w_{i}\rho({|r(\beta)|}_{(i)})}$ (6)

where ${|r(\beta)|}_{(1)}\leqslant{|r(\beta)|}_{(2)}\leqslant\ldots\leqslant{|r(\beta)|}_{(n)}$ are the ordered absolute values of the regression residuals $r_{i}=y_{i}-x^{\prime}_{i}\beta$ , $w_{i}$ are the weights, and $\rho$ is a strictly increasing continuous function such that $\rho(0)=0$ . The breakdown point of the regression estimators in Eq. (6) is equal to $(n-k)/n$ if $w_{i}\geqslant 0$ for $i=1,\ldots,n$ , $w_{k}>0$ for $k=\max\{i:w_{i}>0\}$ , the index $k$ lies within the bounds $\frac{(n+p+1)}{2}\leqslant k\leqslant n-p-1$ , $n\geqslant 3(p+1)$ , and the data points $x_{i}\in R^{p}$ are in general position.
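To make the bounds on $k$ concrete, the short Python sketch below computes the admissible range of $k$ and the corresponding breakdown values $(n-k)/n$ for the three sample sizes used in the simulation. Taking $p=5$ (an intercept plus four explanatory variables) is an assumption about how the parameters are counted, and the function name `breakdown_range` is hypothetical.

```python
import math

def breakdown_range(n, p):
    """Admissible k and breakdown values (n - k) / n under the stated conditions."""
    if n < 3 * (p + 1):
        raise ValueError("the result requires n >= 3(p + 1)")
    k_min = math.ceil((n + p + 1) / 2)   # smallest admissible k
    k_max = n - p - 1                    # largest admissible k
    return k_min, k_max, (n - k_min) / n, (n - k_max) / n

# p = 5 is an assumption: intercept plus four explanatory variables
for n in (50, 200, 500):
    k_min, k_max, bp_max, bp_min = breakdown_range(n, p=5)
    print(n, k_min, k_max, bp_max, bp_min)
```

Under this reading, the largest attainable breakdown value is 0.44 for $n=50$ and approaches 0.5 as $n$ grows (0.485 for $n=200$ and 0.494 for $n=500$).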

7. Simulation results

In this section, we present the results from all the simulation scenarios used to assess the efficiency of LMS, LTS and Nmeth; they are reported in Table 2. We fit regression lines to the data using all three techniques and calculate the Akaike Information Criterion (AIC) for each simulation scenario in order to see which model fits best. We use the linear regression model to assess how well each method resists outliers.
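The text does not state how the AIC was computed for each fit. One common convention for a Gaussian linear model, which to our understanding is also the one R applies to least squares fits, is $\text{AIC}=n\left[\log(2\pi)+\log(\text{RSS}/n)+1\right]+2(p+1)$ , where RSS is the residual sum of squares and the extra parameter accounts for the error variance. The Python sketch below implements that formula; the function name `gaussian_aic` and the toy residuals are assumptions, and the convention itself should be treated as an assumption about the paper's calculation.

```python
import math

def gaussian_aic(residuals, n_coef):
    """AIC of a Gaussian linear model from its residuals.

    Assumed convention: maximized Gaussian log-likelihood with the variance
    estimated by RSS / n, and n_coef + 1 estimated parameters.
    """
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    return -2 * log_lik + 2 * (n_coef + 1)

# Toy residuals (assumed) from a model with a single coefficient
aic_value = gaussian_aic([1.0, -1.0, 1.0, -1.0], n_coef=1)
print(aic_value)
```

Because the AIC grows with $n\log(\text{RSS}/n)$ , values computed on samples of different sizes are not comparable; comparisons are meaningful only between estimators fitted to data of the same size, which is how Table 2 is read in the discussion.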

8. Discussion

The results in Table 2 show the efficiency of the two high breakdown point estimators (LMS and LTS) and of our proposed method (Nmeth) at different contamination rates and sample sizes. We compare the AIC values of the estimators across the scenarios. For a sample size of 50 units, the AIC values of LMS and LTS at the 30%, 40% and 50% contamination rates are not substantially different from each other; they differ only by small amounts.

For a sample size of 50 units, our proposed method did not perform well at 30% contamination, since $\text{AIC}_{\text{Nmeth}}=$ 704.7168 $>\text{AIC}_{\text{LMS}}=$ 701.7 and $\text{AIC}_{\text{LTS}}=$ 702.712. At the 40% contamination level, however, $\text{AIC}_{\text{Nmeth}}=$ 714.1782 $<\text{AIC}_{\text{LMS}}=$ 717.3 and $\text{AIC}_{\text{LTS}}=$ 715, and at the 50% contamination level $\text{AIC}_{\text{Nmeth}}=$ 702.4988 $<\text{AIC}_{\text{LMS}}=$ 733.8 and $\text{AIC}_{\text{LTS}}=$ 730.9. Our proposed approach therefore fits the data better than both LMS and LTS at contamination levels of 40% and 50%.

For a sample size of 200 units, LMS and LTS performed better than our proposed approach at 30% contamination, although the performance of LMS and LTS did not differ substantially from each other. At the higher levels of contamination, that is, at the 40% level, where $\text{AIC}_{\text{Nmeth}}=$ 2803.784 $<\text{AIC}_{\text{LMS}}=$ 2852.2 and $\text{AIC}_{\text{LTS}}=$ 2852.9, and at the 50% level, where $\text{AIC}_{\text{Nmeth}}=$ 2815.43 $<\text{AIC}_{\text{LMS}}=$ 2911.1 and $\text{AIC}_{\text{LTS}}=$ 2897.8, our method performed better than the two high breakdown point estimators, which again performed similarly to each other.

The trend for the sample size of 500 units is the same. Our method performed worse than the two other methods at 30% contamination, where $\text{AIC}_{\text{Nmeth}}=$ 7070.05 $>\text{AIC}_{\text{LMS}}=$ 6968.8 and $\text{AIC}_{\text{LTS}}=$ 6963.8. However, at the 40% contamination level, $\text{AIC}_{\text{Nmeth}}=$ 7064.944 $<\text{AIC}_{\text{LMS}}=$ 7151.8 and $\text{AIC}_{\text{LTS}}=$ 7138.5. Similarly, at the 50% contamination level, $\text{AIC}_{\text{Nmeth}}=$ 6947.975 $<\text{AIC}_{\text{LMS}}=$ 7296.5 and $\text{AIC}_{\text{LTS}}=$ 7284.7, indicating a better performance by the proposed method than by the two high breakdown point estimators.
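The comparison above can be summarized mechanically. The Python sketch below stores the AIC values reported in Table 2 and selects, for each sample size and contamination level, the estimator with the smallest AIC. The data structure and variable names are illustrative choices; the numbers are copied from Table 2.

```python
# AIC values from Table 2, keyed by (sample size, contamination level in %)
aic_table = {
    (50, 30): {"LMS": 701.7, "LTS": 702.7, "Nmeth": 704.7168},
    (50, 40): {"LMS": 717.3, "LTS": 715.0, "Nmeth": 714.1782},
    (50, 50): {"LMS": 733.8, "LTS": 730.9, "Nmeth": 702.4988},
    (200, 30): {"LMS": 2776.1, "LTS": 2774.4, "Nmeth": 2830.167},
    (200, 40): {"LMS": 2852.2, "LTS": 2852.9, "Nmeth": 2803.784},
    (200, 50): {"LMS": 2911.1, "LTS": 2897.8, "Nmeth": 2815.43},
    (500, 30): {"LMS": 6968.8, "LTS": 6963.8, "Nmeth": 7070.05},
    (500, 40): {"LMS": 7151.8, "LTS": 7138.5, "Nmeth": 7064.944},
    (500, 50): {"LMS": 7296.5, "LTS": 7284.7, "Nmeth": 6947.975},
}

# Estimator with the smallest AIC in each scenario
best = {scenario: min(values, key=values.get) for scenario, values in aic_table.items()}

for (size, level), name in sorted(best.items()):
    print(f"n = {size:3d}, contamination = {level}%: smallest AIC from {name}")
```

At 30% contamination the smallest AIC belongs to LMS or LTS for every sample size, while at 40% and 50% it belongs to Nmeth in all six scenarios.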

9. Concluding remarks

In this paper, we present an alternative to high breakdown point estimators for handling highly contaminated data sets, based on stratified sampling with Neyman allocation, which takes into account the variance within each stratum and the homogeneity of the measurements for optimum estimation. The study aimed to explore a new, easy-to-use, accurate, and robust method for handling highly contaminated data sets, as an alternative to high breakdown point estimators, that can effectively accommodate up to 50 per cent outliers. The AIC values for sample sizes 50, 200, and 500 indicate that LMS and LTS performed better than Nmeth when fitting a regression line to the 30% contaminated data, and that LMS and LTS performed similarly to each other. At the 40% and 50% contamination rates, Nmeth performed better than LMS and LTS, producing smaller AIC values for sample sizes of 50, 200 and 500 units. It is worth noting that LMS and LTS yield similar AIC values across all combinations of sample size and contamination level. The analysis shows that our proposed approach (Nmeth) worked well compared with LMS and LTS at contamination rates of 40% and 50%. This suggests that, besides the two high breakdown point estimators (LMS and LTS), Nmeth may be used as an alternative for handling highly contaminated data. We evaluated the performance of our proposed method using only simulated data, but we expect it to perform effectively in real-life scenarios and produce comparable outcomes. The accuracy and effectiveness of the proposed approach may be of interest to the statistical analysis community at large.

References

Ahmed, A. S., Moustafa, George, & Kibria, B. M. G. (2014). A comparison of some robust bivariate control charts for individual observations. International Journal for Quality Research, 8(2), 183-196. http://digitalcommons.fiu.edu/math.fac/12.

Aydin, Akgul, F. G., & Senoglu (2018). Robust estimation of the location and the scale parameters of shifted Gompertz distribution. Electronic Journal of Applied Statistical Analysis, 11(1), 92-107. doi: 10.1285/i20705948v11n1p92.

Barcaroli (2014). SamplingStrata: An R package for the optimization of stratified sampling. Journal of Statistical Software, 61(4). http://www.jstatsoft.org/.

Barnett & Lewis (1995). Outliers in statistical data. 3rd Edition. John Wiley & Sons. https://doi.org/10.1002/bimj.4710370219.

Béguin & Hulliger (2008). The BACON-EEM algorithm for multivariate outlier detection in incomplete survey data. Survey Methodology, 34(1), 91-103.

Burke (1988). Missing values, outliers, robust statistics, and non-parametric methods. LC•GC Europe online supplement. Scientific Data Management, 2(2), 19-24.

Cook, R. D., Hawkins, D. M., & Weisberg (1992). Comparison of model misspecification diagnostics using residuals from least mean of squares and least median of squares fits. Journal of the American Statistical Association, 87(418), 419-424.

Croux, Rousseeuw, P. J., & Hossjer (1994). Generalized S-estimators. Journal of the American Statistical Association, 89(428), 1271-1281.

Davey & Savla (2010). Statistical power analysis with missing data: A structural equation modeling approach. NY: Routledge. 47-65.

Donoho, D. L., & Huber, P. J. (1983). The notion of breakdown-point. In Bickel, Doksum, & Hodges (Eds.), A Festschrift for Erich L. Lehmann. Wadsworth, Belmont, CA.

Elliot, R. M., & Stettler (2006). Using a mixture model for multiple imputations in the presence of outliers: The healthy for life project. Applied Statistics, 56(1), 63-78.

Fisher & Waclawski (2009). A survey of techniques for identifying and handling outliers and missing values in time series data. 29th International Symposium on Forecasting, Hong Kong. ww.forecasters.org/isf.

Franklin, Brodeur, & Thomas (2000). Robust multivariate outlier detection using Mahalanobis’ distance and Stahel-Donoho estimators. In ICES-II, International Conference on Establishment Surveys-II.

Hampel, F. R. (1971). A general qualitative definition of robustness. Annals of Mathematical Statistics, 42, 1887-1896.

Hampel, F. R. (1975). Beyond location parameters: Robust concepts and methods. Bulletin of the International Statistical Institute, 46, 375-382.

Hubert & Debruyne (2010). Minimum covariance determinant. Wiley Interdisciplinary Reviews: Computational Statistics, 2, 36-43.

Hubert, Rousseeuw, P. J., & Aelst, S. V. (2008). High-breakdown robust multivariate methods. Statistical Science, 23(1), 92-119. doi: 10.1214/088342307000000087.

Khan, M. G. M., & Wesołowski (2019). Neyman-type sample allocation for domains-efficient estimation in multistage sampling. AStA Advances in Statistical Analysis, 103, 563-592. doi: 10.1007/s10182-018-00340-2.

Mount, D. N., Netanyahu, N. S., Piatko, D. C., Silverman, & Wu, A. Y. (2014). On the least trimmed squares estimator. Algorithmica, 69, 148-183. doi: 10.1007/s00453-012-9721-8.

Murugavel & Punithavalli (2011). Improved hybrid clustering and distance-based technique for outlier removal. International Journal on Computer Science and Engineering (IJCSE), 3(1), 333-339.

Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the American Statistical Association, 79, 871-880.

Rousseeuw, P. J., & Hubert (2018). Anomaly detection by robust statistics. WIREs Data Mining and Knowledge Discovery, 8, 1-14. doi: 10.1002/widm.1236.

Rousseeuw, P. J., & Yohai (1984). Robust regression by means of S-estimators. In Franke, Hardle, & Martin, R. D. (Eds.), Robust and Nonlinear Time Series Analysis, 256-275. Lecture Notes in Statistics, 26. Springer-Verlag, New York.

R Core Team (2018). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. Available at: https://www.R-project.org/.

Ramsey, P. H., & Ramsey, P. P. (2007). Optimal trimming and outlier elimination. Journal of Modern Applied Statistical Methods, 6(2), 355-360.

Rousseeuw, P. J. (1985). Multivariate estimation with a high breakdown point. In Grossmann, Pflug, Vincze, & Wertz (Eds.), Mathematical Statistics and Application, Vol B, 283-297. Reidel, Dordrecht.

Todorov, Templ, & Filzmoser (2011). Detection of multivariate outliers in business survey data with incomplete information. Advances in Data Analysis and Classification, 5, 37-56.

Ueda (2009). A simple method for the detection of outliers. Electronic Journal of Applied Statistical Analysis, 2(1), 67-76.

van der Linde & Houle (2006). Applied usage of the minimum-volume ellipsoid. Unpublished Manuscript, Department of Biological Science, Florida State University, Tallahassee, FL.

Verdonck, Hubert, & Rousseeuw (2011). Robust covariance estimation for financial applications. EURANDOM-ISI Workshop on Actuarial and Financial Statistics, Eindhoven, 29-30.

Wilcox (2011). Introduction to robust estimation and hypothesis testing (3rd ed.). San Diego: Academic Press. 227-228.

Yohai, V. Y. (1987). High breakdown-point and high-efficiency robust estimators for regression. Annals of Statistics, 15, 642-656.

Yohai, V. Y., & Zamar, R. H. (1988). High breakdown-point estimates of regression by means of the minimum of an efficient scale. Journal of the American Statistical Association, 83, 406-413.
