Abstract
Observations in some real-life data sets include units that are inconsistent with the rest of the data. Such extreme values influence the estimates obtained by conventional estimators, so robust estimators are necessary for efficient estimation of parameters. This paper uses stratification with simple random sampling without replacement to optimize sample allocation within strata for efficient parameter estimation, as an alternative method of handling highly contaminated samples. Our proposed method stratifies the highly contaminated population into two non-overlapping sub-populations, from which stratified samples of sizes 50, 200, and 500 were drawn. We estimate the model parameters from the contaminated sample data using ordinary least squares under the proposed method, and using two high breakdown point estimators: the Least Median of Squares and the Least Trimmed Squares. Our findings show that the proposed method did not perform well for low contamination levels (
Introduction
Studies of many real-life scenarios contain measurements of interest that differ from the rest of the results; these are often called measurement errors. Some of these inconsistent measurements, however, are genuinely measured observations whose characteristics differ from the rest of the data; we regard those measurements as belonging to another population. The presence of such inconsistent observations, called outliers, affects many statistical procedures and the validity of their results, especially when the underlying distribution is assumed to be normal. There are many definitions of outliers (Barnett & Lewis, 1995; Todorov et al., 2011; Ramsy & Ramsy, 2007); here, outliers are genuine measurements (not necessarily errors) that are inconsistent with most of the data and do not behave in a similar way.
Outliers can occur for reasons such as errors in data transmission or transcription, errors in data coding and entry, ineffective survey design and processes, defective equipment, or irregular spikes or dropouts in time-series data. They can also arise from natural causes, because certain individuals show characteristics well above or below the average. For example, some individuals may die earlier or live longer than most of their contemporaries in a given community (Fisher & Waclawski, 2009). Such extreme natural events are uncommon, but they are representative of nature because they are correctly measured. Non-representative outliers must be detected and rectified during the data editing process. Outliers are more likely to occur in datasets with many measurements and/or variables, and a quick visual inspection does not make them easy to detect (Hubert et al., 2008). Outliers can cause masking, which obscures real effects. Even a small cluster of outlying observations may inflate the empirical covariance matrix and cause swamping, a spurious effect in which some regular observations look like outliers because the true outliers pull the empirical covariance matrix away from the non-outliers (van der Linde & Houle, 2006; Wilcox, 2011; Elliot & Stettler, 2006).
When the assumption of normality is violated because of outliers, classical multivariate estimation methods do not work well, because they are based on the empirical means, the covariance and correlation matrices, and least squares fitting, all of which are strongly influenced by even a few outliers, making them unstable (Croux et al., 1994). When dealing with outliers, it is tempting to replace extreme values with plausible ones assumed by the researcher or to remove them outright from the data set; however, extreme values should be removed only if there are sufficient technical reasons for their aberrant behavior (Burke, 1988). According to Béguin and Hulliger (2008), statistical criteria alone are not adequate to exclude extreme values from a data set. For example, in fraud detection, network intrusion detection, and clinical disease diagnosis, outliers carry a signal for many organizations and processes, such as financial institutions (Murugavel & Punithavalli, 2011). The rest of the paper is organized as follows. Section 2 addresses robust estimation and Section 3 addresses estimation at a high breakdown point. Section 4 discusses the partitioning algorithm, Section 5 the simulation procedures, and Section 6 parameter estimation. Section 7 presents the results, Section 8 discusses the findings, and Section 9 gives concluding remarks.
Robust estimation
Classical parameter estimators rely heavily on assumptions that are difficult to satisfy in practice. For a given multivariate data set, the classical mean and covariance measures are
Such measures are seriously affected by the inclusion of even a small fraction of outliers in the data set (Franklin et al., 2000), which makes the classical estimators unreliable for most real-life data sets. Because many real-life situations, particularly surveys, are increasingly complex, efficient outlier detection and robust estimation methods are required to handle inconsistent measurements. Ignoring or improperly handling them can produce a forecasting model that represents neither the bulk of the data nor the outliers (Hubert et al., 2008; Davey & Savla, 2010; Rousseeuw & Hubert, 2018; Udea, 2009). An estimator or statistical technique is robust if it still provides useful information when some of the assumptions used to justify the estimation process do not hold (Fox & Weisberg, 2010). Robust statistical methods aim to find a fit close to the one we would have obtained without outliers (Verdonck et al., 2011), generating efficient estimates by reducing the effect of outliers while minimizing bias. Robust estimators are commonly used to estimate parameters of different distributions from contaminated data; for example, Aydina et al. (2018) studied robust estimation of the scale and location parameters of the shifted Gompertz distribution. Widely used robust estimators include S-estimators (Rousseeuw & Yohai, 1984), MM-estimators (Yohai, 1987), and the least median of squares (LMS) and least trimmed squares (LTS) estimators (Rousseeuw, 1984), which have a breakdown point of up to 50% but low asymptotic efficiency. Other estimators combine a high breakdown point with a Gaussian asymptotic distribution, such as the generalized S-estimators (Rousseeuw & Yohai, 1984), which achieve efficiencies of up to 33%. The MM-estimators introduced by Yohai (1987) and the T-estimators of Yohai and Zamar (1988) can achieve arbitrarily high efficiency without losing their 50% breakdown point, but pay for it with increased bias.
Ahmed et al. (2014) compared the performance of several robust methods with respect to their strengths and weaknesses in terms of efficiency, bias, and ability to accommodate high contamination levels. Many robust methods can accommodate up to 33% outliers and retain their efficiency, while others can accommodate up to 50% outliers but lose efficiency. In practice, only a few estimators can accommodate up to 50% outliers while retaining their efficiency; however, they give highly biased estimates in most cases (Croux et al., 1994). High breakdown point estimators are appropriate only when it is clear that the fitted model is correct, since these estimates do not allow model misspecification to be diagnosed (Cook et al., 1992).
Breakdown point
The breakdown point (Hampel, 1971) is a widely used quantitative measure of the robustness of an estimator. According to Hubert and Debruyne (2010) and Rousseeuw and Hubert (2018), the breakdown point is the smallest fraction of observations that must be replaced by arbitrary values to carry the estimate beyond all bounds or, similarly, the maximum fraction of outliers that can be added to a given sample without spoiling the result.
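The replacement idea behind the breakdown point can be illustrated with a minimal sketch. It is not part of the study: the data values and the helper `replaced_sample` are hypothetical, and the sample mean and median stand in for a non-robust and a robust location estimator.

```python
import statistics

def replaced_sample(clean, m, bad_value):
    """Return a copy of `clean` whose first m values are replaced by bad_value."""
    return [bad_value] * m + clean[m:]

# Hypothetical clean sample of n = 10 measurements (illustration only).
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9]
n = len(clean)

# Replace m observations by increasingly extreme values and watch each estimate.
for m in (0, 1, 4, 5):
    for bad in (1e3, 1e6):
        x = replaced_sample(clean, m, bad)
        print(f"m/n={m / n:.1f} bad={bad:.0e} "
              f"mean={statistics.mean(x):.2f} median={statistics.median(x):.2f}")
```

In this sketch a single replaced value (m/n = 0.1) is enough to carry the mean arbitrarily far, whereas the median stays within the range of the clean values until half of the sample has been replaced.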
Donoho and Huber (1983) propose a finite-sample breakdown point for a multivariate location estimator as follows: the procedure starts with a random sample
and for a multivariate estimator of scatter we have
For
Rousseeuw (1984) showed that the estimator had a 50% breakdown point; it estimates
where
Our proposed approach assumes that the high level of contamination divides the population into
Simulation
Simulations and parameter estimation were performed in R (R Core Team, 2018). We simulate a population of 10,000 units of multivariate data (one dependent and four explanatory variables), and draw random samples of sizes
We use simple random sampling without replacement (SRSWOR) within each stratum under Neyman allocation to obtain samples of size
Stratum sizes under Neyman allocation with SRSWOR
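Neyman allocation followed by SRSWOR within each stratum can be sketched as below. This is an illustrative sketch rather than the code used in the study (which was written in R): it applies the textbook rule that the sample size in stratum h is proportional to N_h S_h, and the stratum sizes, standard deviations, and function names are hypothetical.

```python
import random

def neyman_allocation(n, stratum_sizes, stratum_sds):
    """Split a total sample size n across strata in proportion to N_h * S_h."""
    products = [N_h * S_h for N_h, S_h in zip(stratum_sizes, stratum_sds)]
    total = sum(products)
    if total <= 0:
        raise ValueError("at least one stratum needs positive size and spread")
    exact = [n * p / total for p in products]
    # Round down, then hand the leftover units to the largest fractional parts
    # so that the allocated sizes add up to n exactly.
    alloc = [int(e) for e in exact]
    leftover = n - sum(alloc)
    by_fraction = sorted(range(len(exact)),
                         key=lambda h: exact[h] - alloc[h], reverse=True)
    for h in by_fraction[:leftover]:
        alloc[h] += 1
    return alloc

def srswor(units, size, rng):
    """Simple random sample without replacement of `size` units."""
    return rng.sample(units, size)

# Hypothetical two-stratum population of 10,000 units (values are assumptions).
stratum_sizes = [7000, 3000]
stratum_sds = [1.0, 5.0]
alloc = neyman_allocation(200, stratum_sizes, stratum_sds)
rng = random.Random(1)  # fixed seed so the sketch is reproducible
samples = [srswor(range(N_h), n_h, rng)
           for N_h, n_h in zip(stratum_sizes, alloc)]
print(alloc)  # [64, 136]
```

The sketch does not cap an allocation at the stratum size; with very unequal spreads that check would be needed before drawing without replacement.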
Parameter estimates under LMS, LTS and Nmeth
For the estimation of the breakdown point, we use linear regression with an arbitrary error distribution to estimate parameters under LMS and LTS of Rousseeuw and Yohai (1984). The various models for regression are:
If
where
In this section, we present the results from all simulation scenarios used to test the efficiency of LMS, LTS, and Nmeth under the proposed design (Table 2). We fit regression lines to the data using all three techniques and calculate the Akaike Information Criterion (AIC) for each simulation scenario to see which model works best. We used a linear regression model to assess how resistant each method is to outliers.
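The criteria that LMS and LTS minimize can be sketched as follows. This is a minimal illustration under assumptions, not the estimation code used in the study: it only evaluates the two objective functions for a given coefficient vector (it does not search for the minimizer), the default trimming size h = ⌊(n + p + 1)/2⌋ is one common convention, and the data and function names are hypothetical.

```python
import statistics

def squared_residuals(beta, X, y):
    """Squared residuals of the linear fit y ~ X @ beta (X includes the intercept column)."""
    fitted = [sum(b * x_ij for b, x_ij in zip(beta, x_i)) for x_i in X]
    return [(y_i - f_i) ** 2 for y_i, f_i in zip(y, fitted)]

def lms_criterion(beta, X, y):
    """Least Median of Squares objective: the median squared residual."""
    return statistics.median(squared_residuals(beta, X, y))

def lts_criterion(beta, X, y, h=None):
    """Least Trimmed Squares objective: sum of the h smallest squared residuals."""
    r2 = sorted(squared_residuals(beta, X, y))
    if h is None:
        h = (len(y) + len(beta) + 1) // 2  # assumed default trimming size
    return sum(r2[:h])

# Hypothetical data: y = 2 + 3x for seven points, three outlying responses.
xs = list(range(10))
X = [[1.0, float(x)] for x in xs]
y = [2.0 + 3.0 * x for x in xs[:7]] + [100.0, 100.0, 100.0]

true_beta = [2.0, 3.0]
flat_beta = [statistics.mean(y), 0.0]  # horizontal line through the mean
for name, beta in (("true line", true_beta), ("flat line", flat_beta)):
    print(name, lms_criterion(beta, X, y), lts_criterion(beta, X, y))
```

Because both criteria ignore the largest squared residuals, the line that fits the uncontaminated majority scores zero on both in this sketch, even though 30% of the responses are outliers.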
Discussion
Table 2 shows the efficiency of the two high breakdown point estimators (LMS and LTS) and our proposed method (Nmeth) at different contamination rates and sample sizes. We compare the AIC values of each estimator across the scenarios. With a sample size of 50 units, the AIC values of LMS and LTS at 30%, 40%, and 50% contamination are not substantially different; they remain the same apart from minimal variations.
For a sample size of 50 units, our proposed method with 30% contamination did not perform well since
Concluding remarks
In this paper, we present an alternative to high breakdown point estimators for handling highly contaminated data sets. The method uses stratified sampling with Neyman allocation, taking into account the variance within each stratum and the homogeneity of the measurements for optimum estimation. This study aimed to explore a new, easy-to-use, accurate, and robust alternative to high breakdown point estimators for highly contaminated data sets, one that can effectively accommodate up to 50 per cent outliers. The AIC values for sample sizes 50, 200, and 500 indicate that LMS and LTS performed better than Nmeth when fitting a robust regression line to the 30% contaminated data, while LMS and LTS performed similarly to each other. For 40% and 50% contamination rates, Nmeth performed better than LMS and LTS, generating smaller AIC values at sample sizes of 50, 200, and 500 units. Note that LMS and LTS yield the same results across all combinations of sample size and contamination level. The analysis shows that our proposed approach (Nmeth) worked well compared with LMS and LTS at contamination rates between 40% and 50%. This suggests that, besides the two high breakdown point estimators (LMS and LTS), Nmeth may be used as an alternative for highly contaminated data sets. We evaluated the performance of our proposed method using only simulated data, but we expect it to perform effectively in real-life scenarios and produce comparable outcomes. The accuracy and effectiveness of the proposed approach may be of interest to the statistical analysis community at large.
